Network Gateway Debugging Notes

Way too often after working out a tough problem, I'm just too confused or tired to blog about it. Tonight is like that but I'm writing anyway. I just got my gateway machine routing again and it was disgustingly simple (considering the amount of time it was offline). I'm not going to explain all the details today but basically I have a machine that serves as a gateway between the LAN here in the supersecret headquarters of Late Night PC and the rest of the Internet at large. Asterisk, FreePBX, an iptables firewall, DNS server, and DHCP server all run on this CentOS box. The gateway machine connects between the DSL modem on eth1 and the last router for the LAN on eth0.

I had everything running nice and smooth. I was even working the kinks out of a traffic shaping script so VoIP, online games and big downloads wouldn't interfere with one another. There are a few machines in here and a few users. Sometimes the best part of having so many people at home is all the network problem-solving I get to do. It's a big lab for me to experiment in. And of course I love my family too. I enjoy their company and not just their network traffic.

A month or so ago I was in Toronto visiting my sister and I got a call that the LAN couldn't reach the Internet. Or some of it couldn't. That sucked but it could have waited until I got back. Except that it appeared that my development machine, which can be seen from the Internet, was serving up all kinds of nasties. By which I mean dirty pictures. By which I mean... never mind, I couldn't look at the page long enough to see. I couldn't wrap my head around what was happening. When I tried to SSH in I got no response. The web server signature didn't match mine, but when I asked Candace to do a ps aux, there didn't seem to be another web server running. The giveaway was when she disconnected my machine from the router and the nasties didn't stop. The DNS record for my server was pointing somewhere else. I use dynamic DNS for this one, and all I could guess was that my record hadn't been updated and someone else with an infected machine got my old IP address. Anyhow, Google saw it. I finally removed that stuff from Google a little while ago (and Google responded very quickly).

At the time though, I talked Candace through bypassing the gateway machine. I just undid that tonight, and I have a few random network bits that I'd like to make a note of and share with anyone interested.

First off, I was getting this:

[root@ruby ~]# ping google.com
connect: Network is unreachable

The immediate fix here is to add a default route. For the time being, I just did something like this:
[root@ruby ~]# route add -net 0.0.0.0 gw 192.168.3.1 eth0
[root@ruby ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.3.0     0.0.0.0         255.255.255.0   U     0      0        0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U     0      0        0 eth0
0.0.0.0         192.168.3.1     0.0.0.0         UG    0      0        0 eth0

In this case the route says to use 192.168.3.1 on eth0 for packets that aren't destined for one of the first two subnets. The destination 0.0.0.0 is the default route. I say this is temporary because a route added by hand like this is gone after the next boot. After creating that route I could send things to the Internet via eth0 (I was using a temporary hardware setup and connecting through a router on eth0). Or I should have been able to.
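If you'd rather not eyeball the routing table every time, the check can be scripted. This is only a sketch: `has_default_route` is a name I made up, and it assumes the column layout that `route -n` printed above (destination in column 1, genmask in column 3, flags in column 4).

```shell
#!/bin/sh
# has_default_route: read `route -n` output on stdin and succeed (exit 0)
# if some row has destination 0.0.0.0, genmask 0.0.0.0 and the G flag set.
# The function name is hypothetical, made up for this example.
has_default_route() {
    awk '
        $1 == "0.0.0.0" && $3 == "0.0.0.0" && $4 ~ /G/ { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Typical use on the gateway (needs the net-tools `route` command):
#   route -n | has_default_route || echo "no default route"
```

With the default route in place, the next test was a ping by IP address: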
[root@ruby ~]# ping 64.233.187.99
PING 64.233.187.99 (64.233.187.99) 56(84) bytes of data.
ping: sendmsg: Operation not permitted

This one was tougher to figure out. An "Operation not permitted" error from sendmsg typically means a local firewall rule refused the outgoing packet, and sure enough my firewall was misconfigured. I set up Bob Sully's very classy iptables firewall. This is not a firewall for the faint of heart. It's perfect for my situation though: it handles PPPoE easily, it has no goofy GUI (this machine is headless, no X Windows), and it's very accessible to tweaking through configuration files. My trouble here actually came from the fact that I was misusing the firewall. Turning it off allowed me to send pings. The reason it was blocking me is the temporary setup I mentioned earlier: the firewall was looking for an IP address for ppp0 but ppp0 wasn't connected at the time.
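A cheap sanity check before starting a firewall that depends on ppp0 is to see whether the interface has an address at all. The sketch below is not part of Bob Sully's firewall; `first_ipv4` is a made-up helper that pulls the first IPv4 address out of the one-line output of the iproute `ip` command.

```shell
#!/bin/sh
# first_ipv4: read `ip -4 -o addr show dev IFACE` output on stdin and print
# the first IPv4 address found, without any /prefix. Prints nothing if the
# interface has no address. The function name is hypothetical.
first_ipv4() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "inet") {
                split($(i + 1), parts, "/")
                print parts[1]
                exit
            }
        }
    }'
}

# Typical use before starting a firewall that needs the ppp0 address:
#   addr=$(ip -4 -o addr show dev ppp0 2>/dev/null | first_ipv4)
#   [ -n "$addr" ] || echo "ppp0 has no address yet"
```

If the address comes back empty, the link isn't up yet and a firewall keyed to ppp0 is likely to misbehave the way mine did.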

So I had screwed up my testing by trying to simplify it. I moved the network cabling around to put the gateway machine back inline in its proper place. I rebooted it to make sure I'd get normal reboot conditions and not rely on anything temporary (at least not without knowing it).

After rebooting, ppp0 came up but I still couldn't resolve names.

[root@ruby ~]# ping google.com
ping: unknown host google.com
[root@ruby ~]# digg google.com
-bash: digg: command not found
[root@ruby ~]# dig google.com

; <<>> DiG 9.3.4-P1 <<>> google.com
;; global options:  printcmd
;; connection timed out; no servers could be reached

Yeah that typo actually happened (I'm not proud). But DNS was clearly running on the DNS server that I wanted (ruby - the gateway machine).

[root@ruby ~]# dig google.com @ruby

; <<>> DiG 9.3.4-P1 <<>> google.com @ruby
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28530
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 4, ADDITIONAL: 0

;; QUESTION SECTION:
;google.com.                    IN      A

;; ANSWER SECTION:
google.com.             300     IN      A       64.233.187.99
google.com.             300     IN      A       72.14.207.99
google.com.             300     IN      A       209.85.171.99

;; AUTHORITY SECTION:
google.com.             345600  IN      NS      ns1.google.com.
google.com.             345600  IN      NS      ns2.google.com.
google.com.             345600  IN      NS      ns3.google.com.
google.com.             345600  IN      NS      ns4.google.com.

;; Query time: 21 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Mon Oct 13 00:04:41 2008
;; MSG SIZE  rcvd: 148
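The two dig runs differ only in which server was asked, and the useful part of each is a single line. Here's a rough filter that boils dig output down to one word. It's a sketch: `dig_status` is a made-up name, and it only looks for the two patterns that show up in the output above.

```shell
#!/bin/sh
# dig_status: read dig output on stdin and print one word:
#   "unreachable" if dig reported that no servers could be reached,
#   otherwise the status from the HEADER line (NOERROR, NXDOMAIN, ...),
#   or "unknown" if neither pattern was seen.
# The function name is hypothetical.
dig_status() {
    awk '
        /no servers could be reached/ { result = "unreachable" }
        /->>HEADER<<-/ {
            for (i = 1; i < NF; i++) {
                if ($i == "status:") {
                    s = $(i + 1)
                    sub(/,$/, "", s)   # drop the trailing comma
                    result = s
                }
            }
        }
        END { print (result == "" ? "unknown" : result) }
    '
}

# Typical use:
#   dig google.com | dig_status          # system resolver
#   dig google.com @ruby | dig_status    # ask ruby directly
```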

I figured out that the DNS server was running but /etc/resolv.conf was pointing at the router for DNS instead of ruby. The temporary fix is to edit it:

[root@ruby ~]# vi /etc/resolv.conf

Then fill in the address of the machine running named:
nameserver 192.168.3.102
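To see which nameservers the resolver is really using without opening the file, something like this works. It's only a sketch, and `list_nameservers` is a name I made up for it.

```shell
#!/bin/sh
# list_nameservers: print the address from each "nameserver" line of a
# resolv.conf-style file, one per line. With no argument it reads
# /etc/resolv.conf. The function name is hypothetical.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Typical use:
#   list_nameservers                  # what the resolver uses right now
#   list_nameservers /tmp/resolv.new  # check a candidate file first
```

Running it before and after a reboot makes it obvious when something has rewritten the file behind your back.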

My resolv.conf has a line that says it was written by dhclient-script. What happens is that, at some point, ifup ppp0 runs. This script uses the information in /etc/sysconfig/network-scripts/ifcfg-ppp0 to configure the interface and make the PPPoE connection. If you're trying to connect to your ISP with PPPoE on DSL then the "normal" way to do it from the command line is using ifup. I think everything else (GUIs and all that) is a wrapper around that script. So, as far as I can tell, when ifup ppp0 runs it uses dhclient as a DHCP client to get an IP address for this interface. This can be a little confusing on a gateway machine since (at least in my case) it's also running the DHCP server for the LAN.

When dhclient runs it somehow gets the idea that 192.168.3.1 is the nameserver it should list in resolv.conf. I don't know where it gets that idea at the moment and I'm out of steam for tonight. I can see though, from the dhclient-script manpage, that it supports some hooks. In this case, I could create a script (/etc/dhclient-up-ppp0-hooks) which dhclient-script would find and run before creating a new resolv.conf. In that script I'd have access to the nameservers and could tweak them in the variable $new_domain_name_servers.
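For the record, the body of such a hook might look roughly like the sketch below. Treat it as a guess, not a tested fix: `LOCAL_DNS` and `override_dns_for_ppp0` are names I made up, the address is the one I put in resolv.conf by hand, and which hook file actually gets sourced (and at what point) depends on the dhclient-script version, so check the manpage on your own box. The variables `$interface` and `$new_domain_name_servers` are the ones dhclient-script documents.

```shell
#!/bin/sh
# Sketch of a dhclient-script hook body. Hook files are sourced by
# dhclient-script, so variables changed here are visible to it afterwards.
# LOCAL_DNS and override_dns_for_ppp0 are hypothetical names.
LOCAL_DNS="192.168.3.102"   # assumed: the machine running named

override_dns_for_ppp0() {
    # Only touch leases for ppp0, and only if the lease carried nameservers.
    if [ "${interface:-}" = "ppp0" ] && [ -n "${new_domain_name_servers:-}" ]; then
        new_domain_name_servers="$LOCAL_DNS"
    fi
}

# A hook file would call the function once when it is sourced.
override_dns_for_ppp0
```

Wrapping the change in a function keeps it easy to try by hand with fake values before trusting it at boot.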

So why am I blogging about this instead of writing that script? Frankly I'm not convinced it's the right solution. The DHCP server seems to be configured to send out the right address. Other DHCP clients get the right nameserver address but this one doesn't. So I'm going to investigate more before giving in to this idea.

Oh, and while I'm on the subject of DHCP, dhcpd wasn't starting up earlier tonight. I think dhcpd was disabled so that I could use the DHCP server built in to the router while my gateway was out of commission. To get it running at boot I just did this:

[root@ruby init.d]# chkconfig --list | grep dhcp
dhcpd           0:off   1:off   2:off   3:off   4:off   5:off   6:off
[root@ruby init.d]# chkconfig --level 35 dhcpd on
[root@ruby init.d]# chkconfig --list | grep dhcp
dhcpd           0:off   1:off   2:off   3:on    4:off   5:on    6:off

You can see that I first checked whether dhcpd was set to start at boot, then used chkconfig to turn it on for runlevels 3 and 5, based on some info I found inside the /etc/init.d/dhcpd script. Since I had changed boot settings (and it was late at night and nobody but me was online), I rebooted the gateway box to make sure I'd done it right. Seems to have done it.
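The same before-and-after check can be scripted if you have more than one service to verify. This is a sketch that assumes the `name 0:off 1:off ...` layout chkconfig printed above; `service_on_at` is a made-up name.

```shell
#!/bin/sh
# service_on_at SERVICE RUNLEVEL: read `chkconfig --list` output on stdin
# and succeed (exit 0) if SERVICE is marked "on" at RUNLEVEL.
# The function name is hypothetical.
service_on_at() {
    awk -v svc="$1" -v lvl="$2" '
        $1 == svc {
            for (i = 2; i <= NF; i++) {
                if ($i == (lvl ":on")) found = 1
            }
        }
        END { exit found ? 0 : 1 }
    '
}

# Typical use:
#   chkconfig --list | service_on_at dhcpd 3 || echo "dhcpd is off at runlevel 3"
```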

I know this isn't all perfectly clear. If you've got a question about the stuff I touched on here, just leave a comment below.
