"It's the DNS, fix the DNS"
Nope, it turned out to be more annoying than that. A conspiracy of events, you might say. Friday I go to install a new server and hook up a new internet connection for a client. Saturday evening I am working on the new server when, poof, I lose connection. Looking at my SimpleHelp server I can see some machines are still online, but the links to the servers are gone. A switch has given out, I surmised. So early Sunday I call the owner and say you have a catastrophic failure and I need to go in. When I arrive the switch is powered off and one of the UPSs is screaming like a banshee.
Hmmmm. A UPS has failed; I wonder if it took the switch with it? So I take the switch out of the rack to test it separately (I had brought along a new PoE switch to swap in if it really was dead) but it booted with no errors. All lights green. Hmm, OK. So I start putting everything back, but things are not working 100% right.
The servers seem fine, but the Windows machines upstairs and the ones downstairs are having connectivity issues. They keep switching their network profile to say they are on a public network and stop sharing. I was fairly sure I'd plugged everything back as I found it. So I go through the cabinet looking for looped cables, or just something plugged in where it was not supposed to be, and find 5 redundant cables.
Cables not connected to anything and just looking like trouble, so I remove them, but still no joy. But some of the machines still seem OK. One's in a locked office, one's on the first floor; these all have internet and can see and mount the server. WTF? Then, to get weirder, I can reach a Raspberry Pi print server which is downstairs over the tinc VPN, but when I log in to it I cannot ping anything on the network, not even the gateway (which I assume I came in on or tinc would not work) or even its immediate neighbours on the same switch. All these machines seem to pick up their DHCP info fine and set the right address, then fail to see the server they just got that DHCP info from.
I was pulling my hair out. What in the world was going on?
By this time it had become very very late and I needed sleep, but the business starts working at 5am the next day (they are importers/exporters and need to be there for deliveries come Monday).
So I resolve to go back in with my physical networking test kit and check all the cables, replace switches, anything that might have been affected when the UPS blew out and took a POE switch with it. Maybe there was some sort of power surge and it went down the wires and fragged anything connected to it in some subtle way?
The machines upstairs I seemed to fix by plugging them directly into another 24-port switch, which was not the PoE one but had the server and the router plugged into it. Maybe the PoE switch was damaged more than I realised and was corrupting packets somehow?
So I start probing the network, first with EtherApe to see if there was any obvious and unusual traffic, but then I booted a live Ubuntu MATE CD on one of the machines downstairs that was affected and could not see its neighbours on the same switch.
This will sort it.
"Linux is so much better than windows."
Ubuntu booted fine and worked. It got a DHCP address and could access the internet and could see the servers on the network... until it couldn't. Then the same things started happening under Linux. I could not ping the gateway. An Nmap scan shows nothing on the subnet. I can't even reach the other Linux box plugged into the same switch!
OK, this was getting crazy. So I fire up Wireshark and start looking for dropped packets. This is when I noticed something very very odd. The router was sending out lots of ARP requests in the form of
"Who has x.x.x.x Address tell <router ip> please"
for seemingly totally random IP addresses outside the range of the network it was on. It was flooding the network with them. Did I have some sort of loop in the router? Impossible! There was only one cable into the router, so nothing can loop back on itself. Hmm, I wonder what happens if I unplug it? BANG! The network is back to normal. WTF?
So I plug directly into the router with my laptop and watch as it floods the connection with 10,000 ARP requests a second.
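For anyone wanting to spot this sort of thing faster than by eyeballing a capture, here is a rough sketch in Python of the two checks that gave the router away: how many ARP requests each sender fires off per second, and how many of them ask for addresses outside the LAN. It assumes you already have the requests as (timestamp, sender IP, target IP) tuples, for instance exported from a capture. The function names, the 192.168.1.0/24 subnet in the usage note and the 100-per-second threshold are all made up for illustration, not anything from the actual network.

```python
from collections import Counter, defaultdict
from ipaddress import ip_address, ip_network


def summarise_arp_requests(requests, lan_cidr):
    """Summarise ARP requests per sender.

    requests: iterable of (timestamp_seconds, sender_ip, target_ip) tuples.
    lan_cidr: the local subnet as a string (hypothetical value in this sketch).
    Returns {sender_ip: {"total", "peak_per_second", "off_subnet"}}.
    """
    lan = ip_network(lan_cidr)
    per_second = defaultdict(Counter)  # sender -> {whole second: request count}
    off_subnet = Counter()             # sender -> requests for non-LAN targets
    for timestamp, sender, target in requests:
        per_second[sender][int(timestamp)] += 1
        if ip_address(target) not in lan:
            off_subnet[sender] += 1
    return {
        sender: {
            "total": sum(buckets.values()),
            "peak_per_second": max(buckets.values()),
            "off_subnet": off_subnet[sender],
        }
        for sender, buckets in per_second.items()
    }


def suspected_flooders(summary, max_per_second=100):
    """Return senders whose busiest second exceeds max_per_second (arbitrary cut-off)."""
    return sorted(sender for sender, stats in summary.items()
                  if stats["peak_per_second"] > max_per_second)
```

Calling `suspected_flooders(summarise_arp_requests(records, "192.168.1.0/24"))` on a capture like the one described would be expected to return just the router's address; a healthy host asking for its gateway now and then stays well under the threshold.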
Screw that! I factory reset the router, then proceeded to turn off anything that might cause it to flood ARP. DHCP server and client, DNS server, IPv6 auto configuration, service, anything.
Then I noticed something odd. Under the DHCP server settings, in order to save that page, even with the button to turn it off checked, it still forces you to put in a DHCP range. It has a list of default values and one was 10.0.0.0-10.0.0.254. Even when the DHCP server was off it was sending out ARP requests.
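A quick sanity check for this kind of leftover setting can be sketched in a few lines of Python: does the configured DHCP range actually sit inside the LAN's subnet? The function name is made up, and the post never states the LAN's real subnet, so the 192.168.1.0/24 used in the usage note is purely an assumption for illustration.

```python
from ipaddress import ip_address, ip_network


def dhcp_range_fits_lan(range_start, range_end, lan_cidr):
    """Return True if the whole DHCP range lies inside the LAN subnet."""
    lan = ip_network(lan_cidr)
    start, end = ip_address(range_start), ip_address(range_end)
    if start > end:
        raise ValueError("range start is after range end")
    # A subnet is one contiguous block of addresses, so checking both ends is enough.
    return start in lan and end in lan
```

With the made-up subnet above, `dhcp_range_fits_lan("10.0.0.0", "10.0.0.254", "192.168.1.0/24")` comes back `False`, which is the mismatch worth flagging, while a range such as 192.168.1.100 to 192.168.1.200 passes.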
WTF?
So I changed the range to something more sane, made sure the DHCP server was still turned off, and saved the page. I analysed the traffic like before. It was quiet. Soooo quiet.
Belgium man! I'd done it!
It was the router all along, and the UPS going out and the PoE switch were all just a distraction.
For the record, I suspect the reason that I could see the Raspberry Pi on the VPN is because it had connected over IPv6 without needing an IPv4 gateway. Because tinc too was using UDP, I assume its traffic got swept along with the router's instead of being drowned by it. But, and this is the takeaway, all machines were affected, but depending on where in the switch they were plugged and how far away physically they were from the port with the router plugged into it, they were able to communicate with their neighbours. Not all switch ports are equal, and the switch will always try and route packets by the shortest route. Hence some machines worked and some connected to other switches did not.
Networking is fun!
So yeah. Not as simple as a DNS issue. 😁
This calls for a basic diagnostics suite to start investigations with :)
We have a test suite running docker containers and it has become much much more useful ever since we created a bash script that tries to do basic analysis of the results and gives hints on what failures may mean.
Right now even when I don't have the confidence to automate writing tasks, I spend an extra day or two making scripts to dig around and gather data for me to make a decision. It's such a time saver!
I suspect there are tools that integrate with pcaps to analyse the traffic and detect suspicious symptoms.
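As a taste of what such a script could look like, here is a minimal sketch in Python that pulls ARP requests straight out of a classic .pcap file using only the standard library. It is an illustration under stated assumptions, not a replacement for a real capture tool: it only understands the classic libpcap format with microsecond timestamps on plain Ethernet, it skips VLAN-tagged frames and pcapng files, and the function name is made up.

```python
import socket
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic libpcap magic, microsecond timestamps
ETHERTYPE_ARP = 0x0806
ARP_REQUEST = 1


def arp_requests_from_pcap(data):
    """Yield (timestamp, sender_ip, target_ip) for each ARP request found in
    the bytes of a classic .pcap file captured on Ethernet (link type 1)."""
    if len(data) < 24:
        raise ValueError("too short to be a pcap file")
    # The magic number reveals the byte order the capture was written in.
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC_USEC:
        endian = "<"
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC_USEC:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")
    if struct.unpack(endian + "I", data[20:24])[0] != 1:
        raise ValueError("only Ethernet captures are handled")
    offset = 24  # skip the global header
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        frame = data[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        # 14-byte Ethernet header plus 28-byte IPv4-over-Ethernet ARP body.
        if len(frame) < 42:
            continue
        if struct.unpack("!H", frame[12:14])[0] != ETHERTYPE_ARP:
            continue
        htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
        if (htype, ptype, hlen, plen, oper) != (1, 0x0800, 6, 4, ARP_REQUEST):
            continue
        sender_ip = socket.inet_ntoa(frame[28:32])
        target_ip = socket.inet_ntoa(frame[38:42])
        yield ts_sec + ts_usec / 1_000_000, sender_ip, target_ip
```

Reading a capture with `open(path, "rb").read()` and passing the bytes in gives a stream of tuples that a hint-giving script could count per sender or compare against the expected subnet.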
Yeah.. well it was a confluence of events really. The fact that the UPS blew out really threw me for a loop, because as far as I knew everything was working the night before. Turns out I was soooooooooooooo wrong, but I only found that out after pulling my hair out at these really weird symptoms. How would you deal with a device that is on a VPN over the internet but not connected to any other devices, including those plugged into the same switch?
I suspect that "I would ask ops for help" isn't a good answer to the last question ;)
Not when I am the ops, no ;) Still, it is a good war story and I've learned a valuable lesson even after a decade+ of doing this sort of thing. You cannot take anything for granted, no matter how obvious it looks.