Last Updated: 2022-07-15 16:55:08 UTC
by Rob VandenBrink (Version: 1)
This all started with a text from a client saying that their network was down (that's how these always start). The first set of checks showed that the network had connectivity and there were both PCs and phones communicating. The client noted that their switches were all flashing "way more rapidly than usual" - like a fool I discounted that observation, because of course your switch LEDs flash rapidly, right? (more on this later)
A bit more investigation showed that DHCP wasn't working. The server admin was seeing some "weird entries" on their DHCP server, and was inclined to blame Azure and call this a server problem. I reviewed the switch logs and was seeing that routing peers were connecting / disconnecting at odd intervals, so I was inclined to call this a carrier issue.
A few hours go by, and still no progress. We were seeing IP phones lose registration and new computers still weren't getting an IP from DHCP. I tossed a few switches in my back seat and got in the car for an on-premises visit.
DHCP still wasn't working, so I put a temporary DHCP server on the core switch, which STILL didn't work!
A PCAP showed that I was seeing kazillions of DHCP "Discover" packets, and nothing else. DHCP has four steps in the sequence, often called the "DORA" sequence for:
- DISCOVER (from the client)
- OFFER (from the server)
- REQUEST (from the client)
- ACKNOWLEDGE (from the server)
I was seeing only the first packet in each sequence. At this point I notice that the switch is at 100% CPU, and the penny drops.
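If you'd rather not eyeball a capture for this, the check can be sketched in a few lines of Python. This is a minimal sketch, not what I ran on the day: it assumes you have already pulled the raw DHCP messages (the UDP payloads) out of the PCAP some other way, and the function names, the fake_dhcp_payload helper and the min_discovers threshold are all made up for illustration. The only protocol facts it leans on are that the message type lives in DHCP option 53, after the 236-byte BOOTP header and the magic cookie.

```python
from collections import Counter
from typing import Iterable, Optional

BOOTP_HEADER_LEN = 236                              # fixed BOOTP fields
DHCP_MAGIC_COOKIE = bytes([0x63, 0x82, 0x53, 0x63])  # marks the start of options
MESSAGE_TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 5: "ACK"}


def dhcp_message_type(payload: bytes) -> Optional[str]:
    """Return the DHCP message type (option 53) of a raw DHCP message."""
    options_start = BOOTP_HEADER_LEN + len(DHCP_MAGIC_COOKIE)
    if payload[BOOTP_HEADER_LEN:options_start] != DHCP_MAGIC_COOKIE:
        return None  # too short, or not DHCP at all
    i = options_start
    while i < len(payload):
        code = payload[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option, single byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break                # truncated option
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return MESSAGE_TYPES.get(value[0], f"TYPE_{value[0]}")
        i += 2 + length
    return None


def summarize(payloads: Iterable[bytes]) -> Counter:
    """Count DHCP message types across a set of raw DHCP messages."""
    counts: Counter = Counter()
    for payload in payloads:
        name = dhcp_message_type(payload)
        if name is not None:
            counts[name] += 1
    return counts


def discovers_going_unanswered(counts: Counter, min_discovers: int = 100) -> bool:
    """Hypothetical rule of thumb: lots of DISCOVERs and not a single OFFER."""
    return counts["DISCOVER"] >= min_discovers and counts["OFFER"] == 0


def fake_dhcp_payload(message_type: int) -> bytes:
    """Build a bare-bones DHCP message for testing (illustration only)."""
    return bytes(BOOTP_HEADER_LEN) + DHCP_MAGIC_COOKIE + bytes([53, 1, message_type, 255])
```

Feeding it 500 fake DISCOVERs and nothing else, discovers_going_unanswered(summarize(...)) returns True; a healthy capture with all four DORA steps returns False.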
This client site had never implemented the Layer 2 protections that I routinely install. At this point you might think "rogue DHCP server?", but you'd be off base.
Yup, this was the mythical "broadcast storm", something I've only seen three times in real life, on average once every 10-15 years .. I thought with modern switches we were done with this.
The sequence of events was:
- When we put the switches in, we eradicated with extreme prejudice all of the old, unmanaged switches and hubs (yes, hubs - figure 10 years ago).
- However, as with so many clients, "throw away" is a relative thing. Apparently these went to a pile in the corner, which someone would take care of "later".
- As in so many examples, later never comes. 10 years later, a new printer is needed where there isn't a spare drop, and the person doing the install pulls a switch from the pile, because they weren't there 10 years ago, and nobody told them it was the junk pile.
- All is well, until someone has a workstation problem, and in the dark, under the desk, accidentally plugs one switch port into the other on that old piece of gear.
- The final piece of the puzzle is that desktop switch wasn't just an unmanaged switch, it also didn't support spanning tree
Now you see where this is going. The very next DHCP request that hits the network is a broadcast. It reaches this unmanaged switch, and hits the loop. It goes round and round that loop without end, creating a new broadcast each time around the merry-go-round. And so does the next DHCP request, and so on, and so on.
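To make the merry-go-round concrete, here is a toy simulation of a single broadcast hitting a switch with two of its own ports patched together and no spanning tree. It is a sketch under simplifying assumptions (one switch, one frame, every hop takes one tick), and the port names and the simulate_flood function are invented for the example, not taken from the client's network.

```python
from typing import Dict, List, Tuple


def simulate_flood(
    ports: List[str],
    patched: Dict[str, str],
    ingress: str,
    hops: int,
) -> Tuple[Dict[str, int], int]:
    """Flood one broadcast through a switch that has no loop protection.

    ports   - all ports on the switch
    patched - ports cabled back into the same switch (port -> far end)
    ingress - port the original broadcast arrives on
    hops    - how many forwarding rounds to simulate

    Returns (copies sent out each normal port, copies still circulating).
    """
    sent = {p: 0 for p in ports if p not in patched}
    arriving = [ingress]  # ports where a copy of the frame is arriving
    for _ in range(hops):
        next_arriving = []
        for in_port in arriving:
            # A broadcast is flooded out every port except the one it came in on.
            for out_port in ports:
                if out_port == in_port:
                    continue
                if out_port in patched:
                    # The copy comes straight back in on the other looped port.
                    next_arriving.append(patched[out_port])
                else:
                    sent[out_port] += 1
        arriving = next_arriving
    return sent, len(arriving)


PORTS = ["uplink", "printer", "loop_a", "loop_b"]
LOOP = {"loop_a": "loop_b", "loop_b": "loop_a"}
```

With the loop in place, simulate_flood(PORTS, LOOP, "uplink", 1000) sends 1998 copies of that one broadcast back up the uplink and still has 2 copies circulating (one in each direction), so it never ends. With an empty patched dict the same frame is forwarded once and the network goes quiet.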
Everything on the network sees those broadcasts, and every broadcast is handled by the CPU of each station (because switch ASICs and on-board NIC features are mostly about forwarding and packet optimization). So the CPU utilization of everything on the network goes up, but most notably the core switch goes to 100% and stays there. The DHCP server doesn't have enough CPU to work, and routing protocols start to drop at odd intervals for the same reason.
The fix? We implemented broadcast controls. I usually cap this in "packets-per-second", rather than in "percent of bandwidth" - mainly because on 1 and 10G switches even 1% is a lot! On a Cisco, a simple implementation usually looks something like this:
storm-control broadcast level pps 200 50
storm-control multicast level pps 200 50
storm-control action trap
storm-control action shutdown
There are more knobs you can tweak of course, but this usually does the job.
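For a feel of why I prefer packets-per-second over percent, here is the back-of-envelope arithmetic as a small Python sketch. It assumes the worst case of minimum-size (64 byte) Ethernet frames, plus the 20 bytes of preamble and inter-frame gap each frame costs on the wire; real broadcasts such as DHCP are larger, so real-world numbers will be lower. The function names are just for illustration.

```python
WIRE_OVERHEAD_BYTES = 20  # 8 bytes preamble/SFD + 12 bytes inter-frame gap
MIN_FRAME_BYTES = 64      # smallest legal Ethernet frame


def line_rate_pps(link_bps: int, frame_bytes: int = MIN_FRAME_BYTES) -> int:
    """Maximum frames per second a link can carry at a given frame size."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps // bits_per_frame


def pps_at_percent(link_bps: int, percent: int, frame_bytes: int = MIN_FRAME_BYTES) -> int:
    """Frames per second allowed through by a percent-of-bandwidth cap."""
    return line_rate_pps(link_bps, frame_bytes) * percent // 100


GIG = 1_000_000_000
TEN_GIG = 10_000_000_000
```

pps_at_percent(GIG, 1) works out to 14880 packets per second and pps_at_percent(TEN_GIG, 1) to 148809, so under these assumptions a 1% cap still lets through roughly 74 times (1G) or 744 times (10G) more broadcasts than a 200 pps threshold.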
On an HP Procurve ("shutdown" is also an option for the control action):
storm-constrain broadcast pps 200 50
storm-constrain multicast pps 200 50
storm-constrain control block
storm-constrain enable trap
storm-constrain enable log
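The two numbers in those commands (200 and 50) are an upper and a lower threshold. As I understand it, when the port is set to block rather than shut down, traffic is suppressed once the rate goes over the upper value and is only let through again once it falls below the lower one. The Python below is a rough sketch of that hysteresis, assuming one rate sample per second; the exact comparison (over vs. at-or-over) and the sampling interval vary by platform, and the function name is made up.

```python
from typing import Iterable, List


def storm_control_blocking(
    pps_samples: Iterable[int],
    rising: int = 200,
    falling: int = 50,
) -> List[bool]:
    """Return, per sample, whether broadcasts would be blocked on the port.

    Blocking starts when the rate exceeds `rising` and only stops once the
    rate drops below `falling`, so the port does not flap around one value.
    """
    blocking = False
    states = []
    for pps in pps_samples:
        if not blocking and pps > rising:
            blocking = True
        elif blocking and pps < falling:
            blocking = False
        states.append(blocking)
    return states
```

For the samples [10, 120, 250, 180, 60, 40, 120] this returns [False, False, True, True, True, False, False]: the port stays blocked at 180 and 60 pps even though both are under 200. With a shutdown action there is no way back down; the port stays disabled until someone (or an automatic recovery timer) re-enables it.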
The offending port was immediately found and went to ERR-DISABLE state, and everything else on the network immediately went back to normal.
The moral of the story? Some people will pathologically see any network problem, and think "it must be a broadcast storm" or "it must be the firewall". Just saying, sometimes they're right. Even a broken clock is right twice a day, and this was one of those days. I've seen an actual broadcast storm 3 times in 40 years - this one, a similar one where the remote switch was a VoIP phone (and the cable from the phone's PC pass-through Ethernet port was plugged back into the wallplate), and one where an engineer was running a test suite on a movie camera under development, on the PROD network.
The other moral of the story? Quite often your attacker is "inside the house"