'''November 30, 2005''': Metrix-naya-sw hung at about 3:00am when I ran "athdebug +recv" in order to collect information on MAC addresses going through each node. It is not passing traffic, so metrix-commons and metrix-west are currently unreachable. So, metrix-naya-sw needs a power cycle as soon as possible. --RussellSenior
'''November 29, 2005''': Buick got sick and was rebooted. In fact, it is still sick and will be replaced, hopefully on Wednesday evening, with a nucab, at least temporarily. We also power cycled the Edimax AP in the dog shop in Mississippi Commons. It appears to be functioning now.
'''November 21, 2005''': RussellSenior built a freshened kernel (2.6.14.2) and madwifi-ng (rev 1329), installed them on metrix-naya-sw, metrix-commons, and metrix-west, and rebooted. The new madwifi-ng rev was built in a metrix-compatible chroot environment, so the madwifi-utils in /usr/local/bin are now linked properly. The other two metrixes, metrix-naya-nw and metrix-ed, are still running the original 2.6.12.3-metrix kernel and the WDS-branch madwifi drivers from late July. It is possible to connect with essentially zero packet loss from buick to metrix-west if you simultaneously "ping -f 10.11.104.2" from buick. Metrix-west was apt-get upgraded that way.
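For reference, the background-flood trick looks roughly like this (a sketch only; it assumes you are on buick, that 10.11.104.2 is the address mentioned above, and that metrix-west is reachable by that name with a root login):
{{{
# On buick: keep traffic flowing across the link (ping -f usually needs root)
ping -f 10.11.104.2 > /dev/null &

# While the flood runs, the far node is usable from another session, e.g.:
ssh root@metrix-west 'apt-get update && apt-get upgrade'

# Stop the background flood when finished
kill %1
}}}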
'''November 20, 2005''': RussellSenior thinks he's figured out what is going wrong. It is an effect that shows up in client-node to client-node traffic when the traffic needs to pass through one of the client bridges. As mentioned earlier, when a client-node sends to a client-node, it sees the traffic twice: once when it sends it and once (in promiscuous mode) when the master rebroadcasts it. When the traffic originates on the other side of the bridge (say, from buick on eth0) and the bridge sees the rebroadcast packet it just sent, with buick's MAC as SRC, arrive on ath0, it reassigns that MAC to the bridge port associated with ath0, not eth0. When packets return headed for that MAC, they reach the bridge and the bridge fails to deliver them to the port where that MAC actually lives. Boom. This problem does not occur when communicating client-to-master (or master-to-client), because those packets are not rebroadcast. It also doesn't occur when the communication is strictly client-to-client, because even though the client still sees the rebroadcast packet, the bridge is smart enough not to reassign local MAC addresses to a different port.
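You can watch the misassignment happen by inspecting the bridge's forwarding table on metrix-naya-sw while the problem traffic flows. A minimal sketch, assuming the standard bridge-utils tools are available and using br0 as a placeholder bridge name:
{{{
# On metrix-naya-sw: find the real bridge name and its member ports
brctl show

# List the learned MAC table; the "port no" column shows which bridge
# port each MAC is currently assigned to (br0 is a placeholder)
brctl showmacs br0

# Re-run it while buick's traffic is being rebroadcast by the master;
# if buick's MAC jumps from the eth0 port to the ath0 port, that is
# the failure mode described above
watch -n 1 'brctl showmacs br0'
}}}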
RussellSenior tested this model this morning by ping flooding from buick to metrix-commons (thus keeping metrix-naya-sw's bridge refreshed with where buick's MAC should properly live) while pinging the problematic metrix-west. Still some lossage, but far less than the usual 98%, only about 17%.
Now the question is, what is the solution? One temporary solution might be to use ebtables filtering to drop packets at metrix-naya-sw where buick's MAC shows up on ath0 as a SRC MAC. But there are other situations where we'll see the same phenomenon, e.g. 11b/g clients of the 11a client nodes. The real solution is to get the sending bridges to ignore the rebroadcasts altogether.
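For what it's worth, a rough sketch of what that ebtables stopgap might look like (the MAC below is just a placeholder for buick's real address, and the rule is placed in the nat table's PREROUTING chain on the guess that the drop needs to happen before the bridge learns the source address; whether that is early enough would need testing):
{{{
# On metrix-naya-sw: drop frames arriving on ath0 that carry buick's MAC
# as their source, so the bridge (hopefully) never re-learns that MAC on
# the ath0 port.  00:11:22:33:44:55 is a placeholder.
ebtables -t nat -A PREROUTING -i ath0 -s 00:11:22:33:44:55 -j DROP

# To remove the rule again:
#   ebtables -t nat -D PREROUTING -i ath0 -s 00:11:22:33:44:55 -j DROP
}}}
As noted above, this only papers over the buick case; every bridged client (e.g. the 11b/g clients of the 11a client nodes) would need its own rule, so it is a stopgap at best.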
See ["MississippiNetworkNewsArchive"] for archived news items.