Differences between revisions 15 and 16
Revision 15 as of 2005-12-16 17:03:50
Size: 6555
Comment: metrix-west up and functioning!
Revision 16 as of 2005-12-16 18:04:28
Size: 3418
Comment: trimmed older news, moved to MississippiNetworkNewsArchive
Line 13: Line 13:
'''November 30, 2005''': Metrix-naya-sw hung at about 3:00am when I ran "athdebug +recv" in order to collect information on MAC addresses going through each node. It is not passing traffic, so metrix-commons and metrix-west are currently unreachable. So, metrix-naya-sw needs a power cycle as soon as possible. --RussellSenior
Line 15: Line 14:
'''November 29, 2005''': Buick got sick and was rebooted. In fact, it is still sick and will be replaced, hopefully on Wednesday evening, with a nucab, at least temporarily. We also power cycled the Edimax AP in the dog shop in Mississippi Commons. It appears to be functioning now.

'''November 21, 2005''': RussellSenior built a freshened kernel (2.6.14.2) and madwifi-ng (rev 1329), installed them on metrix-naya-sw, metrix-commons, and metrix-west, and rebooted. The new madwifi-ng rev was built in a metrix-compatible chroot environment and so the madwifi-utils in /usr/local/bin are now linked properly. The other two metrixes, metrix-naya-nw and metrix-ed, are still running the original 2.6.12.3-metrix kernel and the WDS-branch madwifi drivers from late July. It is possible to connect with essentially zero packet loss from buick to metrix-west if you simultaneously "ping -f 10.11.104.2" from buick. Metrix-west was apt-get upgraded that way.

'''November 20, 2005''': RussellSenior thinks he's figured out what is going wrong. It is an effect caused by client-node to client-node traffic when that traffic needs to pass through one of the client bridges. As mentioned earlier, when a client-node sends to a client-node, it sees the traffic twice, once when it sends it and once (in promiscuous mode) when the master rebroadcasts it. When the traffic is passing from the other side of the bridge (say, from buick on eth0), and the bridge sees the rebroadcast of the packet it just sent, carrying buick's SRC MAC but arriving on ath0, the bridge reassigns that MAC to the bridge port associated with ath0, not eth0. When packets return headed for that MAC, they get to the bridge and the bridge fails to deliver to the port where that MAC actually lives. Boom. This problem does not occur when communicating client-to-master (or master-to-client), because these packets are not rebroadcast. The problem doesn't occur when the communication is strictly client-to-client, because even though the client still sees the rebroadcast packet, the bridge is smart enough to know not to reassign local MAC addresses to a different port.
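
The model above can be sketched as a toy learning bridge. This is only an illustration of the suspected mechanism, not the actual Linux bridge code: the LearningBridge class, the MAC addresses, and the port bookkeeping are all made up for the example, and the only assumptions are the ones stated in the entry (a bridge relearns a source MAC on whatever port it last saw it, except for its own local MACs).

```python
class LearningBridge:
    """Toy learning bridge: remembers the port each source MAC was last seen on."""

    def __init__(self, local_macs=()):
        self.local_macs = set(local_macs)  # MACs of the bridge's own interfaces
        self.fdb = {}                      # forwarding table: MAC -> port name

    def receive(self, port, src_mac):
        """Learn src_mac on port, unless it is one of the bridge's own MACs."""
        if src_mac not in self.local_macs:
            self.fdb[src_mac] = port

    def lookup(self, dst_mac):
        """Port a frame for dst_mac would be sent to (None means unknown)."""
        return self.fdb.get(dst_mac)


# Hypothetical addresses, for illustration only.
BUICK = "00:00:00:00:00:01"    # host on the wired side (eth0)
NAYA_SW = "00:00:00:00:00:02"  # the client bridge's own ath0 MAC

bridge = LearningBridge(local_macs={NAYA_SW})

# 1. buick's frame arrives on eth0; the bridge learns buick lives on eth0.
bridge.receive("eth0", BUICK)
port_before = bridge.lookup(BUICK)

# 2. The bridge sends the frame out ath0, the master rebroadcasts it, and the
#    client radio (promiscuous) hears it again on ath0 with buick's SRC MAC.
bridge.receive("ath0", BUICK)
port_after = bridge.lookup(BUICK)

# 3. Strictly client-to-client: the rebroadcast carries the bridge's own MAC,
#    which is never relearned on another port.
bridge.receive("ath0", NAYA_SW)
local_learned = NAYA_SW in bridge.fdb

print("buick before rebroadcast:", port_before)
print("buick after rebroadcast: ", port_after)
print("local MAC relearned:     ", local_learned)
```

After step 2 the table points buick's MAC at ath0, so replies headed for buick would go back out the radio instead of eth0, which matches the loss seen on paths through the client bridge.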

RussellSenior tested this model this morning by ping flooding from buick to metrix-commons (thus keeping metrix-naya-sw's bridge refreshed with where buick's MAC should properly live) while pinging the problematic metrix-west. Still some lossage, but far less than the usual 98%, only about 17%.

Now the question is, what is the solution? One temporary solution might be to use ebtables filtering to drop packets at metrix-naya-sw where buick's MAC shows up on ath0 as a SRC MAC. But there are other situations where we'll see the same phenomenon, e.g. 11b/g clients of the 11a client nodes. The real solution is to get the sending bridges to ignore the rebroadcasts altogether.

December 16, 2005: As of about 3:30pm, the southern branch of the Mississippi Network was converted to a WDS configuration, and simultaneously, the problematic metrix-west (N Missouri and Failing) began to function properly. RussellSenior visited the neighborhood and confirmed the ability to get DHCP resolution from metrix-west and was able to roam seamlessly to the nodes at Mississippi Commons and the NAYA building (Mississippi and Shaver). Will need to convert the northern branch as well now. Thanks for everyone's patience as we sorted through the problem. We are now poised to further grow the network with much less turmoil and delay.

December 12, 2005: RussellSenior has autogenerated /etc/network/interfaces for each of the metrixes for using a WDS configuration. On the test rig, he has been running a ping for the last 20 hours from one client to another (as described in the December 4 entry) and seems to have a consistent 3.4% ping loss rate. We are still seeing a kernel panic after ifdown/ifup'ing the interface (as described [http://madwifi.org/ticket/222 here]), but we believe the problem is tolerable since the metrixes are rebooting themselves on panic. The goal is to get the WDS configurations installed this week, possibly on Thursday.

December 9, 2005: Last night, MichaelWeinberg, JenSedell, and RussellSenior distributed flyers at the Mississippi Art Walk. We may have located another willing roof host at the furniture shop on Mississippi down near Fremont. Russell continues to work on a metrix configuration that will work reliably. Current status is that WDS is working, bridging works, slightly lossy, panics on ifdown/ifup, but at least it is rebooting itself and coming back up in good shape. Maybe an interim solution is just always rebooting to ifup interfaces.

December 4, 2005: I have had partial success using WDS bridging on a test bed consisting of two metrixes and a router/AP using the madwifi-ng drivers and a multiple VAP configuration. I am able to ping from a client-11g -> WDS-11a -> WDS-11a -> WDS-11a -> client-11b, which is essentially what wasn't working before. Pings aren't without a few dropped packets, but relatively few (~3%). The most significant problem now is that I am having trouble getting the backhaul radios to consistently come up in 11a mode. Perhaps some timing issue. Also, I've seen some oopses, not always fatal. I should probably sync everyone up to the latest rev of madwifi-ng. Anyway, hopeful news! With luck, this will get ironed out in the next few days and we'll be able to deploy it.

December 2, 2005: Became aware in the late afternoon that the nucab's DHCP server was not running. The connection was fine, but clients weren't getting configured, which, uh, reduced utility. AaronBaer patched up the deficiencies and as of about 3:40pm the DHCP server appears to be running again. We are talking about ways to facilitate more expeditious outage reports. --RussellSenior

December 1, 2005: Buick replaced with a nucab box. TroyJaqua and RussellSenior fixed a small bug consisting of a missing /etc/network/nat.sh script and it started working. Network functioning again. Modified ebtables on metrix-naya-sw to reflect the new gateway (substituting its MAC address for buick's eth1).

See ["MississippiNetworkNewsArchive"] for archived news items.

MississippiNetworkNews (last edited 2020-12-19 15:02:20 by RussellSenior)