Going from linux 4.11 to 4.12, currently 4.12.4, the bonding module received a patch, bonding: fix active-backup transition https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit?id=3f3c278c94dd994fe0d9f21679ae19b9c0a55292 which requires correct reporting of link speed using the kernel net/core/ethtool.c __ethtool_get_link_ksettings() giving an error with all wireless drivers tested, presumably at 581 err = dev->ethtool_ops->get_settings(dev, &cmd); The consequences are two-fold. Note: drivers/net/bonding/bond_main.c if (bond_update_speed_duplex(slave)) { slave->link = BOND_LINK_DOWN; netdev_warn(bond->dev, "failed to get link speed/duplex for %s\n", slave->dev->name); continue; } 1) While the wireless drivers work perfectly well alone, and the wired network interfaces continue to work with the bonding module, when used in conjunction with the bonding module, a wireless interface will be put into the "down" state, and will not work with the bonding module. 2) This "bond_update_speed_duplex(slave)" function executes 10 times per second, and a) the log file will be "spammed" with "failed to get link speed/duplex for blah" warnings continuously, 10 times per second, and b) these log messages may be sent to the console, 10 times per second, effectively creating a "Denial of Service" at the console. A remote terminal is then needed to reconfigure networking, to remove the wireless slave from the bonding module. After two weeks, Mahesh Bandewar, the Author of this patch has been entirely unresponsive to direct email. Unless Mahesh has been on vacation, or suffering some life crisis, I will infer that he is permanently MIA. There also has been no response from Matthew Wilcox <matthew@wil.cx>, the Author of the kernel ethtool, which the bonding module uses to communicate with the wireless drivers. Other people are beginning to complain on the Arch bug tracker about the bugs caused by this patch. This patch, as it is, simply breaks wireless bonding! The bonding module is commonly used to provide automatic switching between wired and wireless networking on laptop/notebook computers. It seems to me that, at least, Mahesh's patch should be reverted immediately, to get wireless bonding working again. If some additional fix can be created quickly, that would be nice, but wireless bonding should not be left broken. It is not entirely clear whether the proper fix for the "fix active-backup transition" problem involves only changes to the bonding module, or also changes to the kernel ethtool, or to all the wireless drivers. But that determination should be made *after* Mahesh's patch has been reverted and reworked.
James, can you please provide a link to the bugtracker for Arch Linux? While bonding with wireless isn't necessarily something that there was an expectation that people would want to use, we should be able to come up with something to satisfy both you and Mahesh. I was also hoping that Mahesh would have a chance to respond to this before jumping-in and pushing for a revert since this was clearly fixing something that was critical to him and his infrastructure.
Arch Linux FS#54922 - [linux] 4.12 - bonding module not working with wireless https://bugs.archlinux.org/task/54922 BTW, from the patch, Mahesh says: "link-state transition is used to change the active link ... Also when speed and duplex settings for a link are updated during a link-event, the link-status should not be changed to invoke correct transition logic." Andy, please maybe translate that, or contextualize that statement, so that we can understand the alleged problem, tersely summarized at "The above mentioned patch broke that logic." I'm trying to understand the original problem "an attempt to keep slave state consistent with speed and duplex settings.", the attempted "fix", "This patch fixes this issue by moving the link-state update outside of the bond_update_speed_duplex() fn and to the places where this fn is called and update link-state selectively.", and how it was reasoned that this "fix" should be allowed to, or was required to, disable wireless networking, and why it should need to add warnings to the log file at a rate of 10 time per second. At first glance, none of that is comprehensible. There is too much missing and presumed information, and/or too much dependence upon jargon. It really does not seem to have been thought-through, or I'm not able to follow the reasoning.
Hi, I'm an arch linux developer/trusted user. I have been using this setup of bonding with wired/wireless with active failover for 5 years now. This update completely broke this setup. I should mention that the wireless card is fully functional, outside of the context of the bonding interface. I run wpa_supplicant on it, and it is able to join networks. But, as soon as the bond interface is running with no active wireless interface, connectivity stops and the bond interface goes down. Not to mention the logging spamming that happens every second the wireless interface is up and a member of the bonding interface.
For reference, the earlier commit, c4adfc822bf5, "bonding: make speed, duplex setting consistent with link state": https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=c4adfc822bf5d8e97660b6114b5a8892530ce8cb "bond_update_speed_duplex() retrieves speed and duplex settings. There is a possibility of failure in retrieving these values but caller has to assume it's always successful. This leads to having inconsistent slave link settings. If these (speed, duplex) values cannot be retrieved, then keeping the link UP causes problems. The updated bond_update_speed_duplex() returns 0 on success if it retrieves sane values for speed and duplex. On failure it returns 1 and marks the link down." That comment does not say what problems are being caused by keeping the wireless link in an UP state. And, that comment does not say whether "If these (speed, duplex) values cannot be retrieved" is a problem caused by the wireless drivers, by the kernel ethtool, or by some failing in the bonding module. But then, regardless, "On failure it returns 1 and marks the link down." seems quite inappropriate, just "nuking" the wireless interface. What/where is the actual cause of the failure to properly report speed and duplex from the wireless drivers? If this is a real problem with the wireless drivers, then absolutely they should be made to properly report speed and duplex, and the log warning is appropriate. I have seen wireless link speed report failures with some wireless utilities. But in no case is a continuous 10 times per second log warning appropriate!
This should fail open, not fail closed. Unless there can be a guarantee that all the wireless drivers return the proper value when requested the speed/duplex settings, the bonding driver should accommodate for this, in the case the interface is a wireless one. The logging is clearly wrong, but it's mostly a nuisance. The fact the bonding if is brought down because of this is the main issue.
*** Bug 196531 has been marked as a duplicate of this bug. ***
Not all bonding modes use the link speed. At least the active-backup and round-robin modes do not, isn't it? In such case it makes no sense to bring the link down when some slave do not report a correct link speed, since it is not used anyway. So, one possible way forward could be making the link-speed check in the bonding driver conditional on the bonding mode. This will keep the check for cases in which the bonding interface actually needs to know the link speed, which I assume is the scenario Mahesh was targeting.
You could check for dev->ieee80211_ptr being set and still let these interfaces proceed with being enslaved in a bond master network device if that's the case. This should remove the need for wireless network devices to report accurate speed/duplex, because they can't, this changes faster than on a per-packet basis, because it's... wireless :)
I agree with comment #7 by Abdo that not all bonding modes require link speed or duplex information for its slave. The bonding documentation lists link speed as explicitly needed only for 802.3ad, alb and tlb. Thus it seems sensible to take a bonding slave down due to a failed speed/duplex update only for the aforementioned modes and to leave the bonding slave up for all the other modes. That way Mahesh's not explicitly mentioned use case should remain also fixed. Assumingly with 802.3ad in mind he wants a slave to be taken down if it becomes unusable for that mode due to missing link speed or duplex information. With that in mind I submitted a patch, which fixes the issue with wireless slaves and was successfully tested with a wireless+wired active-backup bond on ArchLinux: https://www.spinics.net/lists/netdev/msg448662.html Anybody else who can confirm that the patch works for 802.3ad or wireless slaves in other modes?
The Bonding Driver Option "primary_reselect=better" does not work with a wireless interface. The bonding module does not receive the wireless link speed, so there is not enough information to determine whether the wireless link is "better". That's not the bonding module's fault, it turns out. It would be nice to see wireless drivers add support for link speed reporting, but this "if ... bond_needs_speed_duplex(bond)" approach seems like a good "quick fix".
BTW, even with something like the "if ... bond_needs_speed_duplex(bond)" fix, there is still the issue with the bonding module throwing *10 warnings per second* to the log file. That issue still needs to be addressed.
Regarding "primary_reselect=better" from comment #10: That's not a regression though. I'm certain that already before c4adfc822bf5 ("bonding: make speed, duplex setting consistent with link state") a wireless slave would never have been selected because the link speed is unknown (-1). This is fully expectable behavior as detailed in comment #8. Regarding the 10 warnings per second from comment #11: I do not consider that as pressing anymore. For modes that require link speed/duplex updates a warning is definitely needed because link speed/duplex updates should not usually fail when the link is supposed to be brought up in bond_miimon_commit() and if the slave is part of a bond that needs link speed/duplex like 802.3ad. Unlike for other bonding modes... If the update does fail during bringing the link up, is it even going to be a persistent failure over a longer period of time? If indeed so a ratelimited warning could be used instead: pr_warn_ratelimited( "%s: failed to get link speed/duplex for %s\n", bond->dev->name, slave->dev->name); But that's IMO a separate issue/patch.
> Regarding "primary_reselect=better" from comment #10: That's > not a regression though. Yes, not a regression. I am simply pointing-out that, even with the work-around, there continues to exist a reason to address speed reporting in the wireless drivers. On the linux-wireless mailing list, it has been suggested that this larger issue might best be taken-up on the netdev mailing list, or even at the yearly Netconf network developers conference. > If the update does fail during bringing the link up, is it even > going to be a persistent failure over a longer period of time? When there is "breakage", yes, as we have seen. > If indeed so a ratelimited warning could be used instead: > pr_warn_ratelimited( > "%s: failed to get link speed/duplex for %s\n", > bond->dev->name, > slave->dev->name); If you can turn that into a patch, that would be much appreciated. You might as well just fix the "log spamming" issue and be done with it. I don't see the point in trying to justify, what looks to me like, some previously existing "sloppy" coding, whether or not is might be considered a "separate issue". It certainly was not a separate issue for me, since the console had then become unusable, and a remote log-in was required! Many thanks for "stepping-up" with an actual patch. I'm hoping that someone can check the 802.3ad functionality. At the moment, I cannot. As the documentation states: Prerequisites: ... 2. A switch that supports IEEE 802.3ad Dynamic link aggregation.
Created attachment 257903 [details] Use "net_warn_ratelimited" instead of "netdev_warn()" in bond_main.c This patch applies after Andreas' patch, to replace "netdev_warn()" with "net_warn_ratelimited()".
Regarding the earlier comment of people using bonding with wireless, personally I've bonded both Ethernet adapters and the wireless adapter on this machine so that I never have to worry about how I'm connected to a network, it'll "just work" and I'll be assigned the same IP via DHCP. The latest patch appears to be responsible for a massive amount of kernel spam though. My dmesg now has the following message... bond0: failed to get link speed/duplex for wlan0 bond0: link status up for interface wlan0, enabling it in 2000 ms ... echoed once a second, every second. Suffice to say my dmesg log is now completely useless.
Patches have been submitted and queued for 4.13 and 4.12. See: https://patchwork.ozlabs.org/patch/800080/ bonding: require speed/duplex only for 802.3ad, alb and tlb Submitted by Andreas Born on Aug. 10, 2017, 4:41 a.m. https://patchwork.ozlabs.org/patch/800770/ [v2] bonding: ratelimit failed speed/duplex update warning Submitted by Andreas Born on Aug. 11, 2017, 10:36 p.m.
Hi, Those patches were included in the Arch Linux kernel and the issue is gone. We are not getting the log spammed anymore and the bonding interface works when using only the wireless interface.
Patches have now landed in the mainstream kernel, with 4.12.10.