Bug 196547

Summary: Since 4.12 - bonding module not working with wireless drivers
Product: Networking Reporter: James (james)
Component: WirelessAssignee: networking_wireless (networking_wireless)
Status: RESOLVED CODE_FIX    
Severity: high CC: abdo.roig, andy, ben, f.fainelli, futur.andy, gigadoc2+linux, grazzolini, marius_reg, salinasv
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.12 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Use "net_warn_ratelimited" instead of "netdev_warn()" in bond_main.c

Description James 2017-07-31 13:25:35 UTC
Going from linux 4.11 to 4.12, currently 4.12.4, the bonding module received a patch,

 bonding: fix active-backup transition
 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit?id=3f3c278c94dd994fe0d9f21679ae19b9c0a55292

which requires correct reporting of link speed using the kernel net/core/ethtool.c
__ethtool_get_link_ksettings()
giving an error with all wireless drivers tested, presumably at
581 err = dev->ethtool_ops->get_settings(dev, &cmd);

The consequences are two-fold.

Note:
drivers/net/bonding/bond_main.c

if (bond_update_speed_duplex(slave)) {
slave->link = BOND_LINK_DOWN;
netdev_warn(bond->dev,
"failed to get link speed/duplex for %s\n",
slave->dev->name);
continue;
}

1) While the wireless drivers work perfectly well alone, and the wired network interfaces continue to work with the bonding module, when used in conjunction with the bonding module, a wireless interface will be put into the "down" state, and will not work with the bonding module.

2) This "bond_update_speed_duplex(slave)" function executes 10 times per second, and a) the log file will be "spammed" with "failed to get link speed/duplex for blah" warnings continuously, 10 times per second, and b) these log messages may be sent to the console, 10 times per second, effectively creating a "Denial of Service" at the console. A remote terminal is then needed to reconfigure networking, to remove the wireless slave from the bonding module.

After two weeks, Mahesh Bandewar, the Author of this patch has been entirely unresponsive to direct email.  Unless Mahesh has been on vacation, or suffering some life crisis, I will infer that he is permanently MIA.  There also has been no response from Matthew Wilcox <matthew@wil.cx>, the Author of the kernel ethtool, which the bonding module uses to communicate with the wireless drivers.

Other people are beginning to complain on the Arch bug tracker about the bugs caused by this patch.  This patch, as it is, simply breaks wireless bonding!  The bonding module is commonly used to provide automatic switching between wired and wireless networking on laptop/notebook computers.

It seems to me that, at least, Mahesh's patch should be reverted immediately, to get wireless bonding working again.  If some additional fix can be created quickly, that would be nice, but wireless bonding should not be left broken.

It is not entirely clear whether the proper fix for the "fix active-backup transition" problem involves only changes to the bonding module, or also changes to the kernel ethtool, or to all the wireless drivers.  But that determination should be made *after* Mahesh's patch has been reverted and reworked.
Comment 1 Andy Gospodarek 2017-07-31 13:45:25 UTC
James, can you please provide a link to the bugtracker for Arch Linux?

While bonding with wireless isn't necessarily something that there was an expectation that people would want to use, we should be able to come up with something to satisfy both you and Mahesh.

I was also hoping that Mahesh would have a chance to respond to this before jumping-in and pushing for a revert since this was clearly fixing something that was critical to him and his infrastructure.
Comment 2 James 2017-07-31 16:18:08 UTC
Arch Linux
FS#54922 - [linux] 4.12 - bonding module not working with wireless

https://bugs.archlinux.org/task/54922

BTW, from the patch, Mahesh says:
"link-state transition is used to change the active link ... Also when speed and duplex settings for a link are updated during a link-event, the link-status should not be changed to invoke correct transition logic."

Andy, please maybe translate that, or contextualize that statement, so that we can understand the alleged problem, tersely summarized at "The above mentioned patch broke that logic."  I'm trying to understand the original problem "an attempt to keep slave state consistent with speed and duplex settings.", the attempted "fix", "This patch fixes this issue by moving the link-state update outside of the bond_update_speed_duplex() fn and to the places where this fn is called and update link-state selectively.", and how it was reasoned that this "fix" should be allowed to, or was required to, disable wireless networking, and why it should need to add warnings to the log file at a rate of 10 time per second.  At first glance, none of that is comprehensible.  There is too much missing and presumed information, and/or too much dependence upon jargon.  It really does not seem to have been thought-through, or I'm not able to follow the reasoning.
Comment 3 Giancarlo Razzolini 2017-07-31 16:22:41 UTC
Hi,

I'm an arch linux developer/trusted user. I have been using this setup of bonding with wired/wireless with active failover for 5 years now. This update completely broke this setup. I should mention that the wireless card is fully functional, outside of the context of the bonding interface. I run wpa_supplicant on it, and it is able to join networks. But, as soon as the bond interface is running with no active wireless interface, connectivity stops and the bond interface goes down. Not to mention the logging spamming that happens every second the wireless interface is up and a member of the bonding interface.
Comment 4 James 2017-07-31 17:02:09 UTC
For reference, the earlier commit, c4adfc822bf5, "bonding: make speed, duplex setting consistent with link state":

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=c4adfc822bf5d8e97660b6114b5a8892530ce8cb

"bond_update_speed_duplex() retrieves speed and duplex settings. There
is a possibility of failure in retrieving these values but caller has
to assume it's always successful. This leads to having inconsistent
slave link settings. If these (speed, duplex) values cannot be
retrieved, then keeping the link UP causes problems.

The updated bond_update_speed_duplex() returns 0 on success if it
retrieves sane values for speed and duplex. On failure it returns 1
and marks the link down."

That comment does not say what problems are being caused by keeping the wireless link in an UP state.

And, that comment does not say whether "If these (speed, duplex) values cannot be retrieved" is a problem caused by the wireless drivers, by the kernel ethtool, or by some failing in the bonding module.

But then, regardless, "On failure it returns 1 and marks the link down." seems quite inappropriate, just "nuking" the wireless interface.  What/where is the actual cause of the failure to properly report speed and duplex from the wireless drivers?

If this is a real problem with the wireless drivers, then absolutely they should be made to properly report speed and duplex, and the log warning is appropriate.  I have seen wireless link speed report failures with some wireless utilities.  But in no case is a continuous 10 times per second log warning appropriate!
Comment 5 Giancarlo Razzolini 2017-07-31 17:12:57 UTC
This should fail open, not fail closed. Unless there can be a guarantee that all the wireless drivers return the proper value when requested the speed/duplex settings, the bonding driver should accommodate for this, in the case the interface is a wireless one.

The logging is clearly wrong, but it's mostly a nuisance. The fact the bonding if is brought down because of this is the main issue.
Comment 6 Abdó Roig-Maranges 2017-08-08 07:03:07 UTC
*** Bug 196531 has been marked as a duplicate of this bug. ***
Comment 7 Abdó Roig-Maranges 2017-08-08 07:18:34 UTC
Not all bonding modes use the link speed. At least the active-backup and round-robin modes do not, isn't it? In such case it makes no sense to bring the link down when some slave do not report a correct link speed, since it is not used anyway.

So, one possible way forward could be making the link-speed check in the bonding driver conditional on the bonding mode. This will keep the check for cases in which the bonding interface actually needs to know the link speed, which I assume is the scenario Mahesh was targeting.
Comment 8 Florian Fainelli 2017-08-08 22:07:15 UTC
You could check for dev->ieee80211_ptr being set and still let these interfaces proceed with being enslaved in a bond master network device if that's the case.

This should remove the need for wireless network devices to report accurate speed/duplex, because they can't, this changes faster than on a per-packet basis, because it's... wireless :)
Comment 9 Andreas Born 2017-08-10 11:29:38 UTC
I agree with comment #7 by Abdo that not all bonding modes require link speed or duplex information for its slave. The bonding documentation lists link speed as explicitly needed only for 802.3ad, alb and tlb. Thus it seems sensible to take a bonding slave down due to a failed speed/duplex update only for the aforementioned modes and to leave the bonding slave up for all the other modes. That way Mahesh's not explicitly mentioned use case should remain also fixed. Assumingly with 802.3ad in mind he wants a slave to be taken down if it becomes unusable for that mode due to missing link speed or duplex information.

With that in mind I submitted a patch, which fixes the issue with wireless slaves and was successfully tested with a wireless+wired active-backup bond on ArchLinux: https://www.spinics.net/lists/netdev/msg448662.html

Anybody else who can confirm that the patch works for 802.3ad or wireless slaves in other modes?
Comment 10 James 2017-08-10 14:20:53 UTC
The Bonding Driver Option "primary_reselect=better" does not work with a wireless interface.  The bonding module does not receive the wireless link speed, so there is not enough information to determine whether the wireless link is "better".  That's not the bonding module's fault, it turns out.

It would be nice to see wireless drivers add support for link speed reporting, but this "if ... bond_needs_speed_duplex(bond)" approach seems like a good "quick fix".
Comment 11 James 2017-08-10 14:39:36 UTC
BTW, even with something like the "if ... bond_needs_speed_duplex(bond)" fix, there is still the issue with the bonding module throwing *10 warnings per second* to the log file.  That issue still needs to be addressed.
Comment 12 Andreas Born 2017-08-10 16:03:41 UTC
Regarding "primary_reselect=better" from comment #10: That's not a regression though. I'm certain that already before c4adfc822bf5 ("bonding: make speed, duplex setting consistent with link state") a wireless slave would never have been selected because the link speed is unknown (-1). This is fully expectable behavior as detailed in comment #8.


Regarding the 10 warnings per second from comment #11: I do not consider that as pressing anymore. For modes that require link speed/duplex updates a warning is definitely needed because link speed/duplex updates should not usually fail when the link is supposed to be brought up in bond_miimon_commit() and if the slave is part of a bond that needs link speed/duplex like 802.3ad. Unlike for other bonding modes...
If the update does fail during bringing the link up, is it even going to be a persistent failure over a longer period of time? If indeed so a ratelimited warning could be used instead:
pr_warn_ratelimited(
		"%s: failed to get link speed/duplex for %s\n",
		bond->dev->name,
	    slave->dev->name);

But that's IMO a separate issue/patch.
Comment 13 James 2017-08-10 18:29:58 UTC
> Regarding "primary_reselect=better" from comment #10: That's
> not a regression though.

Yes, not a regression.  I am simply pointing-out that, even with the work-around, there continues to exist a reason to address speed reporting in the wireless drivers.  On the linux-wireless mailing list, it has been suggested that this larger issue might best be taken-up on the netdev mailing list, or even at the yearly Netconf network developers conference.

> If the update does fail during bringing the link up, is it even
> going to be a persistent failure over a longer period of time?

When there is "breakage", yes, as we have seen.

> If indeed so a ratelimited warning could be used instead:
> pr_warn_ratelimited(
>                "%s: failed to get link speed/duplex for %s\n",
>                bond->dev->name,
>            slave->dev->name);

If you can turn that into a patch, that would be much appreciated.  You might as well just fix the "log spamming" issue and be done with it.  I don't see the point in trying to justify, what looks to me like, some previously existing "sloppy" coding, whether or not is might be considered a "separate issue".  It certainly was not a separate issue for me, since the console had then become unusable, and a remote log-in was required!

Many thanks for "stepping-up" with an actual patch.  I'm hoping that someone can check the 802.3ad functionality.  At the moment, I cannot.  As the documentation states:

 Prerequisites:
 ...
 2. A switch that supports IEEE 802.3ad Dynamic link aggregation.
Comment 14 James 2017-08-12 19:16:14 UTC
Created attachment 257903 [details]
Use "net_warn_ratelimited" instead of "netdev_warn()" in bond_main.c

This patch applies after Andreas' patch, to replace "netdev_warn()" with "net_warn_ratelimited()".
Comment 15 Benjamin Hodgetts 2017-08-16 16:46:44 UTC
Regarding the earlier comment of people using bonding with wireless, personally I've bonded both Ethernet adapters and the wireless adapter on this machine so that I never have to worry about how I'm connected to a network, it'll "just work" and I'll be assigned the same IP via DHCP.

The latest patch appears to be responsible for a massive amount of kernel spam though. My dmesg now has the following message...

bond0: failed to get link speed/duplex for wlan0
bond0: link status up for interface wlan0, enabling it in 2000 ms

... echoed once a second, every second. Suffice to say my dmesg log is now completely useless.
Comment 16 James 2017-08-16 19:53:09 UTC
Patches have been submitted and queued for 4.13 and 4.12.

See:


https://patchwork.ozlabs.org/patch/800080/
bonding: require speed/duplex only for 802.3ad, alb and tlb
Submitted by Andreas Born on Aug. 10, 2017, 4:41 a.m.


https://patchwork.ozlabs.org/patch/800770/
[v2] bonding: ratelimit failed speed/duplex update warning
Submitted by Andreas Born on Aug. 11, 2017, 10:36 p.m.
Comment 17 Giancarlo Razzolini 2017-08-23 11:30:25 UTC
Hi,

Those patches were included in the Arch Linux kernel and the issue is gone. We are not getting the log spammed anymore and the bonding interface works when using only the wireless interface.
Comment 18 James 2017-09-01 01:58:34 UTC
Patches have now landed in the mainstream kernel, with 4.12.10.