Bug 202235 - regression: physical to VETH (LXC) network bridge after updating to 4.20.0
Summary: regression: physical to VETH (LXC) network bridge after updating to 4.20.0
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-11 22:58 UTC by Michael Evans
Modified: 2019-01-27 23:06 UTC
CC List: 3 users

See Also:
Kernel Version: 4.20.0
Tree: Mainline
Regression: No


Description Michael Evans 2019-01-11 22:58:45 UTC
I have since reverted to the working LTS kernel image offered by Arch Linux (4.19.13), but I am willing to re-test and gather additional data during a couple of lower-use time periods during the week.  
  
After updating to Linux 4.20.0 (along with a full system update otherwise) my BRIDGED network connections to some LXC containers ceased working.  
  
Attempting to troubleshoot this issue also produced extremely odd results, which I think offhand MIGHT have caused network packets to fill up some kind of memory buffer instead of being relayed or dropped; there are some additional details in the serverfault question and the LXC bug that I filed, as it was initially (and still is) unclear where the actual issue is.  
  
At this time I am unsure if it is related to netdev (bridge, veth), cgroups, or some changed default that should now be configured in a way that is different to previous defaults.  
  
https://serverfault.com/questions/947848/linux-bridge-broken-after-upgrade-out-of-ideas-places-to-look-now-4-20-0-arc  
  
https://github.com/lxc/lxc/issues/2769  
  
* It is NOT related to IP forwarding, as this is a BRIDGED connection, not a routed one, and it works on older kernels without that enabled.  
  
* physical network to bridge works (and will stay connected for a few min after later troubleshooting steps, even if ARP caches / ping flake out and stop responding)  
  
* VETH (within LXC) can ping the host IP on the bridge (but not the gateway, which the host can reach before this step) if manually assigned a static address.  Doing this seems to cause general instability and a timed-out SSH session.  This led me to reboot between each round of testing to ensure I had a clean slate to start with.  
  
I went over the major settings that I did check in the other two bug reports, but I'm open to checking other values and/or performing different kinds of tests occasionally over a given week.  Responses won't be immediate but I'll try to check on this frequently over the next two weeks.
Comment 1 Ian Kumlien 2019-01-12 10:14:22 UTC
Try applying this patch:
https://marc.info/?l=linux-netdev&m=154696956604748&w=2

It solved it for me, what qdisc do you use?
(tc qdisc will list them - I was using fq which is why it hit me)
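
For anyone wanting to check this on several machines, here is a minimal sketch (not part of the original comment) that parses text in the usual `tc qdisc show` layout and reports which interfaces use a given qdisc. The function names and the sample output are hypothetical, and the line format is assumed to be `qdisc <kind> <handle> dev <interface> ...`.

```python
import re
from collections import defaultdict

# Assumed line layout: "qdisc fq 0: dev eth0 root refcnt 2 ..."
_QDISC_LINE = re.compile(r"^qdisc\s+(?P<kind>\S+)\s+\S+\s+dev\s+(?P<dev>\S+)")


def qdiscs_by_device(tc_output):
    """Map each interface name to the list of qdisc kinds attached to it."""
    result = defaultdict(list)
    for line in tc_output.splitlines():
        match = _QDISC_LINE.match(line.strip())
        if match:
            result[match.group("dev")].append(match.group("kind"))
    return dict(result)


def devices_using(tc_output, kind="fq"):
    """Return the sorted interface names that have a qdisc of the given kind."""
    by_device = qdiscs_by_device(tc_output)
    return sorted(dev for dev, kinds in by_device.items() if kind in kinds)


# Hypothetical sample text, not captured from the reporter's system.
SAMPLE = """\
qdisc noqueue 0: dev lo root refcnt 2
qdisc fq 0: dev eth0 root refcnt 2 limit 10000p flow_limit 100p
qdisc noqueue 0: dev br0 root refcnt 2
"""

if __name__ == "__main__":
    print(devices_using(SAMPLE, "fq"))
```

To use it on a live system, the output of `tc qdisc show` can be saved to a file or captured with `subprocess.run` and passed in as `tc_output`.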
Comment 2 Michael Evans 2019-01-14 00:44:53 UTC
(In reply to Ian Kumlien from comment #1)
> Try applying this patch:
> https://marc.info/?l=linux-netdev&m=154696956604748&w=2
> 
> It solved it for me, what qdisc do you use?
> (tc qdisc will list them - I was using fq which is why it hit me)

Thank you, I can confirm that applying that single-line patch DOES make the difference and resolves the issue (for me); though as the currently published kernel versions still need this patch back-ported, this bug shouldn't be closed.

Arch Linux had a package that made testing a custom kernel build easier, but it was based on 4.20.2, so I re-tested without the patch (failed, as expected) and with it (appears to be working, as hoped).
Comment 3 Ian Kumlien 2019-01-14 08:20:56 UTC
Yeah, it didn't make 4.20.2 - It has been picked up and marked for -stable so hopefully it will be in 4.20.3 :)
Comment 4 Ian Kumlien 2019-01-14 14:59:30 UTC
FYI it's in the current pull set posted to Linus

Patch 15 in:
https://marc.info/?l=linux-netdev&m=154741526902566&w=2
Comment 5 Dragoon Aethis 2019-01-18 21:12:35 UTC
I've had a similar issue with bridged networking in QEMU (TAP networking to a bridge with an enslaved host interface), and the patch mentioned above did solve my issue (where both the VM and the host lost internet connectivity; setting the host interface down, then up again brought networking back for the host). It's not yet included in 4.20.3 either, for anyone looking for this.
Comment 6 Ian Kumlien 2019-01-25 15:13:53 UTC
It's included in: 4.20.5-rc1

So, it should be in 4.20.5 final ;)
Comment 7 Ian Kumlien 2019-01-27 23:06:04 UTC
Released and confirmed working. IMHO this bug report can be closed as fixed in 4.20.5.
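
As a rough aid for readers landing here, the version information in this thread can be sketched as a small check (not part of the original comment). It only encodes what was reported above: 4.19.13 worked, 4.20.0 through 4.20.3 were reported broken, and the patch shipped in 4.20.5. The function names are hypothetical, and treating every later release as fixed is an assumption. Distribution kernels may carry the patch earlier.

```python
import re


def parse_kernel_version(release):
    """Turn a release string such as '4.20.5-arch1-1' into (4, 20, 5).

    A missing patch level (for example '4.20') is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def lacks_fix_per_thread(release):
    """True if the release falls in the 4.20.0 to 4.20.4 range discussed here.

    Assumption: vanilla stable kernels only, with no distribution back-ports.
    """
    version = parse_kernel_version(release)
    return (4, 20, 0) <= version < (4, 20, 5)


if __name__ == "__main__":
    import platform

    running = platform.release()
    print(running, "lacks the fix" if lacks_fix_per_thread(running) else "is outside the affected range")
```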
