Bug 202235

Summary: regression: physical to VETH (LXC) network bridge after updating to 4.20.0
Product: Networking
Reporter: Michael Evans (mjevans1983)
Component: Other
Assignee: Stephen Hemminger (stephen)
Status: NEW
Severity: blocking
CC: Ian.kumlien, linux-bugzilla, mjevans1983
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.20.0
Subsystem:
Regression: No
Bisected commit-id:

Description Michael Evans 2019-01-11 22:58:45 UTC
I have since reverted to the working LTS kernel image offered by Arch Linux (4.19.13), but am willing to re-test / gather additional data during a couple of lower-use time periods during the week.  
  
After updating to Linux 4.20.0 (along with a full system update otherwise) my BRIDGED network connections to some LXC containers ceased working.  
  
Attempting to troubleshoot this issue also produced extremely odd results, which I think offhand MIGHT have caused network packets to fill up some kind of memory buffer instead of being relayed or dropped; there are some additional details in the serverfault question and LXC bug that I filed, as it was initially (and still is) unclear where the actual issue is.  
  
-  
  
At this time I am unsure if it is related to netdev (bridge, veth), cgroups, or some changed default that should now be configured in a way that is different to previous defaults.  
  
https://serverfault.com/questions/947848/linux-bridge-broken-after-upgrade-out-of-ideas-places-to-look-now-4-20-0-arc  
  
https://github.com/lxc/lxc/issues/2769  
  
* It is NOT related to IP forwarding, as this is a BRIDGED connection, not a routed one, and it works on older kernels without that enabled.  
  
* physical network to bridge works (and will stay connected for a few min after later troubleshooting steps, even if ARP caches / ping flake out and stop responding)  
  
* VETH (within LXC) can ping the host IP on the bridge if manually assigned a static address (but not the gateway, which the host can reach before this step).  Doing this seems to cause general instability and a timed-out SSH session.  This led me to reboot between each round of testing to ensure I had a clean slate to start with.  
  
I went over the major settings that I did check in the other two bug reports, but I'm open to checking other values and/or performing different kinds of tests occasionally over a given week.  Responses won't be immediate but I'll try to check on this frequently over the next two weeks.
Comment 1 Ian Kumlien 2019-01-12 10:14:22 UTC
Try applying this patch:
https://marc.info/?l=linux-netdev&m=154696956604748&w=2

It solved it for me, what qdisc do you use?
(tc qdisc will list them - I was using fq which is why it hit me)
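
For anyone who wants to check this on several hosts, here is a minimal sketch that runs `tc qdisc show` and groups the qdisc kinds by device. The function names `parse_qdiscs` and `current_qdiscs` are made up for illustration, and the parser assumes the usual `qdisc <kind> <handle> dev <device> ...` line layout.

```python
import re
import subprocess

# Matches lines such as "qdisc fq 8001: dev eth0 root ..." printed by `tc qdisc show`.
_QDISC_LINE = re.compile(r"^qdisc\s+(\S+)\s+\S+\s+dev\s+(\S+)")


def parse_qdiscs(tc_output):
    """Map each device name to the list of qdisc kinds attached to it."""
    result = {}
    for line in tc_output.splitlines():
        match = _QDISC_LINE.match(line.strip())
        if match:
            kind, dev = match.groups()
            result.setdefault(dev, []).append(kind)
    return result


def current_qdiscs():
    """Run `tc qdisc show` on this host and parse its output (needs iproute2)."""
    out = subprocess.run(
        ["tc", "qdisc", "show"], check=True, capture_output=True, text=True
    ).stdout
    return parse_qdiscs(out)
```

Calling `current_qdiscs()` on the affected host and looking for `fq` on the physical interface or the bridge ports would show whether the setup matches the one described in this comment.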
Comment 2 Michael Evans 2019-01-14 00:44:53 UTC
(In reply to Ian Kumlien from comment #1)
> Try applying this patch:
> https://marc.info/?l=linux-netdev&m=154696956604748&w=2
> 
> It solved it for me, what qdisc do you use?
> (tc qdisc will list them - I was using fq which is why it hit me)

Thank you, I can confirm that applying that single-line patch DOES make the difference and resolves the issue (for me); though as the currently published kernel versions still need this patch back-ported, this bug shouldn't be closed.

ArchLinux had a package that made testing a custom kernel build easier, but it was based on 4.20.2, so I re-tested on that version without the patch (failed, as expected) and with the patch (appears to be working, as hoped).
Comment 3 Ian Kumlien 2019-01-14 08:20:56 UTC
Yeah, it didn't make 4.20.2 - It has been picked up and marked for -stable so hopefully it will be in 4.20.3 :)
Comment 4 Ian Kumlien 2019-01-14 14:59:30 UTC
FYI it's in the current pull set posted to Linus

Patch 15 in:
https://marc.info/?l=linux-netdev&m=154741526902566&w=2
Comment 5 Dragoon Aethis 2019-01-18 21:12:35 UTC
I've had a similar issue with bridged networking in QEMU (TAP networking to a bridge with an enslaved host interface), and the patch mentioned above did solve it. In my case both the VM and the host lost internet connectivity; setting the host interface down, then up again brought networking back for the host. The patch is not yet included in 4.20.3 either, for anyone looking for this.
Comment 6 Ian Kumlien 2019-01-25 15:13:53 UTC
It's included in: 4.20.5-rc1

So, it should be in 4.20.5 final ;)
Comment 7 Ian Kumlien 2019-01-27 23:06:04 UTC
Released and confirmed working. IMHO this bug report can be closed as fixed in 4.20.5.
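
As a rough aid for anyone checking hosts against this report, the sketch below flags 4.20.x kernels older than 4.20.5, the release this thread names as carrying the fix. The names `parse_release` and `lacks_fix_in_4_20` are hypothetical, and the check assumes an unpatched upstream-style version string; a distribution kernel with the patch applied by hand would still be flagged.

```python
import re

# First release reported in this thread to contain the fix.
FIXED_IN = (4, 20, 5)


def parse_release(release):
    """Turn a release string such as '4.20.2-custom' into (4, 20, 2)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def lacks_fix_in_4_20(release):
    """True only for 4.20.x releases older than 4.20.5."""
    version = parse_release(release)
    return version[:2] == (4, 20) and version < FIXED_IN
```

On a live system the running version can be passed in with `lacks_fix_in_4_20(platform.release())` after `import platform`. Kernels outside the 4.20 series are deliberately reported as not flagged, since this thread only establishes the affected range within 4.20.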