Since a recent kernel upgrade, we have experienced kernel panics on 4 of our servers, with the backtrace ending in (manually copied from a srceenshot of the panic message):
You can find two screenshots of the kernel panic (from the webinterface of one of the affected servers) at https://imgur.com/7Teb8BV and https://imgur.com/a/eWWcyg1.
The panics usually happen a few hours after boot, but we have seen almost 2 days of stable operation once. For now, we have downgraded all our kernels again to keep our servers stable.
These machines are all running Debian stable with Debian kernels (and all the machines with this setup are affected). The panics started happening when we upgraded from 4.9.130-2 to 4.9.144-3. Some machines got, at the same time, upgraded from 4.19.12-1~bpo9+1 to 4.19.16-1~bpo9+1, which also introduced the same issue.
The changelog for 4.9.135 includes "ipv6: frags: rewrite ip6_expire_frag_queue()", so that'd be my first guess -- but I really have no idea.
Someone now also ran into this issue on Ubuntu: https://bugs.launchpad.net/ubuntu/+source/linux-signed/+bug/1824687
Created attachment 282563 [details]
5 kernel stack traces from 4 servers exhibiting the issue
I have had this crash, with the ip6_expire_frag_queue stack trace, more than 18 times since 2019-04-16 on more than 10 different servers in 8 different countries. There have been some more crashes, but from these ones the panic dump managed to go out to a remote syslog server where it's easy to grep. Crash count by kernel version; these are on both Ubuntu 14.04 trusty and 16.04 xenial:
2 crashes: 4.4.0-144-generic #170~14.04.1-Ubuntu
8 crashes: 4.4.0-145-generic #171-Ubuntu
8 crashes: 4.4.0-146-generic #172-Ubuntu
Downgrading to 4.4.0-143 now, as that build does not have the "ipv6: frags: rewrite ip6_expire_frag_queue()" change; it first appears in 4.4.0-144-generic image. I think by tomorrow it's clear whether that kernel is stable as we're now having multiple crashes per day (last crash 50 minutes ago).
These are routers running NAT & firewall & some applications, with substantial IPv6 traffic.
Interestingly the crashes only happen on bare hardware. We have a much larger number of VMs doing the same thing, most of them now running 4.4.0-146, and none of them have crashed like this. The hardware instances do have a larger number of CPU cores, the VMs only have 2 or 4.
I am also seeing crashes on 4.15.0-48-generic hwe kernel running on Ubuntu 16.04 xenial, but no stack trace to show yet.
Attaching kernel stack trace file containing several crashes on various servers (hessu-ipv6_expire_frag_queue-crashes.txt).
Someone has reported this same crash happening in 3-5 hours in 3 systems on Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=922488
> Someone has reported this same crash happening in 3-5 hours in 3 systems on
That was also me.
We have since then upgraded a few of our systems to 4.19.28, and are not experiencing the issue any more. Seems like maybe something between 4.19.16 and 4.19.28 fixed it?