Bug 202669 - Kernel panic in ip6_expire_frag_queue [regression]
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV6
Hardware: All Linux
Importance: P1 normal
Assignee: Hideaki YOSHIFUJI
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-24 17:28 UTC by Ralf
Modified: 2019-04-29 20:43 UTC
CC List: 2 users

See Also:
Kernel Version: 4.9.144
Subsystem:
Regression: No
Bisected commit-id:


Attachments
5 kernel stack traces from 4 servers exhibiting the issue (33.80 KB, text/plain)
2019-04-29 17:57 UTC, Heikki Hannikainen

Description Ralf 2019-02-24 17:28:14 UTC
Since a recent kernel upgrade, we have experienced kernel panics on 4 of our servers, with the backtrace ending in (manually copied from a screenshot of the panic message):

  __pskb_pull_tail
  ip6_dst_lookup_tail
  _decode_session6
  __xfrm_decode_session
  icmpv6_route_lookup
  icmp6_send
  __kmalloc_reserve
  nf_ct_net_exit
  ip6_expire_frag_queue

You can find two screenshots of the kernel panic (from the web interface of one of the affected servers) at https://imgur.com/7Teb8BV and https://imgur.com/a/eWWcyg1.

The panics usually happen a few hours after boot, but we have seen almost 2 days of stable operation once.  For now, we have downgraded all our kernels again to keep our servers stable.

These machines are all running Debian stable with Debian kernels (and all the machines with this setup are affected). The panics started happening when we upgraded from 4.9.130-2 to 4.9.144-3. At the same time, some machines were upgraded from 4.19.12-1~bpo9+1 to 4.19.16-1~bpo9+1, which introduced the same issue.
The changelog for 4.9.135 includes "ipv6: frags: rewrite ip6_expire_frag_queue()", so that'd be my first guess -- but I really have no idea.
Comment 1 Ralf 2019-04-14 11:33:15 UTC
Someone now also ran into this issue on Ubuntu: https://bugs.launchpad.net/ubuntu/+source/linux-signed/+bug/1824687
Comment 2 Heikki Hannikainen 2019-04-29 17:57:19 UTC
Created attachment 282563
5 kernel stack traces from 4 servers exhibiting the issue

I have had this crash, with the ip6_expire_frag_queue stack trace, more than 18 times since 2019-04-16 on more than 10 different servers in 8 different countries. There have been some more crashes, but for the ones counted here the panic dump made it out to a remote syslog server, where it is easy to grep. Crash count by kernel version (these are on both Ubuntu 14.04 trusty and 16.04 xenial):

2 crashes: 4.4.0-144-generic #170~14.04.1-Ubuntu
8 crashes: 4.4.0-145-generic #171-Ubuntu
8 crashes: 4.4.0-146-generic #172-Ubuntu

Downgrading to 4.4.0-143 now, as that build does not have the "ipv6: frags: rewrite ip6_expire_frag_queue()" change; it first appears in the 4.4.0-144-generic image. It should be clear by tomorrow whether that kernel is stable, as we are now seeing multiple crashes per day (last crash 50 minutes ago).

These are routers running NAT & firewall & some applications, with substantial IPv6 traffic.

Interestingly, the crashes only happen on bare hardware. We have a much larger number of VMs doing the same thing, most of them now running 4.4.0-146, and none of them have crashed like this. The hardware instances do have a larger number of CPU cores; the VMs only have 2 or 4.

I am also seeing crashes on the 4.15.0-48-generic HWE kernel running on Ubuntu 16.04 xenial, but have no stack trace to show yet.

Attaching kernel stack trace file containing several crashes on various servers (hessu-ipv6_expire_frag_queue-crashes.txt).
Comment 3 Heikki Hannikainen 2019-04-29 18:02:15 UTC
Someone has reported this same crash happening in 3-5 hours in 3 systems on Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=922488
Comment 4 Ralf 2019-04-29 20:43:45 UTC
> Someone has reported this same crash happening in 3-5 hours in 3 systems on
> Debian

That was also me.

Since then, we have upgraded a few of our systems to 4.19.28 and are no longer experiencing the issue. Seems like maybe something between 4.19.16 and 4.19.28 fixed it?
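
To narrow down where to look, the Debian kernel versions reported in this thread can be lined up per stable series to show between which pairs the behavior changed. The following is only a rough sketch: the `REPORTS` table is hand-copied from the comments above (package suffixes such as `-2` or `~bpo9+1` dropped), and the helper names `parse_version` and `transitions` are made up for illustration. It brackets candidate ranges and does not identify any commit.

```python
# Observations hand-copied from this bug thread; not an authoritative list.
REPORTS = [
    ("4.9.130", "stable"),
    ("4.9.144", "panics"),
    ("4.19.12", "stable"),
    ("4.19.16", "panics"),
    ("4.19.28", "stable"),
]


def parse_version(text):
    """Turn 'X.Y.Z' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def transitions(reports):
    """Return (older, newer, change) for adjacent reports in the same
    X.Y series whose observed status differs."""
    by_series = {}
    for text, status in reports:
        version = parse_version(text)
        by_series.setdefault(version[:2], []).append((version, text, status))

    found = []
    for series in sorted(by_series):
        entries = sorted(by_series[series])
        for (_, older, old_status), (_, newer, new_status) in zip(entries, entries[1:]):
            if old_status != new_status:
                found.append((older, newer, f"{old_status} -> {new_status}"))
    return found


for older, newer, change in transitions(REPORTS):
    print(f"{older} .. {newer}: {change}")
```

Each printed range is a candidate window for comparing stable changelogs, for example the 4.9.135 entry mentioning "ipv6: frags: rewrite ip6_expire_frag_queue()" falls inside the 4.9.130 .. 4.9.144 window.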
