Bug 201063 - kernel panic on heavy network use
Summary: kernel panic on heavy network use
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: Stephen Hemminger
Depends on:
Reported: 2018-09-09 13:45 UTC by oyvinds
Modified: 2018-11-15 23:30 UTC
CC List: 3 users

See Also:
Kernel Version: 4.19rc2
Regression: No
Bisected commit-id:

RIP: native_smp_send_rechedule, what did they mean by this (3.72 MB, image/jpeg)
2018-09-09 13:45 UTC, oyvinds
Call trace rip dev_hard_start_xmit (4.23 MB, image/jpeg)
2018-09-10 02:07 UTC, oyvinds
Another system freeze on 4.19rc6(right before rc7) (4.07 MB, image/jpeg)
2018-10-08 00:40 UTC, oyvinds
Kernel Panic 4.19 RC 7 (3.05 MB, image/jpeg)
2018-10-08 16:15 UTC, oyvinds
another fine picture of a kernel panic (3.50 MB, image/jpeg)
2018-10-11 04:42 UTC, oyvinds
4.19.0-rc8-SunnySNSD-00119-g270b77a0f30e Kernel Panic (3.74 MB, image/jpeg)
2018-10-21 15:19 UTC, oyvinds

Description oyvinds 2018-09-09 13:45:28 UTC
Created attachment 278379
RIP: native_smp_send_rechedule, what did they mean by this

The kernel panics, apparently when heavy network traffic is going through the box. There is nothing in the logs, so I took a picture of the screen showing the kernel panic; it is attached.
Comment 1 oyvinds 2018-09-10 02:07:11 UTC
Created attachment 278391
Call trace rip dev_hard_start_xmit

Call trace photo, since it locked up and froze again.
Comment 2 oyvinds 2018-10-08 00:40:31 UTC
Created attachment 278943
Another system freeze on 4.19rc6(right before rc7)

Been running this box with 4.18.8 and it's fine, tried 4.19 again at rc6 and it hung after perhaps half a day.

The box has four RTL8111/8168/8411 (r8169) NICs. Two of those are in an 802.3ad bond, and that bond along with the other two cards is in a bridge. There's also an Intel I211 (igb) NIC.
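
For anyone trying to reproduce this topology, a rough iproute2 sketch follows. The interface names (enp1s0 and so on) and the bond0/br0 names are placeholders, since the report does not give the real ones, and the actual box may well have been configured through a distro tool instead.

```shell
#!/bin/sh
# Placeholder interface names; the report does not state the real ones.
BOND_PORTS="enp1s0 enp2s0"      # two r8169 ports for the 802.3ad bond
BRIDGE_PORTS="enp3s0 enp4s0"    # the other two r8169 ports

# run: execute a command, or only print it when DRY_RUN=1
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

setup_topology() {
    run ip link add bond0 type bond mode 802.3ad
    for p in $BOND_PORTS; do
        run ip link set "$p" down          # take the port down before enslaving it
        run ip link set "$p" master bond0
    done
    run ip link add br0 type bridge
    for p in bond0 $BRIDGE_PORTS; do
        run ip link set "$p" master br0
        run ip link set "$p" up
    done
    run ip link set br0 up
}
```

Calling setup_topology with DRY_RUN=1 only prints the commands; without it they run for real and need root.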

Doesn't seem to be anything saved to disk when it crashes so all I have is a picture of the screen (which probably isn't that useful but that's all I've got).
Comment 3 Heiner Kallweit 2018-10-08 09:52:44 UTC
Please re-test with rc7. Your problem description is more or less the same as addressed by the following fix: ad5f97faff42 ("r8169: fix network stalls due to missing bit TXCFG_AUTO_FIFO")
Comment 4 Heiner Kallweit 2018-10-08 09:58:25 UTC
I just saw in your latest screenshot that the issue also seems to occur with traffic on the Intel card, so the root cause may not be in the r8169 driver.
Comment 5 oyvinds 2018-10-08 16:15:06 UTC
Created attachment 278957
Kernel Panic 4.19 RC 7

It seemed fine for some hours with 4.19rc7 and didn't crash, so naturally I tried copying 40 GB from the NAS box to another box while downloading kpop shows at 100 Mbit/s (10 MB/s) to another box. The router/firewall/NAS hadn't crashed after five minutes, so I started make -j7 bzImage on the NAS box for good measure, and some minutes later it died with the kernel panic.

Regardless, this remains a problem with the latest git kernel.
Comment 6 Heiner Kallweit 2018-10-09 08:45:28 UTC
Thanks for the update. The GPF is also triggered from igb_poll, so I think it's not a problem with a particular network driver, but deeper in the network stack.
There have been some recent changes to SKB handling which might be related, so I'm adding David to the CC list.
Comment 7 oyvinds 2018-10-11 04:42:15 UTC
Created attachment 278989
another fine picture of a kernel panic

Tested a bit and it looks like the box can do networking for a while if it's very idle. It's real easy to make it crash pretty immediately by downloading a big file that uses all the Internet speed on a box on the LAN while doing something like make -j7 bzImage on the box. It can't network and compute at the same time.

Perhaps I'll just use 4.18.8 forever. I like that one since it works.
Comment 8 oyvinds 2018-10-21 15:19:53 UTC
Created attachment 279113
4.19.0-rc8-SunnySNSD-00119-g270b77a0f30e Kernel Panic

Today's git, 4.19.0-rc8-SunnySNSD-00119-g270b77a0f30e, still crashes after some time.
Comment 9 oyvinds 2018-10-25 15:34:31 UTC
The box appears to work fine with 4.19 (it has only been two days, so that may not hold) after removing the 4 r8169 cards and replacing them with an ancient quad port Intel Pro e1000e card. This would indicate that the problem is either with the r8169 driver or elsewhere but triggered by those cards. The motherboard Intel NIC using igb has always been in this box.

The bad news for debugging this is that I'm not about to put those Realtek r8169 cards back in the box just to test future kernels for this problem. Doing so doesn't seem especially useful if there's no information to be acquired beyond taking a photograph of the screen when it freezes. The 4xRealtek card setup did work fine with kernel 4.18.x, so it's clearly a regression.
Comment 10 Heiner Kallweit 2018-10-25 16:06:37 UTC
It's worth mentioning that the same error (GPF in dev_hard_start_xmit) also happens with traffic over the onboard Intel adapter, see screenshot "Another system freeze on 4.19rc6(right before rc7)" where igb_poll is involved.

It may be that the network driver triggers the problem, but the root cause seems to be deeper in the network stack. The GPF and values dead...100 in R12 and R15 make me think that the function tries to access a poisoned list entry pointer.
Comment 11 oyvinds 2018-10-25 17:03:01 UTC
I'll try stressing the current setup with some network and IO load and see what happens, and avoid rebooting it to see how much uptime is possible with the box's current configuration. If it crashes without the r8169 cards then something else is clearly wrong, and if it doesn't then that may not mean all that much either.

Now the igb Intel is upstream and the quad Intel PRO is set up as a bond in a bridge with the other ports.

Regardless of this, a link to some kind of howto for getting anything beyond a photo of the screen when it crashes would be useful.
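
One option that might help with this is netconsole, which sends kernel messages over UDP to another machine. A rough sketch follows; the ports, IP addresses, MAC address and device name are all made-up placeholders, and because the crash sits in the network path there is no guarantee the oops actually makes it onto the wire.

```shell
#!/bin/sh
# Build the netconsole= module parameter.
# Format: netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values, not taken from this report.
PARAM=$(netconsole_param 6665 192.168.1.1 eth0 6666 192.168.1.2 00:11:22:33:44:55)

# Print the command to run as root on the crashing box.
echo "modprobe netconsole $PARAM"
```

The receiving machine then needs something listening on the chosen UDP port (6666 in this sketch), for example a netcat variant or a syslog daemon.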
Comment 12 Oleksandr Natalenko 2018-10-26 20:24:59 UTC
I've just described a similar issue here: [1].

[1] http://lkml.iu.edu/hypermail/linux/kernel/1810.3/02392.html
Comment 13 Oleksandr Natalenko 2018-10-26 21:02:27 UTC
Could you please try to test with GRO disabled? It seems it works around the issue for me.
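
For reference, GRO can usually be switched off per interface with ethtool. A minimal sketch, assuming ethtool is installed and the commands are run as root; the interface names passed in are whatever the box actually uses (eth0 and eth1 below are placeholders).

```shell
#!/bin/sh
# Disable GRO on every interface given as an argument, then show the setting.
# With DRY_RUN=1 the commands are printed instead of executed.
disable_gro() {
    for iface in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ethtool -K $iface gro off"
        else
            ethtool -K "$iface" gro off &&
                ethtool -k "$iface" | grep generic-receive-offload
        fi
    done
}
```

Usage would be something like disable_gro eth0 eth1. The setting does not survive a reboot, so it has to be reapplied after each boot while testing.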
Comment 14 Heiner Kallweit 2018-10-28 22:56:16 UTC
Confirmed to be an issue with GRO handling, see here for the fix:
I expect the fix to be included in 4.19.1.
