Bug 201063
Summary: | kernel panic on heavy network use | ||
---|---|---|---|
Product: | Networking | Reporter: | oyvinds |
Component: | Other | Assignee: | Stephen Hemminger (stephen) |
Status: | RESOLVED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | davem, hkallweit1, oleksandr |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.19rc2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
RIP: native_smp_send_rechedule, what did they mean by this
Call trace rip dev_hard_start_xmit
Another system freeze on 4.19rc6(right before rc7)
Kernel Panic 4.19 RC 7
another fine picture of a kernel panic
4.19.0-rc8-SunnySNSD-00119-g270b77a0f30e Kernel Panic |
Created attachment 278391 [details]
Call trace rip dev_hard_start_xmit
Call trace photo, since it locked up and froze again.
Created attachment 278943 [details]
Another system freeze on 4.19rc6(right before rc7)
Been running this box with 4.18.8 and it's fine, tried 4.19 again at rc6 and it hung after perhaps half a day.
The box has 4 RTL8111/8168/8411 (r8169) NICs; two of those are in an 802.3ad bond, and that bond along with the other two cards is in a bridge. There's also an Intel I211 (igb) NIC.
Doesn't seem to be anything saved to disk when it crashes so all I have is a picture of the screen (which probably isn't that useful but that's all I've got).
Please re-test with rc7. Your problem description is more or less the same as the one addressed by the following fix:
ad5f97faff42 ("r8169: fix network stalls due to missing bit TXCFG_AUTO_FIFO")
I just saw in your latest screenshot that the issue also seems to happen with traffic on the Intel card, so the root cause may not be in the r8169 driver.
Created attachment 278957 [details]
Kernel Panic 4.19 RC 7
It seemed fine for some hours with 4.19rc7 and didn't crash, so naturally I tried copying 40 GB from the NAS box to another box while downloading kpop shows at 100 Mbit/s (about 10 MB/s) to another box. The router/firewall/NAS hadn't crashed after five minutes, so I started make -j7 bzImage on the NAS box for good measure, and some minutes later it died with the kernel panic.
Regardless, this remains a problem with the latest git kernel.
Thanks for the update. The GPF is also triggered from igb_poll, so I think it's not a problem with a particular network driver, but deeper in the network stack. There have been some recent changes to SKB handling which might be related. I'm therefore adding David to the CC list.
Created attachment 278989 [details]
another fine picture of a kernel panic
Tested a bit and it looks like the box can do networking for a while if it's very idle. It's real easy to make it crash pretty immediately by downloading a big file that uses all the Internet speed on a box on the LAN while doing something like make -j7 bzImage on the box. It can't network and compute at the same time.
Perhaps I'll just use 4.18.8 forever. I like that one since it works.
Created attachment 279113 [details]
4.19.0-rc8-SunnySNSD-00119-g270b77a0f30e Kernel Panic
Today's git (4.19.0-rc8-SunnySNSD-00119-g270b77a0f30e) still crashes after some time.
The box appears to work fine with 4.19 (it has only been two days, so that may not hold) after removing the 4 r8169 cards and replacing them with an ancient quad-port Intel PRO e1000e card. This would indicate that the problem is either in the r8169 driver or elsewhere but triggered by those cards. The motherboard's onboard Intel NIC using igb has always been in this box. The bad news for debugging this is that I'm not about to put those Realtek r8169 cards back in the box just to test future kernels for this problem; doing so doesn't seem especially useful if there's no information to be acquired beyond taking a photograph of the screen when it freezes. The 4xRealtek card setup did work fine with kernel 4.18.x, so it's clearly a regression.
It is worth mentioning that the same error (GPF in dev_hard_start_xmit) also happens with traffic over the onboard Intel adapter; see the screenshot "Another system freeze on 4.19rc6(right before rc7)", where igb_poll is involved. It may be that the network driver triggers the problem, but the root cause seems to be deeper in the network stack. The GPF and the values dead...100 in R12 and R15 make me think that the function tries to access a poisoned list entry pointer.
I'll try stressing the current setup with some network and IO load and see what happens, and avoid rebooting it to see how much uptime is possible with the box's current configuration. If it crashes without the r8169 cards then clearly something else is wrong, and if it doesn't then that may not mean all that much either. Now the igb Intel is upstream and the quad Intel PRO is a bond in a bridge with the other ports. Regardless of this, it would be useful to have a link to some kind of howto for getting anything beyond a photo of the screen when it crashes.
I've just described a similar issue here: [1].
[1] http://lkml.iu.edu/hypermail/linux/kernel/1810.3/02392.html
Could you please try to test with GRO disabled? It seems to work around the issue for me.
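For anyone wanting to try the same workaround: GRO (generic receive offload) can be switched off per interface with `ethtool -K <iface> gro off`. The following is a minimal sketch that builds, and optionally runs, that command for a list of interfaces. The interface names are placeholders and are not taken from this report; actually running the commands requires root and the ethtool binary.

```python
import subprocess


def gro_command(iface, enable):
    """Return the ethtool argv that switches GRO on or off for one interface."""
    return ["ethtool", "-K", iface, "gro", "on" if enable else "off"]


def set_gro(ifaces, enable, dry_run=True):
    """Build (and, unless dry_run, execute) the GRO toggle for each interface.

    Returns the list of argv lists so the caller can log what was done.
    """
    commands = [gro_command(iface, enable) for iface in ifaces]
    if not dry_run:
        for argv in commands:
            # Needs root; raises CalledProcessError if ethtool reports failure.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical interface names; substitute the ones on the affected box.
    for argv in set_gro(["eth0", "eth1"], enable=False):
        print(" ".join(argv))
```

The current state can be checked with `ethtool -k <iface>` (lowercase k), which lists `generic-receive-offload`. Settings made this way do not persist across a reboot.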
Confirmed to be an issue with GRO handling; see here for the fix: https://lkml.org/lkml/2018/10/28/195
I expect the fix to be included in 4.19.1.
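A note on the register values mentioned earlier in this thread: on x86-64 with the default CONFIG_ILLEGAL_POINTER_VALUE, list_del() overwrites the removed entry's next and prev pointers with LIST_POISON1 and LIST_POISON2, which in 4.19 are 0xdead000000000100 and 0xdead000000000200. These addresses are non-canonical, so dereferencing them raises a general protection fault rather than an ordinary page fault. A dead...100 pattern in a register is consistent with LIST_POISON1. The helper below is a hypothetical illustration for classifying a value read off a crash screen; the sample values are constructed and are not copied from the attachments.

```python
# Constants as defined in include/linux/poison.h for 4.19 on x86-64
# (assumes the default CONFIG_ILLEGAL_POINTER_VALUE of 0xdead000000000000).
POISON_POINTER_DELTA = 0xDEAD000000000000
LIST_POISON1 = POISON_POINTER_DELTA + 0x100  # stored in entry->next by list_del()
LIST_POISON2 = POISON_POINTER_DELTA + 0x200  # stored in entry->prev by list_del()


def classify_pointer(value):
    """Classify a 64-bit register value from an oops as list poison or not."""
    if value == LIST_POISON1:
        return "LIST_POISON1"
    if value == LIST_POISON2:
        return "LIST_POISON2"
    if value >> 48 == 0xDEAD:
        # Same poison range, e.g. a poison base plus a struct member offset.
        return "poison-range"
    return "other"


if __name__ == "__main__":
    # Constructed sample values for illustration only.
    for reg in (0xDEAD000000000100, 0xDEAD000000000200, 0xFFFF888000001000):
        print(hex(reg), classify_pointer(reg))
```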
Created attachment 278379 [details]
RIP: native_smp_send_rechedule, what did they mean by this
Kernel panics; it seems to happen when there is heavy network traffic going through that box. There is nothing in the logs. I took a picture of the screen with the kernel panic; it is attached.