Bug 102171
Summary: | High CPU usage and crashes probably caused by ALX driver with linux 4.1.3 | ||
---|---|---|---|
Product: | Drivers | Reporter: | aidan |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | NEW --- | ||
Severity: | high | CC: | aidan, bastienphilbert, eijebong, feng.tang, garyvdm, jirislaby, o.freyermuth, rainer.klier, rien, szg00000, tobias.regnery |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.1.3 and 4.2-rc4 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | The journalctl -k output from when this bug happens.; Manual revert of 8954672d86d0; alx_debug_reset; Dmesg output with the debug patch applied | ||
Description
aidan
2015-08-01 01:52:29 UTC
A bug report for this bug was also filed on the Arch Linux bug tracker: https://bugs.archlinux.org/task/45685
This seems to be fixed in 4.2.0-rc5, although I'm still doing some testing.
Nope, the issue still happens, although it might be less frequent.
Created attachment 184871
The journalctl -k output from when this bug happens.
I couldn't get the dmesg output, because it did not include the beginning, so here is the journalctl -k output instead. Also, the last working version was 4.0.7, and the next version that I tried, 4.1.2, shows this problem.
Can you try bisecting this, since you have a known good version? If you don't know how to do it, it's very easy and can be found by googling git bisect.
I haven't done it myself, but a user on the Arch Linux bug tracker (https://bugs.archlinux.org/task/45685#comment138675) bisected the problem and got 8954672d86d036643e3ce7ce3b2422c336db66d0 as the first bad commit.
Unfortunately I was not able to revert the possible bad commit cleanly with git. The commit link is here: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=02cea3958664723a5d2236f0f0058de97c7e4693 For each file listed like:
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
change the lines for that file with a + to a -. If you would feel better, I will find a patch for you that does this. The file's path is relative to the root of the kernel source directory, if you're not sure.
hi, this bug has affected me for over a year now. i commented on https://bugzilla.kernel.org/show_bug.cgi?id=70761 but that seems to be another bug, which has been fixed in the meantime. i don't remember exactly, but i think the latest 3.18.x worked. at least i know that ALL 4.x versions have that problem for me. increasing the MTU size helped a little bit to lower the frequency of that crash, but didn't solve it. i am using openSUSE Tumbleweed x86_64, currently with kernel 4.9.0.
Hi, the result of the bisection looks odd because I don't see a call to one of the changed functions in the driver. Starting with 4.9 the driver supports msi-x interrupts. This can be activated with the module parameter "msix=1". After that you should see two entries in /proc/interrupts for your ethernet adapter. Can you try if this makes any difference for you?
ok, i created the file /etc/modprobe.d/50-alx.conf with the following content:
options alx msix=1
is this correct?
after rebooting i checked /proc/interrupts.
i found the following lines:
32: 117 27 18 6 17 37 32 16 IR-PCI-MSI 2097152-edge eth0
33: 907 847 435 231 378 738 717 277 IR-PCI-MSI 2097153-edge eth0-TxRx-0
is this correct?
when checking the dmesg output for MSI i found the following:
pci 0000:04:00.0: set MSI_INTX_DISABLE_BUG flag
when googling for MSI_INTX_DISABLE_BUG i found the following:
http://lists.infradead.org/pipermail/unified-drivers/2013-March/000017.html
https://patchwork.ozlabs.org/patch/225897/
this is related to this problem. (AR8161/AR8162/AR8171/AR8172/E210X are exactly this network chip family)
but what exactly does it mean? that enabling msi-x will not be allowed or will be disabled for these devices?
(In reply to Rainer Klier from comment #10)
> ok, i created the file /etc/modprobe.d/50-alx.conf with the following
> content:
> options alx msix=1
>
> is this correct?
>
> after rebooting i checked /proc/interrupts.
> i found the following lines:
> 32: 117 27 18 6 17 37
> 32 16 IR-PCI-MSI 2097152-edge eth0
> 33: 907 847 435 231 378 738
> 717 277 IR-PCI-MSI 2097153-edge eth0-TxRx-0
>
> is this correct?
Yes, this is correct and msi-x interrupts should be used. The better way to check this is to do "sudo lspci -v". There must be a "Capabilities" line for MSI and MSI-X. The "+" or "-"-sign behind "Enable" indicates which type of interrupts are in use. Does the crash still show up?
(In reply to Rainer Klier from comment #11)
> when checking the dmesg output for MSI i found the following:
>
> pci 0000:04:00.0: set MSI_INTX_DISABLE_BUG flag
>
> when googling for MSI_INTX_DISABLE_BUG i found the following:
>
> http://lists.infradead.org/pipermail/unified-drivers/2013-March/000017.html
> https://patchwork.ozlabs.org/patch/225897/
>
> this is related to this problem. (AR8161/AR8162/AR8171/AR8172/E210X are
> exactly this network chip family)
>
> but what exactly does it mean?
> that enabling msi-x will not be allowed or will be disabled for these
> devices?
I think this has something to do with the INTx emulation and shouldn't matter for your problem. MSI and MSI-X interrupts should work regardless of this quirk. I have the same line in dmesg for my AR8161 card.
(In reply to Tobias Regnery from comment #12)
> Yes, this is correct and msi-x interrupts should be used. The better way to
> check this is to do "sudo lspci -v". There must be a "Capabilities" line
> for MSI and MSI-X. The "+" or "-"-sign behind "Enable" indicates which type
> of interrupts are in use.
ok, here is the output:
04:00.0 Ethernet controller: Qualcomm Atheros QCA8171 Gigabit Ethernet (rev 10)
Subsystem: ASUSTeK Computer Inc. Device 200f
Flags: bus master, fast devsel, latency 0, IRQ 19
Memory at dd500000 (64-bit, non-prefetchable) [size=256K]
I/O ports at d000 [size=128]
Capabilities: [40] Power Management version 3
Capabilities: [58] Express Endpoint, MSI 00
Capabilities: [c0] MSI: Enable- Count=1/16 Maskable+ 64bit+
Capabilities: [d8] MSI-X: Enable+ Count=16 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [180] Device Serial Number ff-b8-00-59-ac-22-0b-ff
Kernel driver in use: alx
Kernel modules: alx
so the "options alx msix=1" is working. good!
> Does the crash still show up?
i can't say yet, because it seems that the crash depends on network traffic. i will do some tests with heavy downloading to provoke the crash.
ok, "options alx msix=1" does NOT fix the issue. :-( today i checked out many files from my svn repository and suddenly i saw messages like this in all open shell windows: "NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!" and the whole desktop was not usable any more. i had to switch off the computer. after reboot i checked the messages with "systemd-journalctl" and "tail -9000 /var/log/messages | less", and again they were filled with "Dec 20 12:59:08 lap051 kernel: alx 0000:04:00.0 eth0: fatal interrupt 0x4001607, resetting".
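The lspci check discussed above (the "+" or "-" after "Enable" on the MSI and MSI-X capability lines) could be automated roughly as follows. This is only a sketch: the function name `enabled_interrupt_modes` is made up for illustration, and the regular expression assumes the capability lines look like the two quoted in this thread.

```python
import re

# Matches capability lines such as
#   "Capabilities: [c0] MSI: Enable- Count=1/16 Maskable+ 64bit+"
#   "Capabilities: [d8] MSI-X: Enable+ Count=16 Masked-"
# Assumption: the line layout is the one shown in the lspci output above.
CAP_RE = re.compile(
    r"Capabilities:\s*\[[0-9a-f]+\]\s*(MSI-X|MSI):\s*Enable([+-])"
)


def enabled_interrupt_modes(lspci_text):
    """Map "MSI" / "MSI-X" to True (Enable+) or False (Enable-).

    Hypothetical helper: only capabilities that appear in the text are
    included in the result.
    """
    modes = {}
    for name, flag in CAP_RE.findall(lspci_text):
        modes[name] = (flag == "+")
    return modes


# The two relevant lines from the output posted in this thread.
SAMPLE = """\
Capabilities: [58] Express Endpoint, MSI 00
Capabilities: [c0] MSI: Enable- Count=1/16 Maskable+ 64bit+
Capabilities: [d8] MSI-X: Enable+ Count=16 Masked-
"""

modes = enabled_interrupt_modes(SAMPLE)
```

For the sample, the MSI entry comes back disabled and the MSI-X entry enabled, which matches the conclusion that "options alx msix=1" took effect. The text passed in would normally be the saved output of "sudo lspci -v".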
so, i will revert back to msi interrupts instead of msi-x interrupts, because with that setup it was at least possible to shut down the computer. with these soft lockups i was not even able to shut down normally; i had to use the on/off button to hard switch it off.
Created attachment 248301
Manual revert of 8954672d86d0
Ok, so the first thing I would like to try is to revert the bad commit from the bisection in comment 6 to see if this is really the cause. Since the bad commit is a merge, I created the attached patch which manually reverts the core kernel changes from the commit plus a build fix for recent kernels. This is against vanilla 4.9.0; it boots on my machine and my AR8161 card shows no issues after an hour of iperf traffic.
Since I can't reproduce this issue, I need someone to apply patches, compile kernels and test these changes. To speed things up it would be good to find a fast reproducer for this issue. You can experiment with different network traffic and see if something specific triggers this issue reliably.
(In reply to Tobias Regnery from comment #16)
> Since I can't reproduce this issue I need someone to apply patches, compile
> kernels and test these changes. To speed things up it would be good to find
> a fast reproducer for this issue. You can experiment with different network
> traffic and see if something specific triggers this issue reliably.
i could easily reproduce the crash, but i can't easily build the patched kernel for my openSUSE tumbleweed system. i will ask tiwai@suse.com to create a patched kernel for me.
Well, my bisection is wrong... It just crashed for me with the patch applied.
Created attachment 250361
alx_debug_reset
Ok, so there are no driver changes between 4.0 and 4.4 (besides new device IDs), so this must clearly be a change outside of the driver. Without a proper bisection result and a way to reproduce locally, it will be very hard for me to find the root cause...
I will try to look at the old bisection result again. We can be quite sure the bad commits are really bad and only one of the good annotations is wrong. Maybe we can isolate the range of commits which could be related.
However, can you please test this debug patch (again for 4.9) and post the dmesg output? It will print the register state of the adapter on every reset; maybe there is something useful in it. Besides this, the patch tries not to reset the adapter or update the hardware stats if it is down. Maybe this helps to not crash the whole machine...
Created attachment 250561
Dmesg output with the debug patch applied
So it just crashed. I attached my dmesg output with your patch applied. I hope you can make something out of it.
Unfortunately not, the adapter seems to be in a seriously broken state and the content of the registers can't be read out. I will try to look at the bisection result this week...
Ping -- any updates?
any update on this? I perhaps have the exact same problem: what I have experienced is that when it happened, the internet stopped working, dmesg was filled with message "alx 0000:03:00.0 eth0: fatal interrupt 0x4001607, resetting", and for some reason I don't really know, I cannot shut down the system in the normal way and I have to push and hold the power button to shut down my machine forcibly. My card is "Ethernet controller: Qualcomm Atheros Killer E220x Gigabit Ethernet Controller (rev 13)" and the numeric id is "1969:e091". Is my problem related to this one and if so, is there something I can do to help?
(In reply to Javran Cheng from comment #23)
> any update on this? I perhaps have the exact same problem: what I have
> experienced is that when it happened, the internet stopped working, dmesg
> was filled with message "alx 0000:03:00.0 eth0: fatal interrupt 0x4001607,
> resetting", and for some reason I don't really know, I cannot shutdown the
> system in the normal way and I have to push and hold power button to
> shutdown my machine forcibly.
this is EXACTLY like it is for me.
> My card is "Ethernet controller: Qualcomm Atheros Killer E220x Gigabit
> Ethernet Controller (rev 13)" and the numeric id is "1969:e091". Is my
i have similar hardware:
dmesg | grep -i alx reports:
Qualcomm Atheros AR816x/AR817x Ethernet [ac:22:0b:b8:00:59]
lspci reports:
04:00.0 Ethernet controller: Qualcomm Atheros QCA8171 Gigabit Ethernet (rev 10)
and as both chips use this alx driver, we suffer from this bug.
hm, any update on this? It has been affecting us for years, and when the crash happens, the whole system is rendered unusable. I'm willing to provide info or try patches if someone could give me some pointers.
is there anyone interested in working on this issue?
I experience this quite often but with no reliable way of reproducing it. I think this is a serious bug as it renders the whole OS unusable until I reboot. I can help with the investigation, but we need some attention to push this forward. Thanks!
i also still suffer from this bug.
Hi, I also have this issue on linux 5.3.11-arch1-1 using QCA8171. Did everyone just buy different hardware?
> Did everyone just buy different hardware?
Yip, I got myself a usb eth dongle :-(
Yup, switching hardware is more practical in this case. No one is working on this even if you are willing to help, hopeless.
Clearing the uefi settings seemed to have helped. did not have the issue for 2 months now.
(In reply to rprofijt from comment #31)
> Clearing the uefi settings seemed to have helped. did not have the issue for
> 2 months now.
what exactly did you do to "clear the uefi settings"? are you talking about some settings/changes in the computer's BIOS?
Resetting to default bios/uefi settings and removing the cmos battery for a few seconds. Maybe some power management setting?
seconds = minutes
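For anyone still looking for the fast reproducer requested earlier in the thread, it may help to count how often the "fatal interrupt ... resetting" message appears in a saved kernel log while trying different kinds of network traffic. The following is a rough sketch only: the function name `count_fatal_interrupts` is hypothetical, and the pattern assumes the message format of the two log lines quoted in this thread.

```python
import re
from collections import Counter

# Matches driver messages such as
#   "alx 0000:04:00.0 eth0: fatal interrupt 0x4001607, resetting"
# Assumption: the format is the one quoted in this thread; other kernel
# versions may word the message differently.
FATAL_RE = re.compile(
    r"alx (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<iface>\S+): fatal interrupt (?P<status>0x[0-9a-f]+), resetting"
)


def count_fatal_interrupts(log_lines):
    """Count fatal-interrupt resets per (PCI address, interface, status).

    Hypothetical helper: log_lines is any iterable of text lines, for
    example an open file holding saved "journalctl -k" output.
    """
    counts = Counter()
    for line in log_lines:
        match = FATAL_RE.search(line)
        if match:
            key = (match.group("pci"), match.group("iface"), match.group("status"))
            counts[key] += 1
    return counts
```

Comparing the counts between test runs (for example, a large svn checkout versus an iperf run) would give a crude measure of which traffic pattern triggers the resets most reliably.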