Bug 102171 - High CPU usage and crashes probably caused by ALX driver with linux 4.1.3
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: All Linux
Importance: P1 high
Assignee: drivers_network@kernel-bugs.osdl.org
Reported: 2015-08-01 01:52 UTC by aidan
Modified: 2020-02-22 16:53 UTC
CC List: 11 users

Kernel Version: 4.1.3 and 4.2-rc4
Tree: Mainline
Regression: No


Attachments
The journalctl -k output from when this bug happens. (1.77 MB, text/plain)
2015-08-13 22:22 UTC, aidan
Manual revert of 8954672d86d0 (13.64 KB, patch)
2016-12-22 10:25 UTC, Tobias Regnery
alx_debug_reset (4.43 KB, patch)
2017-01-05 12:18 UTC, Tobias Regnery
Dmesg output with the debug patch applied (223.93 KB, application/octet-stream)
2017-01-06 16:08 UTC, eijebong

Description aidan 2015-08-01 01:52:29 UTC
High CPU usage is observed in kworker threads with both linux 4.1.3 and 4.2-rc4 (it was fine before the upgrade).  This also occasionally causes kernel freezes.  Lines like "[ xx.xxxxxx] alx 0000:08:00.0 enp8s0: fatal interrupt 0x4019607, resetting" fill up the dmesg.  As far as I am aware, nothing I'm doing specifically triggers this; it just happens on its own.

The output of lspci | grep Ethernet:
08:00.0 Ethernet controller: Qualcomm Atheros QCA8171 Gigabit Ethernet (rev 10)

Let me know if any more information would be of help, and I can get it.
Comment 1 aidan 2015-08-01 01:55:52 UTC
A report for this bug was also filed on the Arch Linux bugtracker: https://bugs.archlinux.org/task/45685
Comment 2 aidan 2015-08-04 04:05:29 UTC
This seems to be fixed in 4.2.0-rc5, although I'm still doing some testing.
Comment 3 aidan 2015-08-06 13:29:20 UTC
Nope, the issue still happens, although it might be less frequent.
Comment 4 aidan 2015-08-13 22:22:19 UTC
Created attachment 184871
The journalctl -k output from when this bug happens.

I couldn't get the dmesg output, because it did not include the beginning, so here's the journalctl -k output instead.  Also, the last working version was 4.0.7, and the next version that I tried was 4.1.2, which shows this problem.
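
For anyone going through a similar log, the reset messages can be tallied with a short script. This is only a sketch: it assumes the message format quoted in the description ("alx <pci address> <interface>: fatal interrupt 0x..., resetting"), and the name tally_fatal_interrupts is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   alx 0000:08:00.0 enp8s0: fatal interrupt 0x4019607, resetting
# The pattern is an assumption based on the messages quoted in this report.
FATAL_RE = re.compile(
    r"alx (?P<pci>[0-9a-f:.]+) (?P<iface>\S+): "
    r"fatal interrupt (?P<status>0x[0-9a-f]+), resetting"
)


def tally_fatal_interrupts(log_text):
    """Count fatal-interrupt resets per (interface, status word)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FATAL_RE.search(line)
        if match:
            counts[(match.group("iface"), match.group("status"))] += 1
    return counts


# Made-up sample in the format from the description.
SAMPLE_LOG = """\
[   10.000001] alx 0000:08:00.0 enp8s0: fatal interrupt 0x4019607, resetting
[   10.000002] alx 0000:08:00.0 enp8s0: fatal interrupt 0x4019607, resetting
[   10.000003] some unrelated kernel message
"""
for (iface, status), n in tally_fatal_interrupts(SAMPLE_LOG).most_common():
    print(f"{iface} {status}: {n}")
```

On a real system the text would come from a saved copy of the journalctl -k output rather than from the sample string.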
Comment 5 [account disabled by administrator] 2016-04-16 20:59:22 UTC
Can you try bisecting this, as you have a known good version? If you don't know how to, it's very easy and can be found by googling git bisect.
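
In case it helps, bisection is a binary search between a known-good and a known-bad revision. The sketch below shows the idea on a plain list; it is not how git implements it, and first_bad and is_bad are hypothetical names. It also assumes every good/bad verdict is reliable, which is hard to guarantee for an intermittent problem like this one.

```python
def first_bad(revisions, is_bad):
    """Return the first revision for which is_bad() is true.

    Assumes revisions[0] is good, revisions[-1] is bad, and that every
    revision after the first bad one is bad as well.
    """
    good, bad = 0, len(revisions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid   # the first bad revision is at mid or earlier
        else:
            good = mid  # the first bad revision is after mid
    return revisions[bad]


# Illustrative labels only; a real bisection walks commits, not release tags.
versions = ["4.0.7", "candidate-1", "candidate-2", "candidate-3", "4.1.2"]
print(first_bad(versions, lambda v: v in {"candidate-3", "4.1.2"}))
```

Each step halves the remaining range, so even a few thousand commits need only a dozen or so test builds. A single wrong "good" verdict, however, sends the search into the wrong half.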
Comment 6 aidan 2016-04-16 21:11:24 UTC
I haven't done it myself, but a user on the Arch Linux bugtracker (https://bugs.archlinux.org/task/45685#comment138675) bisected the problem and got 8954672d86d036643e3ce7ce3b2422c336db66d0 as the first bad commit.
Comment 7 [account disabled by administrator] 2016-04-17 04:20:22 UTC
Unfortunately I was not able to cleanly revert the possible bad commit with git. The commit link is here https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=02cea3958664723a5d2236f0f0058de97c7e4693 and for each file listed like:
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
Change the lines for that file with a + to a -. If you would feel better I will find a patch for you doing this. The file's path is from the root of the kernel source directory if you're not sure.
Comment 8 Rainer Klier 2016-12-14 13:02:51 UTC
hi,
this bug has affected me for over a year now.

i commented this to https://bugzilla.kernel.org/show_bug.cgi?id=70761
but that seems to be another bug which is fixed in the meantime.

i don't remember exactly, but i think, the latest 3.18.x worked.
at least i know, that ALL 4.x versions have that problem for me.
increasing the MTU size helped a little bit, to lower the frequency of that crash, but didn't solve it.

i am using openSUSE Tumbleweed x86_64, currently with kernel 4.9.0.
Comment 9 Tobias Regnery 2016-12-16 10:24:41 UTC
Hi, the result of the bisection looks odd because I don't see a call to one of the changed functions in the driver.

Starting with 4.9 the driver supports msi-x interrupts. This can be activated with the module parameter "msix=1". After that you should see two entries in /proc/interrupts for your ethernet adapter. Can you try if this makes any difference for you?
Comment 10 Rainer Klier 2016-12-16 11:22:42 UTC
ok, i created the file /etc/modprobe.d/50-alx.conf with the following content:
options alx msix=1

is this correct?

after rebooting i checked /proc/interrupts.
i found the following lines:
 32:        117         27         18          6         17         37         32         16  IR-PCI-MSI 2097152-edge      eth0
 33:        907        847        435        231        378        738        717        277  IR-PCI-MSI 2097153-edge      eth0-TxRx-0

is this correct?
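
A quick way to pull just the adapter's rows out of /proc/interrupts and total the per-CPU counters is sketched below. It assumes the row layout shown in this comment (IRQ number, one counter per CPU, then descriptive fields ending in the action name) and does not handle shared IRQs with several names; irq_totals is a hypothetical helper.

```python
def irq_totals(proc_interrupts_text, name_prefix):
    """Sum per-CPU counts for every IRQ whose action name starts with name_prefix."""
    totals = {}
    for line in proc_interrupts_text.splitlines():
        fields = line.split()
        # Data rows start with "<irq>:"; skip the CPU header and blank lines.
        if len(fields) < 3 or not fields[0].endswith(":"):
            continue
        name = fields[-1]
        if not name.startswith(name_prefix):
            continue
        counts = []
        for field in fields[1:]:
            if not field.isdigit():
                break  # the first non-numeric field ends the per-CPU counters
            counts.append(int(field))
        totals[name] = sum(counts)
    return totals


# The two rows posted in this comment.
SAMPLE = """\
 32:        117         27         18          6         17         37         32         16  IR-PCI-MSI 2097152-edge      eth0
 33:        907        847        435        231        378        738        717        277  IR-PCI-MSI 2097153-edge      eth0-TxRx-0
"""
print(irq_totals(SAMPLE, "eth0"))
```

On a live system the text would come from reading /proc/interrupts. Two rows for the interface (one plain, one TxRx) match what is expected with msix=1.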
Comment 11 Rainer Klier 2016-12-16 11:35:46 UTC
when checking the dmesg output for MSI i found the following:

pci 0000:04:00.0: set MSI_INTX_DISABLE_BUG flag

when googling for MSI_INTX_DISABLE_BUG i found the following:

http://lists.infradead.org/pipermail/unified-drivers/2013-March/000017.html
https://patchwork.ozlabs.org/patch/225897/

this is related to this problem. (AR8161/AR8162/AR8171/AR8172/E210X are exactly this network chip family)

but what exactly does it mean?
that enabling msi-x will not be allowed or will be disabled for these devices?
Comment 12 Tobias Regnery 2016-12-19 08:25:32 UTC
(In reply to Rainer Klier from comment #10)
> ok, i created the file /etc/modprobe.d/50-alx.conf with the following
> content:
> options alx msix=1
> 
> is this correct?
> 
> after rebooting i checked /proc/interrupts.
> i found the following lines:
>  32:        117         27         18          6         17         37      
> 32         16  IR-PCI-MSI 2097152-edge      eth0
>  33:        907        847        435        231        378        738      
> 717        277  IR-PCI-MSI 2097153-edge      eth0-TxRx-0
> 
> is this correct?

Yes, this is correct and msi-x interrupts should be used. The better way to check this is to do "sudo lspci -v". There must be a "Capabilities" line for MSI and MSI-X. The "+" or "-" sign behind "Enable" indicates which type of interrupt is in use.

Does the crash still show up?
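
The lspci check can also be scripted. The sketch below assumes the "Capabilities: [xx] MSI: Enable+/-" layout that lspci -v prints and expects the text for a single device (for example from "lspci -v -s 04:00.0"); interrupt_modes is a made-up name, and the sample lines are the ones posted in comment 14.

```python
import re

# Matches e.g. "Capabilities: [c0] MSI: Enable- Count=1/16 Maskable+ 64bit+"
CAP_RE = re.compile(r"Capabilities: \[[0-9a-f]+\] (MSI-X|MSI): Enable([+-])")


def interrupt_modes(lspci_text):
    """Map 'MSI' / 'MSI-X' to True when lspci shows 'Enable+' for it."""
    return {m.group(1): m.group(2) == "+" for m in CAP_RE.finditer(lspci_text)}


SAMPLE = """\
        Capabilities: [58] Express Endpoint, MSI 00
        Capabilities: [c0] MSI: Enable- Count=1/16 Maskable+ 64bit+
        Capabilities: [d8] MSI-X: Enable+ Count=16 Masked-
"""
print(interrupt_modes(SAMPLE))
```

A result with MSI-X set to True and MSI set to False corresponds to the msix=1 case.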
Comment 13 Tobias Regnery 2016-12-19 08:34:23 UTC
(In reply to Rainer Klier from comment #11)
> when checking the dmesg output for MSI i found the following:
> 
> pci 0000:04:00.0: set MSI_INTX_DISABLE_BUG flag
> 
> when googling for MSI_INTX_DISABLE_BUG i found the following:
> 
> http://lists.infradead.org/pipermail/unified-drivers/2013-March/000017.html
> https://patchwork.ozlabs.org/patch/225897/
> 
> this is related to this problem. (AR8161/AR8162/AR8171/AR8172/E210X are
> exactly this network chip family)
> 
> but what exactly does it mean?
> that enabling msi-x will not be allowed or will be disabled for these
> devices?

I think this has something to do with the INTx-emulation and shouldn't matter for your problem. MSI and MSI-X interrupts should work regardless of this quirk. I have the same line in dmesg for my AR8161 Card.
Comment 14 Rainer Klier 2016-12-19 08:56:56 UTC
(In reply to Tobias Regnery from comment #12)
> Yes, this is correct and msi-x interrupts should be used. The better way to
> check this is to do "sudo lspci -v". There must be a "Capabilities" line
> for MSI and MSI-X. The "+" or "-" sign behind "Enable" indicates which type
> of interrupt is in use.

ok, here is the output:
04:00.0 Ethernet controller: Qualcomm Atheros QCA8171 Gigabit Ethernet (rev 10)
        Subsystem: ASUSTeK Computer Inc. Device 200f
        Flags: bus master, fast devsel, latency 0, IRQ 19
        Memory at dd500000 (64-bit, non-prefetchable) [size=256K]
        I/O ports at d000 [size=128]
        Capabilities: [40] Power Management version 3
        Capabilities: [58] Express Endpoint, MSI 00
        Capabilities: [c0] MSI: Enable- Count=1/16 Maskable+ 64bit+
        Capabilities: [d8] MSI-X: Enable+ Count=16 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [180] Device Serial Number ff-b8-00-59-ac-22-0b-ff
        Kernel driver in use: alx
        Kernel modules: alx

so the "options alx msix=1" is working.
good!

> Does the crash still show up?

i can't say yet.
because it seems, that the crash depends on network traffic.
i will do some tests with heavy downloading to provoke the crash.
Comment 15 Rainer Klier 2016-12-20 12:54:07 UTC
ok, "options alx msix=1" does NOT fix the issue. :-(

today i checked out many files from my svn repository and suddenly i saw messages like this on all open shell windows:
"NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!"

and the whole desktop was not usable any more.
i had to switch off the computer.

after reboot i checked with "systemd-journalctl" and "tail -9000 /var/log/messages | less" the messages.
and again it was filled with
"Dec 20 12:59:08 lap051 kernel: alx 0000:04:00.0 eth0: fatal interrupt 0x4001607, resetting".

so, i will revert back to msi interrupts instead of msi-x interrupts, because with this setup, it was at least possible to shut down the computer.
with these soft lockups i was not even able to shut down normally, but i had to use the on/off button to hard switch it off.
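
The status word logged here (0x4001607) is not identical to the one in the original description (0x4019607). A small sketch to see which bits differ is below. It makes no claim about what each bit means; mapping positions to names would need the driver's register definitions, which are not quoted in this report.

```python
def set_bits(value):
    """Return the positions of the bits that are set in value (bit 0 = LSB)."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


reported_here = 0x4001607       # status word from this comment
reported_initially = 0x4019607  # status word from the original description

print("bits set here:     ", set_bits(reported_here))
print("bits set initially:", set_bits(reported_initially))
print("bits that differ:  ", set_bits(reported_here ^ reported_initially))
```

The two values share most bits and differ only in bits 15 and 16.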
Comment 16 Tobias Regnery 2016-12-22 10:25:36 UTC
Created attachment 248301
Manual revert of 8954672d86d0

Ok, so the first thing I would like to try is to revert the bad commit from the bisection in comment 6 to see if this is really the cause. Since the bad commit is a merge, I created the attached patch which manually reverts the core kernel changes from the commit plus a build fix for recent kernels. This is against vanilla 4.9.0, it boots on my machine and my AR8161 Card shows no issues after an hour of iperf traffic.

Since I can't reproduce this issue I need someone to apply patches, compile kernels and test these changes. To speed things up it would be good to find a fast reproducer for this issue. You can experiment with different network traffic and see if something specific triggers this issue reliably.
Comment 17 Rainer Klier 2016-12-22 12:58:47 UTC
(In reply to Tobias Regnery from comment #16)

> Since I can't reproduce this issue I need someone to apply patches, compile
> kernels and test these changes. To speed things up it would be good to find
> a fast reproducer for this issue. You can experiment with different network
> traffic and see if something specific triggers this issue reliably.

i could easily reproduce the crash, but i can't easily build the patched kernel for my openSUSE tumbleweed system.
i will ask tiwai@suse.com to create a patched kernel for me.
Comment 18 eijebong 2016-12-23 14:33:05 UTC
Well, my bisection is wrong... It just crashed for me with the patch applied.
Comment 19 Tobias Regnery 2017-01-05 12:18:03 UTC
Created attachment 250361
alx_debug_reset

Ok, so there are no driver changes between 4.0 and 4.4 (besides new device ids) so this must clearly be a change outside of the driver. Without a proper bisection result and a way to reproduce locally, it will be very hard for me to find the root cause...

I will try to look at the old bisection result again. We can be quite sure the bad commits are really bad and only one of the good annotations is wrong. Maybe we can isolate the range of commits which could be related.

However can you please test this debug patch (again for 4.9) and post the dmesg output? It will print the register state of the adapter on every reset, maybe there is something useful in it. Besides this the patch tries to not reset the adapter or update the hardware stats if it is down. Maybe this helps to not crash the whole machine...
Comment 20 eijebong 2017-01-06 16:08:07 UTC
Created attachment 250561
Dmesg output with the debug patch applied

So it just crashed, I attached my dmesg output with your patch applied. I hope you can make something out of it.
Comment 21 Tobias Regnery 2017-01-16 11:00:13 UTC
Unfortunately not, the adapter seems to be in a seriously broken state and the register contents can't be read out. I will try to look at the bisection result this week...
Comment 22 Jiri Slaby 2017-08-23 10:52:53 UTC
Ping -- any updates?
Comment 23 Javran Cheng 2017-10-13 04:35:38 UTC
any update on this? I perhaps have the exact same problem: what I have experienced is that when it happened, the internet stopped working, dmesg was filled with the message "alx 0000:03:00.0 eth0: fatal interrupt 0x4001607, resetting", and for some reason I don't really know, I cannot shut down the system in the normal way and I have to push and hold the power button to shut down my machine forcibly.

My card is "Ethernet controller: Qualcomm Atheros Killer E220x Gigabit Ethernet Controller (rev 13)" and the numeric id is "1969:e091". Is my problem related to this one and if so, is there something I can do to help?
Comment 24 Rainer Klier 2017-10-13 08:42:56 UTC
(In reply to Javran Cheng from comment #23)
> any update on this? I perhaps have the exact same problem: what I have
> experienced is that when it happened, the internet stopped working, dmesg
> was filled with the message "alx 0000:03:00.0 eth0: fatal interrupt
> 0x4001607, resetting", and for some reason I don't really know, I cannot
> shut down the system in the normal way and I have to push and hold the
> power button to shut down my machine forcibly.

this is EXACTLY like it is for me.

> My card is "Ethernet controller: Qualcomm Atheros Killer E220x Gigabit
> Ethernet Controller (rev 13)" and the numeric id is "1969:e091". Is my

i have similar hardware:

dmesg | grep -i alx reports: Qualcomm Atheros AR816x/AR817x Ethernet [ac:22:0b:b8:00:59]

lspci reports: 04:00.0 Ethernet controller: Qualcomm Atheros QCA8171 Gigabit Ethernet (rev 10)

and as both chips use this alx driver, we both suffer from this bug.
Comment 25 Javran Cheng 2017-12-23 06:06:08 UTC
hm, any update on this? It has been affecting us for years and when the crash happens, the whole system is rendered unusable. I'm willing to provide info or try patches if someone could give me some pointers.
Comment 26 Javran Cheng 2018-06-21 17:21:42 UTC
is there anyone interested in working on this issue? I experience this quite often but have no reliable way of reproducing it.

I think this is a serious bug as it renders the whole OS unusable until I reboot. I can provide help with investigation but we need some attention to push this forward.

Thanks!
Comment 27 Rainer Klier 2018-06-25 14:09:18 UTC
i also still suffer from this bug.
Comment 28 rprofijt 2019-11-15 18:26:09 UTC
Hi,

I also have this issue on linux 5.3.11-arch1-1 using QCA8171

Did everyone just buy different hardware?
Comment 29 Gary van der Merwe 2019-12-05 11:31:28 UTC
> Did everyone just buy different hardware?

Yip, I got myself a usb eth dongle :-(
Comment 30 Javran Cheng 2019-12-08 00:11:20 UTC
Yup, switching hardware is more practical in this case.
No one is working on this even if you are willing to help, hopeless.
Comment 31 rprofijt 2020-02-21 17:04:58 UTC
Clearing the uefi settings seems to have helped. I have not had the issue for 2 months now.
Comment 32 Rainer Klier 2020-02-22 13:02:26 UTC
(In reply to rprofijt from comment #31)
> Clearing the uefi settings seems to have helped. I have not had the issue
> for 2 months now.

what exactly did you do to "clear the uefi settings"?
are you talking about some settings/changes in the computer's BIOS?
Comment 33 rprofijt 2020-02-22 16:53:04 UTC
Resetting to default bios/uefi settings and removing the cmos battery for a few seconds. Maybe some power management setting?
Comment 34 rprofijt 2020-02-22 16:53:34 UTC
seconds = minutes
