Bug 199177

Summary: ASUS XG-C100C 10G Network Adapter no longer working since 4.16 (all release candidates)
Product: Drivers Reporter: Kevin (kernelbugzilla)
Component: Network Assignee: drivers_network (drivers_network)
Status: NEW ---    
Severity: normal CC: cw, irusskikh, kernelbugzilla, perry_yuan, regressions, thomas.bludau, wes
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.16.rc* Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg output from Kernel 4.15.12; NIC works properly here.
dmesg output from Kernel 4.16 rc7; NIC does not initialize / work.
dmesg output from Kernel 4.15.12; NIC working; VirtualBox removed
dmesg output from Kernel 4.16 rc7; NIC NOT working; VirtualBox removed
ethtool -d output for kernel 4.15.13
ethtool -d output for kernel 4.16
different testing log files
kernel configuration
Register dump from enp7s0
workaround patch
Production patch v1

Description Kevin 2018-03-22 17:40:55 UTC
Hi,
I've got an ASUS XG-C100C 10G network adapter.
The adapter has worked just fine in the 4.15 series kernels.
I'm currently using 4.15.10-041510 without issue.

However, every release of the 4.16 kernel in the mainline stream, which as of this moment goes up to RC6, has failed to recognize the adapter.

Here's the "proper" device details taken *when the device is working properly* in 4.15:

08:00.0 Ethernet controller: Device 1d6a:d108 (rev 02)
        Subsystem: ASUSTeK Computer Inc. Device 875b
        Flags: bus master, fast devsel, latency 0, IRQ 18
        Memory at df040000 (64-bit, non-prefetchable) [size=64K]
        Memory at df050000 (64-bit, non-prefetchable) [size=4K]
        Memory at dec00000 (64-bit, non-prefetchable) [size=4M]
        Expansion ROM at df000000 [disabled] [size=256K]
        Capabilities: [40] Express Endpoint, MSI 00
        Capabilities: [80] Power Management version 3
        Capabilities: [90] MSI-X: Enable+ Count=32 Masked-
        Capabilities: [a0] MSI: Enable- Count=1/32 Maskable- 64bit+
        Capabilities: [c0] Vital Product Data
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [180] #19
        Kernel driver in use: atlantic
        Kernel modules: atlantic

I've got an AMD Vega card and I'm eager to move to 4.16 but cannot until this bug is resolved as having no network interface is a minor productivity problem in 2018 :P

Thanks in advance
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2018-03-26 09:03:28 UTC
Please retest with rc7 (there were lots of networking fixes) and bring the issue to netdev (https://www.kernel.org/doc/Documentation/networking/netdev-FAQ.txt ) in case it still shows up; please CC me
Comment 2 Kevin 2018-03-27 15:22:52 UTC
Still broken in 4.16rc7.

Interestingly, I note the "lspci -v" information looks similar on 4.16rc7 to what I've listed above, including the "kernel driver in use"; not an expert, but that sort of looks like it's using the driver, then, so maybe something else is broken Kernel-wise other than the driver?
Comment 3 The Linux kernel's regression tracker (Thorsten Leemhuis) 2018-03-27 18:43:27 UTC
FWIW: lspci output is often not much help in cases like this; it's simply a utility for displaying information about the PCI buses in the system and the devices connected to them. It even works for devices which Linux has no driver for.
Comment 4 The Linux kernel's regression tracker (Thorsten Leemhuis) 2018-03-27 19:01:07 UTC
Could you maybe attach the dmesg output from 4.15 and 4.16-rc7 to this report? Maybe also compare the two to see whether the output from the atlantic driver changed.
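
One way to make that comparison, as a minimal sketch: save the dmesg output of each boot to a file, keep only the lines that mention the atlantic driver, strip the timestamps, and diff the results. The helper name atl_lines and the file names are assumptions for illustration, not part of any tool.

```shell
#!/bin/sh
# atl_lines FILE: print the lines of a saved dmesg log that mention the
# atlantic driver, with the leading "[   12.345678] " timestamp removed
# so that logs from two different boots can be diffed line by line.
atl_lines() {
    grep -i 'atlantic' "$1" | sed -E 's/^\[ *[0-9]+\.[0-9]+\] //'
}

# Usage (file names are hypothetical):
#   dmesg > dmesg-4.15.txt          # on the working kernel
#   dmesg > dmesg-4.16.txt          # on the failing kernel
#   atl_lines dmesg-4.15.txt > atl-4.15.txt
#   atl_lines dmesg-4.16.txt > atl-4.16.txt
#   diff -u atl-4.15.txt atl-4.16.txt
```

Stripping the timestamps matters because they differ on every boot and would otherwise make every line show up as changed.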
Comment 5 Kevin 2018-03-27 19:58:14 UTC
Created attachment 274965 [details]
dmesg output from Kernel 4.15.12; NIC works properly here.
Comment 6 Kevin 2018-03-27 19:58:48 UTC
Created attachment 274967 [details]
dmesg output from Kernel 4.16 rc7; NIC does not initialize / work.
Comment 7 Kevin 2018-03-27 19:59:23 UTC
As requested, I have uploaded full dmesg output from fresh boot, one on 4.15.12 (working), and one on 4.16 rc7 (not working).
Comment 8 The Linux kernel's regression tracker (Thorsten Leemhuis) 2018-03-27 20:13:37 UTC
Please boot both kernels without loading the VirtualBox drivers or services, they might mess things up here (as the "loading out-of-tree module taints kernel." indicates)
Comment 9 Kevin 2018-03-27 20:22:14 UTC
Created attachment 274969 [details]
dmesg output from Kernel 4.15.12; NIC working; VirtualBox removed
Comment 10 Kevin 2018-03-27 20:22:39 UTC
Created attachment 274971 [details]
dmesg output from Kernel 4.16 rc7; NIC NOT working; VirtualBox removed
Comment 11 Kevin 2018-03-27 20:23:24 UTC
As requested, see attached. Attached dmesg from both 4.15 and 4.16 with VirtualBox completely uninstalled.

Still doesn't work.

4.16 dmesg shows this at the bottom:
[   15.802524] IPv6: ADDRCONF(NETDEV_UP): enp5s0: link is not ready
Comment 12 Kevin 2018-03-27 20:26:12 UTC
Oh, and thanks for looking into this for me, Thorsten!
Comment 13 Igor Russkikh 2018-03-30 11:10:25 UTC
Hi Kevin, could you please provide the following output:

    sudo ethtool -i enp5s0

That'll help us a lot in diagnosis.
Comment 14 Christian Weinmueller 2018-04-03 09:22:04 UTC
Hello Igor, 

I'm just hopping in, as I have the same problem Kevin already reported. It looks like there was a major driver version upgrade from kernel 4.15 to 4.16 which somehow breaks something. My XG-C100C is connected to a managed 10 GbE switch, which recognizes a physical connection to the network card and auto-negotiates the link successfully. I can change the link speed manually and renegotiation succeeds for 100M/1000M/10000M, but none of this appears in the kernel log; the interface never goes up or down. Further analysis on the switch shows that the switch sends packets to the XG-C100C without getting any reply (RX counter is zero).

ethtool output: 
4.15.13:
driver: atlantic
version: 1.6.13.0-kern
firmware-version: 1.5.44
expansion-rom-version:
bus-info: 0000:07:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

4.16 (release version, not rc):
driver: atlantic
version: 2.0.2.1-kern
firmware-version: 1.5.44
expansion-rom-version:
bus-info: 0000:07:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

Further, with kernel 4.16, ethtool enp7s0 reports "Speed: Unknown!" and "Link detected: No".

Hope that helps!
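
For anyone else collecting the same data, here is a small sketch that pulls just the driver and firmware versions out of the ethtool -i output so they can be recorded per kernel. The helper name drv_fw is an assumption for illustration; it only parses the "version:" and "firmware-version:" fields shown in the output above.

```shell
#!/bin/sh
# drv_fw: read `ethtool -i` output on stdin and print the driver version
# and the firmware version on one line, separated by a space.
drv_fw() {
    awk -F': ' '
        $1 == "version"          { v = $2 }
        $1 == "firmware-version" { f = $2 }
        END { print v, f }
    '
}

# Usage (interface name taken from this report):
#   ethtool -i enp7s0 | drv_fw
```

Fed the two listings above, this prints "1.6.13.0-kern 1.5.44" for 4.15.13 and "2.0.2.1-kern 1.5.44" for 4.16, which makes it easy to see that only the driver changed while the firmware stayed the same.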
Comment 15 Igor Russkikh 2018-04-03 09:33:21 UTC
Hi Christian,

Thanks for the feedback. We are already looking into this but can't reproduce with similar XG-C100C cards.

Could you also provide the output of

    sudo ethtool -d enp7s0

Are you able to rebuild atlantic.ko with our test patch and try it out?

Thanks!
Comment 16 Christian Weinmueller 2018-04-03 09:45:00 UTC
Created attachment 275059 [details]
ethtool -d output for kernel 4.15.13
Comment 17 Christian Weinmueller 2018-04-03 09:45:26 UTC
Created attachment 275061 [details]
ethtool -d output for kernel 4.16
Comment 18 Christian Weinmueller 2018-04-03 09:48:20 UTC
Hi Igor, 

thanks for the quick reply. I attached the output of ethtool -d for both kernels, hope that helps. 

Rebuilding and testing the module with a test patch would be no problem for me.
Comment 19 Igor Russkikh 2018-04-03 09:59:44 UTC
Thanks!

Right now that's a blind guess, but could you try this:

----------------
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
index d3b847e..9d65410 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
@@ -152,7 +152,7 @@ static int hw_atl_utils_soft_reset_flb(struct aq_hw_s *self)
                return -EIO;
        }
        /* Old FW requires fixed delay after init */
-       AQ_HW_SLEEP(15);
+       AQ_HW_SLEEP(500);
 
        return 0;
 }
@@ -221,7 +221,7 @@ static int hw_atl_utils_soft_reset_rbl(struct aq_hw_s *self)
                return -EIO;
        }
        /* Old FW requires fixed delay after init */
-       AQ_HW_SLEEP(15);
+       AQ_HW_SLEEP(500);
 
        return 0;
 }
----------------

To rebuild only the module:

    make M=drivers/net/ethernet/aquantia modules
    sudo rmmod atlantic
    sudo insmod drivers/net/ethernet/aquantia/atlantic/atlantic.ko


Thanks!
Comment 20 Christian Weinmueller 2018-04-03 10:23:07 UTC
Created attachment 275063 [details]
different testing log files

Hi Igor, 

thank you for the patch. 
Sadly, it doesn't fix the issue; the interface still stays down. I attached ethtool (-i/-d) and dmesg output.
Comment 21 Igor Russkikh 2018-04-04 09:27:24 UTC
Thanks Christian.

We'd like to check a full register dump of the NIC.

Could you please:

    git clone https://github.com/Aquantia/pcimem
    cd pcimem
    make
    sudo ./pcimem /sys/class/net/enp7s0/device/resource0 0 w*0x2400 > regdump.txt

That should generate a full reg dump.

Also, please attach your kernel .config just in case.
We still can't reproduce this on the very same card.
Comment 22 Kevin 2018-04-04 15:54:11 UTC
Hopefully Christian can supply the requested register dump; I'll be unable to.

However, I thought I'd chime in with some details about my system in case it helps reproduce the issue.

I'm running Linux Mint 18.3 KDE base install, which was installed 100% clean, and I manually install the kernel updates from the mainline branch. I manually installed 4.15 when it came out and everything was working, it only went south on the 4.16. It's always possible it's some weird distro configuration issue that only happens when upgrading the Kernel, I suppose.
Comment 23 Igor Russkikh 2018-04-05 08:32:14 UTC
Kevin,

Could you please attach your .config as well. That may help.
Comment 24 Christian Weinmueller 2018-04-05 08:43:10 UTC
Created attachment 275105 [details]
kernel configuration

hello, 

i tried dumping the registers from the NIC, but the script fails with:

Error at line 112, file pcimem.c (22) [Invalid argument]

so, for now only my kernel configuration, which is basically the default configuration of debian 9.4 with aquantia drivers built as module. 

if igor has an idea why the register dump fails, i'll try again. 

Thanks
Comment 25 Igor Russkikh 2018-04-05 13:31:42 UTC
Thanks.

Strangely 4.16 does not allow sharing of pci resource regions.

Please do

    sudo rmmod atlantic

Then pcimem should succeed. However, you need to find and supply the full, correct PCI device path, something like this:

    /sys/devices/pci0000:00/0000:00:1d.0/0000:02:00.0/resource0
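
A possible way to find that path, sketched under the assumption that it is resolved before the driver is unloaded (once atlantic is removed, the /sys/class/net entry for the interface is gone). The helper name pci_resource0 and its optional second argument, which exists only so the function can be tried against a test directory, are assumptions for illustration.

```shell
#!/bin/sh
# pci_resource0 IFACE [SYSFS_ROOT]: print the fully resolved path of the
# resource0 file (BAR 0) of the PCI device behind network interface IFACE.
# SYSFS_ROOT defaults to /sys.
pci_resource0() {
    root=${2:-/sys}
    dev=$(readlink -f "$root/class/net/$1/device") || return 1
    printf '%s/resource0\n' "$dev"
}

# Usage (interface name from this report; pcimem arguments as given earlier):
#   res=$(pci_resource0 enp7s0)     # resolve while the driver is still loaded
#   sudo rmmod atlantic
#   sudo ./pcimem "$res" 0 w*0x2400 > regdump.txt
```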
Comment 26 Christian Weinmueller 2018-04-05 13:49:43 UTC
Created attachment 275109 [details]
Register dump from enp7s0

Hi, 

found it and made the dump. 

Hope that helps!
Comment 27 Igor Russkikh 2018-04-05 15:22:15 UTC
Thanks Chris!

We see your card has a suspicious flash image configuration. We'd like to take a full dump of your NIC's flash for analysis.

We may send you some tools for that. Can we do that through your email?

That'll help us a lot!
Comment 28 Christian Weinmueller 2018-04-05 15:39:19 UTC
Sure, thank you. 

Do you need the serial number of the NIC? Maybe a whole batch of ASUS cards is affected. I have a second one in my Windows workstation, so I could check that one too.
Comment 29 Igor Russkikh 2018-04-06 06:50:45 UTC
Christian, I sent you instructions by email to @127null01.com
Thanks in advance!
Comment 30 Christian Weinmueller 2018-04-06 10:04:46 UTC
Hi Igor, 

sorry for the delay, it was already late here. i just sent the mail - should be in your mailbox right now!
Comment 31 Wes 2018-04-09 03:36:14 UTC
Experiencing this as well with 4.16.1.  Card never gets a carrier
Comment 32 Igor Russkikh 2018-04-09 08:21:36 UTC
Hi Wes, please let us know your system configuration.
It seems that's a NIC/driver/hardware incompatibility.
Comment 33 Wes 2018-04-09 08:46:59 UTC
Hi Igor, 

One AMD Threadripper 1700x host and one Intel 4770K host running Gentoo Linux. Both work fine with kernel 4.15.15 and 4.15.16; it's only when upgrading to 4.16.1 today that they stopped working. After reverting to the older kernel they return to operational status. They appear to initialise, the driver loads and the kernel creates an interface (enp5s0 in both cases); it just never reports having carrier/link and the kernel believes it's 'down'.
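
For reference, a minimal sketch of how that state can be read straight from sysfs, which may be handy for comparing a good and a bad boot. The helper name link_state and its optional second argument (used only to point it at a test directory) are assumptions for illustration. As far as I know, reading the carrier file fails while the interface is administratively down, so the sketch reports "unknown" in that case instead of erroring out.

```shell
#!/bin/sh
# link_state IFACE [SYSFS_ROOT]: print the kernel's view of the interface,
# taken from the operstate and carrier files under /sys/class/net/IFACE.
# SYSFS_ROOT defaults to /sys. Unreadable files are reported as "unknown".
link_state() {
    root=${2:-/sys}
    d="$root/class/net/$1"
    oper=$(cat "$d/operstate" 2>/dev/null || echo unknown)
    carrier=$(cat "$d/carrier" 2>/dev/null || echo unknown)
    printf 'operstate=%s carrier=%s\n' "$oper" "$carrier"
}

# Usage (interface name from this report):
#   link_state enp5s0
```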
Comment 34 Wes 2018-04-10 00:33:58 UTC
Er, sorry, that should be Threadripper 1900x
Comment 35 Kevin 2018-04-10 00:42:52 UTC
i7-8700k here; based on Wes' response, that probably rules out processor architecture, as he's listed both AMD 1900x and Intel 4770K.
Comment 36 Wes 2018-04-10 00:59:23 UTC
Igor, I'm curious what the outcome of that 'suspicious flash image configuration' investigation was as mine are also the ASUS branded cards.  Is this issue suspected to be due to something ASUS did differently with these cards?
Comment 37 Wes 2018-04-10 03:10:09 UTC
I've just tried building and testing atlantic.ko using every revision in git between the v4.15 and v4.16 tags and it appears to be working on 4.15.16.

driver: atlantic
version: 2.0.2.1-kern  (was 1.6.13.0-kern for the in-tree driver on 4.15.16)

I think this issue is more a compatibility issue between the atlantic driver code and the 4.16 kernel itself than anything else, or a kernel bug in 4.16?
Comment 38 Wes 2018-04-10 03:26:50 UTC
The commits I tested are below. Note that I haven't tried these on 4.16 yet, only on 4.15. Testing on 4.16 will have to wait until later tonight or this weekend.

d4c242d4ba5730b62579969804cd8fcf58b9c84f to efe779b749cc9da0f36a01fba38c98864e6b8748
working

4948293ff963e5451a8f0c21be8f1dfc2c7f65f5 and 8fcb98f462e6504e6d1ab2dab87c6db803c206b6
Driver probe function unexpectedly returned 4

23ee07ad3c2fd5adf6e9ef21afb9aec489dc3b4e to 76c19c6cfa8f7e4f8c7d5407f77237b80095e5d9     
Bad FW version detected: expected=0, actual=105002c
Probe of 0000:05:00.0 failed with error -95

0c58c35f02c2e99bb10137b32e8ec96dcbdcc705 to 6a91ded32d6c8a6d0aee1928bb741e31577af24f (latest on GitHub, included in the v4.16 tag but doesn't seem to be in the 4.16 or 4.16.1 source tarballs)
working
Comment 39 Igor Russkikh 2018-04-10 09:16:26 UTC
Wes, Kevin,

The culprit commit is:

c8c82eb net: aquantia: Introduce global AQC hardware reset sequence

It's not directly revertible, so I'm attaching a simple patch that does the same revert.

Please note this is not a production-ready patch; we are still looking for an acceptable final solution.

A cold reset should be performed after applying the patch, otherwise the NIC will stay in a bad state.
Comment 40 Igor Russkikh 2018-04-10 09:17:11 UTC
Created attachment 275269 [details]
workaround patch

Cold reboot after applying!
Comment 41 Igor Russkikh 2018-04-10 09:37:20 UTC
The direct, detailed root cause is:

========================================

On the ASUS XG-C100C with 1.5.44 firmware, a special mode called "dirty wake" is active. In this mode, when the motherboard receives power (but the system has not been powered on yet), the NIC automatically brings up a power-save link and watches for a WOL packet.

This normally allows the PC to be powered up after AC power failures.

Not all motherboards or BIOS settings supply power to the PCI slots, so this mode is not enabled on all hardware.

This mode also never gets enabled if the "Start on AC power" BIOS setting is active (i.e. when the PC starts up as soon as AC power is there).

The 4.16 Linux driver introduces a full hardware reset sequence with commit c8c82eb. This was required because before that no NIC hardware reset was implemented, and there were side effects of a "not clean start".

But this full reset is incompatible with the "dirty wake" WOL feature: it leaves the PHY link in a special mode forever. As a consequence, the driver sees no link and no traffic.

========================================

We are now working out the best way to clean this up. The above patch is a temporary workaround.
Comment 42 Wes 2018-04-10 10:02:17 UTC
Thanks for the explanation Igor, I guess I missed this in my testing because I wasn't cold rebooting the machine between each build.

I'll test out the patch shortly.

Thanks again
Comment 43 Igor Russkikh 2018-04-10 11:20:30 UTC
Created attachment 275271 [details]
Production patch v1

In the meantime, here is a better and stable patch. It is planned for upstream.
Comment 44 Thomas Bludau 2018-04-11 11:38:50 UTC
(In reply to Igor Russkikh from comment #43)
> Created attachment 275271 [details]
> Production patch v1
> 
> In the meantime, here is a better and stable patch. This is planned for the
> upstream.

Worked for me, thanks a lot!
Comment 45 Igor Russkikh 2018-04-16 10:14:20 UTC
Fix is in netdev tree (expected in 4.17-rc1). Queued to 4.16 stable.
Comment 46 Wes 2018-04-25 02:17:51 UTC
Any idea when this will land in 4.16? I don't see it mentioned in the 4.16.2, 4.16.3 or 4.16.4 changelogs.
Comment 47 Kevin 2018-04-25 02:52:22 UTC
I second Wes' question, and add one additional question to it - when will it be in 4.16, and when will 4.16 go back to installing properly with signed packages?

Actually, I went to try installing 4.16.4 and I notice the mainline kernel branch only has unsigned kernel copies. It also has a new "modules" package. I took the chance (I'm on EFI, so this was asking for trouble), but the 4.16.4 packages won't install on Linux Mint 18.3. They error out and leave the package manager in a broken state.

Same problem with the 4.17.x branch - can't install, unsigned. What the frig.
Comment 48 Igor Russkikh 2018-04-25 07:34:37 UTC
> when this will land in 4.16

That depends on when davem will push the networking patchset to 4.16.x.
You may track it here:
http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
Comment 49 Wes 2018-05-01 13:13:44 UTC
FYI, it appears to be included in 4.16.6 as per
https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.16.6

Thanks again
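
For anyone who wants to check other stable releases the same way, here is a minimal sketch that searches a downloaded changelog for entries mentioning the driver. The helper name changelog_has is an assumption for illustration; the download step is shown only as a comment and uses the URL above.

```shell
#!/bin/sh
# changelog_has FILE PATTERN: print every line of FILE that contains
# PATTERN (case-insensitive), prefixed with its line number. The exit
# status is non-zero when nothing matches, as with grep itself.
changelog_has() {
    grep -i -n -- "$2" "$1"
}

# Usage:
#   curl -sO https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.16.6
#   changelog_has ChangeLog-4.16.6 aquantia
```

Repeating the same search against the changelog of an earlier point release should come back empty if the fix had not landed yet.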