Hi, I've got an ASUS XG-C100C 10G network adapter. The adapter has worked just fine in the 4.15 series kernels. I'm currently using 4.15.10-041510 without issue. However, every release of the 4.16 kernel in the mainline stream, which as of this moment goes up to RC6, has failed to recognize the adapter.

Here's the "proper" device details taken *when the device is working properly* in 4.15:

08:00.0 Ethernet controller: Device 1d6a:d108 (rev 02)
	Subsystem: ASUSTeK Computer Inc. Device 875b
	Flags: bus master, fast devsel, latency 0, IRQ 18
	Memory at df040000 (64-bit, non-prefetchable) [size=64K]
	Memory at df050000 (64-bit, non-prefetchable) [size=4K]
	Memory at dec00000 (64-bit, non-prefetchable) [size=4M]
	Expansion ROM at df000000 [disabled] [size=256K]
	Capabilities: [40] Express Endpoint, MSI 00
	Capabilities: [80] Power Management version 3
	Capabilities: [90] MSI-X: Enable+ Count=32 Masked-
	Capabilities: [a0] MSI: Enable- Count=1/32 Maskable- 64bit+
	Capabilities: [c0] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [150] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [180] #19
	Kernel driver in use: atlantic
	Kernel modules: atlantic

I've got an AMD Vega card and I'm eager to move to 4.16 but cannot until this bug is resolved as having no network interface is a minor productivity problem in 2018 :P

Thanks in advance
Please retest with rc7 (there were lots of networking fixes) and bring the issue to netdev (https://www.kernel.org/doc/Documentation/networking/netdev-FAQ.txt ) in case it still shows up; please CC me
Still broken in 4.16rc7. Interestingly, I note the "lspci -v" information looks similar on 4.16rc7 to what I've listed above, including the "kernel driver in use"; not an expert, but that sort of looks like it's using the driver, then, so maybe something else is broken Kernel-wise other than the driver?
FWIW: lspci output in cases like this often is not much help; it's simply a utility for displaying information about PCI buses in the system and the devices connected to them. It even works for devices which Linux has no driver for.
Could you maybe attach the dmesg output from 4.15 and 4.16-rc7 to this report? Maybe try to compare the two if the output from the atlantic driver changed
Created attachment 274965 [details] dmesg output from Kernel 4.15.12; NIC works properly here.
Created attachment 274967 [details] dmesg output from Kernel 4.16 rc7; NIC does not initialize / work.
As requested, I have uploaded full dmesg output from fresh boot, one on 4.15.12 (working), and one on 4.16 rc7 (not working).
Please boot both kernels without loading the VirtualBox drivers or services, they might mess things up here (as the "loading out-of-tree module taints kernel." indicates)
Created attachment 274969 [details] dmesg output from Kernel 4.15.12; NIC working; VirtualBox removed
Created attachment 274971 [details] dmesg output from Kernel 4.16 rc7; NIC NOT working; VirtualBox removed
As requested, see attached. Attached dmesg from both 4.15 and 4.16 with VirtualBox completely uninstalled. Still doesn't work. 4.16 dmesg shows this at the bottom: [ 15.802524] IPv6: ADDRCONF(NETDEV_UP): enp5s0: link is not ready
Oh, and thanks for looking into this for me, Thorsten!
Hi Kevin, could you please provide the following output:

sudo ethtool -i enp5s0

That'll help us a lot in diagnosis.
Hello Igor, I'm just jumping in, as I have the same problem Kevin already reported. It looks like there was a major driver version upgrade from kernel 4.15 to 4.16 which somehow breaks something.

My XG-C100C is connected to a managed 10 GbE switch, which recognizes a physical connection to the network card and does a successful auto-negotiation for the link. I can change the link speed manually, and renegotiation is successful for 100M/1000M/10000M, but none of this appears in the kernel log; the interface never goes up or down. Further analysis on the switch shows that packets are sent from the switch to the XG-C100C without any reply (RX counter is zero).

ethtool output:

4.15.13:
driver: atlantic
version: 1.6.13.0-kern
firmware-version: 1.5.44
expansion-rom-version:
bus-info: 0000:07:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

4.16 (release version, not rc):
driver: atlantic
version: 2.0.2.1-kern
firmware-version: 1.5.44
expansion-rom-version:
bus-info: 0000:07:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

Further, with kernel 4.16, "ethtool enp7s0" reports "Speed: Unknown!" and "Link detected: No".

Hope that helps!
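For anyone else comparing the two kernels, here is a minimal sketch that shows only the fields that differ between two saved "ethtool -i" outputs. The function name and the file names are hypothetical, and it assumes each output was saved to a file while booted into the respective kernel.

```shell
#!/bin/sh
# Print only the lines that differ between two saved `ethtool -i` outputs.
# Usage: ethtool_diff OLD_FILE NEW_FILE   (helper name is hypothetical)
ethtool_diff() {
    # diff exits non-zero when the files differ, which is the expected case
    # here; keep only the "<" (old) and ">" (new) lines.
    diff "$1" "$2" | grep '^[<>]' || true
}

# Example (file names are hypothetical):
#   sudo ethtool -i enp7s0 > ethtool-4.15.txt   # booted into the working kernel
#   sudo ethtool -i enp7s0 > ethtool-4.16.txt   # booted into the failing kernel
#   ethtool_diff ethtool-4.15.txt ethtool-4.16.txt
```

With the two outputs quoted above, only the "version:" line should differ, which matches the observation that the firmware is the same and only the driver changed.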
Hi Christian,

Thanks for the feedback. We are already looking into this but can't reproduce it with similar XG-C100C cards. Could you also provide the output of:

sudo ethtool -d enp7s0

Are you able to rebuild atlantic.ko with our test patch and check it out?

Thanks!
Created attachment 275059 [details] ethtool -d output for kernel 4.15.13
Created attachment 275061 [details] ethtool -d output for kernel 4.16
Hi Igor, thanks for the quick reply. I attached the output of ethtool -d for both kernels, hope that helps. Rebuilding and testing the module with a test patch would be no problem for me.
Thanks! Right now that's a blind guess, but could you try this:

----------------
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
index d3b847e..9d65410 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_utils.c
@@ -152,7 +152,7 @@ static int hw_atl_utils_soft_reset_flb(struct aq_hw_s *self)
 		return -EIO;
 	}
 	/* Old FW requires fixed delay after init */
-	AQ_HW_SLEEP(15);
+	AQ_HW_SLEEP(500);
 
 	return 0;
 }
@@ -221,7 +221,7 @@ static int hw_atl_utils_soft_reset_rbl(struct aq_hw_s *self)
 		return -EIO;
 	}
 	/* Old FW requires fixed delay after init */
-	AQ_HW_SLEEP(15);
+	AQ_HW_SLEEP(500);
 
 	return 0;
 }
----------------

To rebuild only the module:

make M=drivers/net/ethernet/aquantia modules
sudo rmmod atlantic
sudo insmod drivers/net/ethernet/aquantia/atlantic/atlantic.ko

Thanks!
Created attachment 275063 [details] different testing log files Hi Igor, thank you for the patch. sadly, it doesn't fix the issue, the interface still stays down. i attached ethtool (-i/-d) and dmesg output.
Thanks Christian. We'd like to check full register dump of the NIC. Could you please:

git clone https://github.com/Aquantia/pcimem
cd pcimem
make
sudo ./pcimem /sys/class/net/enp7s0/device/resource0 0 w*0x2400 > regdump.txt

That should generate a full reg dump. Also, please attach your kernel .config just in case. We still can't reproduce this on the very same card (
Hopefully Igor can supply the requested RegDump, I'll be unable to. However, I thought I'd chime in with some details about my system in case it helps reproduce the issue. I'm running Linux Mint 18.3 KDE base install, which was installed 100% clean, and I manually install the kernel updates from the mainline branch. I manually installed 4.15 when it came out and everything was working, it only went south on the 4.16. It's always possible it's some weird distro configuration issue that only happens when upgrading the Kernel, I suppose.
Kevin, Could you please attach your .config as well. That may help.
Created attachment 275105 [details] kernel configuration

hello, i tried dumping the registers from the NIC, but the script fails with:

Error at line 112, file pcimem.c (22) [Invalid argument]

so, for now only my kernel configuration, which is basically the default configuration of debian 9.4 with aquantia drivers built as module. if igor has an idea why the register dump fails, i'll try again.

Thanks
Thanks. Strangely, 4.16 does not allow sharing of PCI resource regions. Please do:

sudo rmmod atlantic

Then pcimem should succeed. However, you will need to find and pass the full, correct PCI device path, something like this:

/sys/devices/pci0000:00/0000:00:1d.0/0000:02:00.0/resource0
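One way to find that full path is to resolve the interface's sysfs "device" symlink before unloading the driver, since /sys/class/net/<iface> goes away once the module is removed. The following is only a sketch: the helper name pci_resource0_path and its optional second argument (an alternative sysfs root, used for testing) are made up for illustration.

```shell
#!/bin/sh
# Resolve the full sysfs path of resource0 for a network interface.
# Run this while the driver is still loaded: the /sys/class/net/<iface>
# entry disappears after rmmod.
# Usage: pci_resource0_path IFACE [SYSFS_ROOT]   (helper name is hypothetical)
pci_resource0_path() {
    iface=$1
    sysroot=${2:-/sys}
    # The "device" entry is a symlink into /sys/devices/pci.../<bus address>.
    dev=$(readlink -f "$sysroot/class/net/$iface/device") || return 1
    printf '%s/resource0\n' "$dev"
}

# Example, following the steps described in this thread:
#   res=$(pci_resource0_path enp7s0)
#   sudo rmmod atlantic
#   sudo ./pcimem "$res" 0 'w*0x2400' > regdump.txt
```

The 'w*0x2400' argument is quoted in the example only so that the shell cannot expand the asterisk as a file glob.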
Created attachment 275109 [details] Register dump from enp7s0 Hi, found it and made the dump. Hope that helps!
Thanks Chris! We see your card has a suspicious flash image configuration. We'd like to take a full dump of your NIC's flash for analysis. We may send you some tools for that. Can we do that through your email? That'll help us a lot!
Sure, thank you. Do you need the serial number of the NIC, maybe there is a whole batch of the cards by ASUS affected ? I have a second one in my windows workstation so i could check that one, too.
Christian, sent you an instruction by email to @127null01.com Thanks in advance!
Hi Igor, sorry for the delay, it was already late here. i just sent the mail - should be in your mailbox right now!
Experiencing this as well with 4.16.1. Card never gets a carrier
Hi Wes, please let us know your system configuration. It seems that's a NIC/driver/hardware incompatibility.
Hi Igor, One AMD Threadripper 1700x host and one Intel 4770K host running Gentoo Linux. Both work fine with kernel 4.15.15 and 4.15.16, it's only when upgrading to 4.16.1 today that they stopped working. Reverting to the older kernel and they return to operational status. They appear to initialise, the driver loads and kernel creates an interface (enp5s0 in both cases), it just never reports having carrier/link and the kernel believes it's 'down'.
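For others checking whether they see the same symptom, the kernel's view of the link can be read from sysfs. This is a minimal sketch; the helper name link_state and its optional sysfs-root argument are hypothetical, and the interface name should be adjusted to your system.

```shell
#!/bin/sh
# Report the kernel's view of link state for a network interface.
# Usage: link_state IFACE [SYSFS_ROOT]   (helper name is hypothetical)
link_state() {
    iface=$1
    sysroot=${2:-/sys}
    d="$sysroot/class/net/$iface"
    [ -d "$d" ] || { echo "$iface: no such interface"; return 1; }
    # Reading "carrier" fails while the interface is administratively down,
    # so fall back to "unknown" instead of aborting.
    carrier=$(cat "$d/carrier" 2>/dev/null || echo unknown)
    operstate=$(cat "$d/operstate" 2>/dev/null || echo unknown)
    echo "$iface: carrier=$carrier operstate=$operstate"
}

# Example:
#   link_state enp5s0
```

A carrier value of 0 on an interface that is up, with a cable attached and the switch reporting a link, would match the behaviour described in this report.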
Er, sorry, that should be Threadripper 1900x
i7-8700k here; based on Wes' response, that probably rules out processor architecture, as he's listed both AMD 1900x and Intel 4770K.
Igor, I'm curious what the outcome of that 'suspicious flash image configuration' investigation was as mine are also the ASUS branded cards. Is this issue suspected to be due to something ASUS did differently with these cards?
I've just tried building and testing atlantic.ko using every revision in git between the v4.15 and v4.16 tags and it appears to be working on 4.15.16. driver: atlantic version: 2.0.2.1-kern (was 1.6.13-0-kern for in-tree on 4.15.16) I think this issue is more a compatibility issue between the atlantic driver code and the 4.16 kernel itself than anything else, or a kernel bug in 4.16?
The commits I tested are below. Note that I didn't try these on 4.16 yet, only 4.15. Testing on 4.16 will have to wait until later tonight or this weekend.

d4c242d4ba5730b62579969804cd8fcf58b9c84f to efe779b749cc9da0f36a01fba38c98864e6b8748
working

4948293ff963e5451a8f0c21be8f1dfc2c7f65f5 and 8fcb98f462e6504e6d1ab2dab87c6db803c206b6
Driver probe function unexpectedly returned 4

23ee07ad3c2fd5adf6e9ef21afb9aec489dc3b4e to 76c19c6cfa8f7e4f8c7d5407f77237b80095e5d9
Bad FW version detected: expected=0, actual=105002c
Probe of 0000:05:00.0 failed with error -95

0c58c35f02c2e99bb10137b32e8ec96dcbdcc705 to 6a91ded32d6c8a6d0aee1928bb741e31577af24f (latest on github, included in the v4.16 tag but doesn't seem to be in the 4.16 or 4.16.1 sourceballs..)
working
Wes, Kevin,

The culprit commit is:

c8c82eb net: aquantia: Introduce global AQC hardware reset sequence

It's not directly revertible, so I'm attaching a simple patch that does the same revert. Please note this is not a production-ready patch; we are still looking for an acceptable final solution. A cold reset should be performed after applying the patch, otherwise the NIC will stay in a bad state.
Created attachment 275269 [details] workaround patch Cold reboot after applying!
The direct, detailed root cause is:
========================================
On the ASUS XG-C100C with 1.5.44 firmware, a special mode called "dirty wake" is active. In this mode, when the motherboard gets power (but the system has not been powered on yet), the NIC automatically enables a power-save link and watches for a WOL packet. This normally allows the PC to be powered up after an AC power failure.

Not all motherboards or BIOS settings give power to the PCI slots, so this mode is not enabled on all hardware. The mode is also never enabled if the "Start on AC power" BIOS setting is active (i.e. when the PC starts up as soon as AC power is present).

The 4.16 Linux driver introduces a full hardware reset sequence with commit c8c82eb. This is required because before that no NIC hardware reset was implemented, and there were side effects of a "not clean" start. But this full reset is incompatible with the "dirty wake" WOL feature: it keeps the PHY link in a special mode forever. As a consequence, the driver sees no link and no traffic.
========================================
We are now working out the best way to clean this up. The above patch is a temporary workaround.
Thanks for the explanation Igor, I guess I missed this in my testing because I wasn't cold rebooting the machine between each build. I'll test out the patch shortly. Thanks again
Created attachment 275271 [details] Production patch v1 In the meantime, here is a better and stable patch. This is planned for the upstream.
(In reply to Igor Russkikh from comment #43)
> Created attachment 275271 [details]
> Production patch v1
>
> In the meantime, here is a better and stable patch. This is planned for the
> upstream.

Worked for me, thanks a lot!
Fix is in netdev tree (expected in 4.17-rc1). Queued to 4.16 stable.
Any idea when this will land in 4.16? I don't see it mentioned in the 4.16.2, 4.16.3 or 4.16.4 changelogs.
I second Wes' question, and add one additional question to it: when will it be in 4.16, and when will 4.16 go back to installing properly with signed packages? I went to try installing 4.16.4 and noticed the mainline kernel branch only has unsigned kernel copies. It also has a new "modules" package. I took the chance (I'm on EFI, so this was setting up for trouble), but the 4.16.4 packages won't install on Linux Mint 18.3. They error out and leave the package manager in a broken state. Same problem with the 4.17.x branch: can't install, unsigned. What the frig.
> when this will land in 4.16 That depends on when davem will push the networking patchset to 4.16.x. You may track it here: http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
FYI, it appears to be included in 4.16.6 as per https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.16.6

Thanks again
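For checking other stable releases the same way, here is a small sketch that counts how many lines of a downloaded ChangeLog mention a keyword. The helper name changelog_mentions is made up, and "aquantia" is only an assumed search term taken from the driver's directory name; the exact subject line of the fix is not quoted in this thread.

```shell
#!/bin/sh
# Count the lines in a kernel ChangeLog file that mention a keyword
# (case-insensitive). Prints the count; exits non-zero if there is no match.
# Usage: changelog_mentions FILE KEYWORD   (helper name is hypothetical)
changelog_mentions() {
    grep -ci -- "$2" "$1"
}

# Example:
#   curl -sO https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.16.6
#   changelog_mentions ChangeLog-4.16.6 aquantia
```

A non-zero count only shows that the release touches the driver; the matching entries still need to be read to confirm they are the fix discussed here.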