Bug 214619 - Aquantia / Atlantic driver not loading post 5.14.RC7
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 high
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-10-04 21:02 UTC by Miles.Rumppe
Modified: 2021-11-24 14:39 UTC
CC List: 3 users

See Also:
Kernel Version: 5.14-rc7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Photo of error (2.83 MB, image/jpeg)
2021-10-04 21:02 UTC, Miles.Rumppe
attachment-11457-0.html (5.40 KB, text/html)
2021-11-21 19:08 UTC, Miles.Rumppe
attachment-22663-0.html (371 bytes, text/html)
2021-11-23 23:58 UTC, Miles.Rumppe

Description Miles.Rumppe 2021-10-04 21:02:11 UTC
Created attachment 299091
Photo of error

When upgrading from 5.14-rc7 to any later release, my system hangs during boot with an error regarding the network card (see the attached photo).

server:~$ sudo inxi -F
System:    Host: server Kernel: 5.14.0-051400rc7-lowlatency x86_64 bits: 64 Console: tty pts/5 
           Distro: Ubuntu 21.10 (Impish Indri) 
Machine:   Type: Desktop Mobo: ASRock model: Z390M Pro4 serial: M80-C9015301070 UEFI-[Legacy]: American Megatrends v: P4.30 
           date: 12/03/2019 
CPU:       Info: 6-Core model: Intel Core i7-8700K bits: 64 type: MT MCP cache: L2: 12 MiB 
           Speed: 800 MHz min/max: 800/4700 MHz Core speeds (MHz): 1: 800 2: 799 3: 800 4: 800 5: 800 6: 800 7: 800 8: 800 
           9: 800 10: 800 11: 800 12: 800 
Graphics:  Device-1: Intel CometLake-S GT2 [UHD Graphics 630] driver: i915 v: kernel 
           Device-2: Conexant Systems CX23887/8 PCIe Broadcast Audio and Video Decoder with 3D Comb driver: cx23885 v: 0.0.4 
           Device-3: Conexant Systems CX23887/8 PCIe Broadcast Audio and Video Decoder with 3D Comb driver: cx23885 v: 0.0.4 
           Display: server: X.org 1.20.13 driver: loaded: modesetting tty: 182x55 
           Message: Advanced graphics data unavailable in console for root. 
Audio:     Device-1: Intel Cannon Lake PCH cAVS driver: snd_hda_intel 
           Device-2: Conexant Systems CX23887/8 PCIe Broadcast Audio and Video Decoder with 3D Comb driver: cx23885 
           Device-3: Conexant Systems CX23887/8 PCIe Broadcast Audio and Video Decoder with 3D Comb driver: cx23885 
           Sound Server-1: ALSA v: k5.14.0-051400rc7-lowlatency running: yes 
           Sound Server-2: PulseAudio v: 15.0 running: yes 
           Sound Server-3: PipeWire v: 0.3.32 running: yes 
Network:   Device-1: Intel Ethernet I219-V driver: e1000e 
           IF: eno1 state: up speed: 1000 Mbps duplex: full mac: a2:9d:8a:e8:39:2a 
           Device-2: Aquantia AQC111 NBase-T/IEEE 802.3bz Ethernet [AQtion] driver: atlantic 
           IF: enp5s0 state: up speed: 2500 Mbps duplex: full mac: 24:5e:be:4d:c4:53 
           Device-3: Aquantia AQC111 NBase-T/IEEE 802.3bz Ethernet [AQtion] driver: atlantic 
           IF: enp6s0 state: down mac: 24:5e:be:4d:c4:54 
           Device-4: Aquantia AQC111 NBase-T/IEEE 802.3bz Ethernet [AQtion] driver: atlantic 
           IF: enp8s0 state: up speed: 1000 Mbps duplex: full mac: a2:9d:8a:e8:39:2a 
           Device-5: Aquantia AQC111 NBase-T/IEEE 802.3bz Ethernet [AQtion] driver: atlantic 
           IF: enp9s0 state: up speed: 1000 Mbps duplex: full mac: a2:9d:8a:e8:39:2a 
           IF-ID-1: bo0 state: up speed: 2000 Mbps duplex: full mac: a2:9d:8a:e8:39:2a 
           IF-ID-2: bonding_masters state: N/A speed: N/A duplex: N/A mac: N/A 
           IF-ID-3: br0 state: up speed: 2500 Mbps duplex: unknown mac: da:88:2f:77:a2:3d 
           IF-ID-4: nordlynx state: unknown speed: N/A duplex: N/A mac: N/A 
RAID:      Device-1: md0 type: mdraid level: raid-10 status: active size: 18.19 TiB report: 5/5 UUUUU 
           Components: Online: 1: sdd1 2: sde1 3: sdb1 4: sdc1 5: sda1 
Drives:    Local Storage: total: raw: 36.62 TiB usable: 18.43 TiB used: 7.18 TiB (39.0%) 
           ID-1: /dev/nvme0n1 vendor: Samsung model: MZVPW256HEGL-000H1 size: 238.47 GiB 
           ID-2: /dev/sda vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB 
           ID-3: /dev/sdb vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB 
           ID-4: /dev/sdc vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB 
           ID-5: /dev/sdd vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB 
           ID-6: /dev/sde vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB 
Partition: ID-1: / size: 233.73 GiB used: 89.04 GiB (38.1%) fs: ext4 dev: /dev/nvme0n1p1 
Swap:      Alert: No swap data was found. 
Sensors:   System Temperatures: cpu: 30.0 C mobo: 32.0 C 
           Fan Speeds (RPM): fan-1: 860 fan-2: 634 fan-4: 888 fan-5: 882 
Info:      Processes: 366 Uptime: 2h 58m Memory: 15.29 GiB used: 2.47 GiB (16.1%) Init: systemd runlevel: 5 Shell: Bash 
           inxi: 3.3.06 

I've tried updating the kernel weekly, each time with the same result as in the attachment. To recover, I have to boot into a live CD, remove the updated kernel, and revert to the rc7 release. It's a bit difficult to capture more information: the hanging boot eventually does bring up the desktop, but commands requiring sudo don't respond in the terminal. I've attempted weekly upgrades to non-RC kernel versions, and I also skipped ahead to 5.15.x to see whether the issue had been resolved, to no avail.
Comment 1 Miles.Rumppe 2021-11-09 16:38:35 UTC
Just to add: I've also tried the latest 5.15 releases and still experience the boot hang with the NIC. I dove into the "log" for the kernel releases to compare the modules being loaded, and the modules do seem to be included in the package.

I'm starting to second-guess things here and think the problem might be on the ifenslave side, as I'm using bridge/bond configs for WAN/LAN as a router.

I'm also considering an upgrade to Z690/12700K and would be porting this card over to that board for continued use. It would just be nice to keep current with releases for the additional features and security of the updates.
Comment 2 Jeremy Soller 2021-11-09 16:41:30 UTC
Here is a similar kernel panic from 5.15.1:

invalid opcode: 0000 [#1] SMP NOPTI
CPU: 3 PID: 1126 Comm: NetworkManager Tainted: P           OE     5.15.1-76051501-generic #202111061036~1636381537~21.10~73176c9
Hardware name: System76 Thelio Major/Thelio Major, BIOS 3301 11/06/2020
RIP: 0010:aq_nic_start+0x37d/0x380 [atlantic]
Code: 6c 45 c0 e8 35 0f ed cf 41 89 c4 85 c0 0f 88 3d fd ff ff 41 8b 8d 94 01 00 00 b8 01 00 00 00 d3 e0 41 09 85 f8 05 00 00 eb 84 <0f> 0b 90 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 48 8b bf 50
RSP: 0018:ffffbc40c1cdf418 EFLAGS: 00010202
RAX: 0000000000000008 RBX: 0000000000000008 RCX: 0000000000000010
RDX: 0000000000000019 RSI: 0000000000005be8 RDI: ffff9e1019182428
RBP: ffffbc40c1cdf430 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000001000 R11: ffffe9148446be00 R12: 0000000000000000
R13: ffff9e1019196980 R14: 0000000000000001 R15: ffff9e1019196000
FS:  00007f2f2d6bb480(0000) GS:ffff9e177fcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc18ca091c0 CR3: 0000000117532004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 aq_ndev_open+0x49/0x70 [atlantic]
 __dev_open+0xec/0x1a0
 __dev_change_flags+0x1a3/0x210
 dev_change_flags+0x26/0x60
 do_setlink+0x28a/0xc50
 ? __nla_validate_parse+0x4c/0x1a0
 ? __nla_reserve+0x41/0x50
 __rtnl_newlink+0x608/0xa10
 ? __netlink_sendskb+0x5f/0x80
 ? netlink_unicast+0x2f3/0x330
 ? cpumask_next_and+0x24/0x30
 ? update_sg_lb_stats+0x6f/0x390
 ? update_sd_lb_stats.constprop.0+0xda/0x250
 ? kmem_cache_alloc_trace+0x18c/0x2c0
 rtnl_newlink+0x49/0x70
 rtnetlink_rcv_msg+0x14f/0x3a0
 ? rtnl_calcit.isra.0+0x130/0x130
 netlink_rcv_skb+0x53/0x100
 rtnetlink_rcv+0x15/0x20
 netlink_unicast+0x21a/0x330
 netlink_sendmsg+0x250/0x4a0
 sock_sendmsg+0x62/0x70
 ____sys_sendmsg+0x24e/0x290
 ? import_iovec+0x31/0x40
 ? sendmsg_copy_msghdr+0x7b/0xa0
 ? kfree+0xba/0x3c0
 ___sys_sendmsg+0x81/0xc0
 ? proc_sys_call_handler+0x1b4/0x290
 ? proc_sys_write+0x13/0x20
 ? new_sync_write+0x117/0x1a0
 ? security_file_free+0x54/0x60
 ? kmem_cache_free+0xfb/0x3f0
 ? __fget_files+0x5f/0x90
 __sys_sendmsg+0x62/0xb0
 __x64_sys_sendmsg+0x1d/0x20
 do_syscall_64+0x59/0xc0
 ? exit_to_user_mode_prepare+0x37/0xb0
 ? syscall_exit_to_user_mode+0x27/0x50
 ? __x64_sys_close+0x11/0x40
 ? do_syscall_64+0x69/0xc0
 ? irqentry_exit+0x19/0x30
 ? sysvec_reschedule_ipi+0x78/0xe0
 ? asm_sysvec_reschedule_ipi+0xa/0x20
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f2f2e6d13fd
Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 fa a4 f6 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 3e a5 f6 ff 48
RSP: 002b:00007ffcf63b1400 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000010 RCX: 00007f2f2e6d13fd
RDX: 0000000000000000 RSI: 00007ffcf63b1440 RDI: 000000000000000c
RBP: 000056449b84f030 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 00007ffcf63b1590 R14: 00007ffcf63b158c R15: 0000000000000000
Modules linked in: skx_edac(-) isst_if_common cmac algif_hash algif_skcipher af_alg snd_hda_codec_realtek bnep nfit snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio x86_pkg_temp_thermal nvidia_drm(POE) nvidia_modeset(POE) intel_powerclamp coretemp snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core iwlmvm snd_hwdep kvm_intel snd_pcm nvidia(POE) eeepc_wmi snd_seq_midi kvm snd_seq_midi_event asus_wmi mac80211 snd_rawmidi platform_profile btusb rapl sparse_keymap btrtl drm_kms_helper nls_iso8859_1 input_leds intel_wmi_thunderbolt wmi_bmof libarc4 intel_cstate video btbcm snd_seq cec btintel mxm_wmi rc_core bluetooth snd_seq_device fb_sys_fops syscopyarea iwlwifi snd_timer efi_pstore sysfillrect ecdh_generic sysimgblt snd ecc cfg80211 soundcore mei_me ioatdma mei dca mac_hid acpi_tad sch_fq_codel msr parport_pc ppdev drm lp parport ip_tables x_tables autofs4 dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
 raid6_pq libcrc32c raid1 raid0 multipath linear system76_io(OE) system76_acpi(OE) hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid crct10dif_pclmul crc32_pclmul nvme ghash_clmulni_intel intel_lpss_pci aesni_intel ahci atlantic xhci_pci i2c_i801 intel_lpss crypto_simd cryptd e1000e nvme_core i2c_smbus xhci_pci_renesas libahci idma64 macsec wmi pinctrl_sunrisepoint
---[ end trace 68d84992dd565846 ]---
RIP: 0010:aq_nic_start+0x37d/0x380 [atlantic]
Code: 6c 45 c0 e8 35 0f ed cf 41 89 c4 85 c0 0f 88 3d fd ff ff 41 8b 8d 94 01 00 00 b8 01 00 00 00 d3 e0 41 09 85 f8 05 00 00 eb 84 <0f> 0b 90 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 48 8b bf 50
RSP: 0018:ffffbc40c1cdf418 EFLAGS: 00010202
RAX: 0000000000000008 RBX: 0000000000000008 RCX: 0000000000000010
RDX: 0000000000000019 RSI: 0000000000005be8 RDI: ffff9e1019182428
RBP: ffffbc40c1cdf430 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000001000 R11: ffffe9148446be00 R12: 0000000000000000
R13: ffff9e1019196980 R14: 0000000000000001 R15: ffff9e1019196000
FS:  00007f2f2d6bb480(0000) GS:ffff9e177fcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc18ca091c0 CR3: 0000000117532004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
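The `Code:` line in the oops above can be decoded by hand to confirm what the CPU tripped over. The following is a minimal Python sketch, not part of the original report: it finds the byte the kernel marked with angle brackets (the byte at the faulting RIP) and checks whether it starts the two-byte x86 `ud2` encoding (`0f 0b`). The helper name `faulting_bytes` is made up for illustration.

```python
# The kernel-side "Code:" line copied from the oops above.
CODE_LINE = (
    "6c 45 c0 e8 35 0f ed cf 41 89 c4 85 c0 0f 88 3d fd ff ff 41 8b 8d 94 "
    "01 00 00 b8 01 00 00 00 d3 e0 41 09 85 f8 05 00 00 eb 84 <0f> 0b 90 "
    "0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 48 8b bf 50"
)

UD2 = bytes([0x0F, 0x0B])  # x86 encoding of the ud2 instruction


def faulting_bytes(code_line, length=2):
    """Return (index of the marked byte, `length` bytes starting there).

    Hypothetical helper: the kernel wraps the byte at the faulting RIP
    in angle brackets, so we locate that token and slice from it.
    """
    tokens = code_line.split()
    index = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    raw = bytes(int(tok.strip("<>"), 16) for tok in tokens)
    return index, raw[index:index + length]


index, opcode = faulting_bytes(CODE_LINE)
print(f"marked byte at position {index}: {opcode.hex(' ')}")
print("ud2" if opcode == UD2 else "not ud2")
```

The marked bytes decode to `ud2`, which is consistent with the `invalid opcode` header of the oops. This says nothing yet about how execution reached that instruction.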
Comment 3 Igor Russkikh 2021-11-10 07:58:33 UTC
Hmm, that's a very strange failure and backtrace.

Looking into the driver change log:

~/kernel/linux/drivers/net/ethernet/aquantia$ git log --oneline v5.14-rc7..v5.15-rc1 .
29ce8f970107 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
57f780f1c433 atlantic: Fix driver resume flow.
f3ccfda19319 ethtool: extend coalesce setting uAPI with CQE mode
3852e54e6736 net: atlantic: switch from 'pci_' to 'dma_' API
a76053707dbf dev_ioctl: split out ndo_eth_ioctl


I really don't see how any of these can impact aq_ndev_open flow.

Is it possible for you to bisect based on these commits?

Thanks,
  Igor
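For readers unfamiliar with bisection, here is a minimal Python sketch of the search that `git bisect` automates, applied to the five commits listed above. It is only a model under stated assumptions: real history is not linear (the list includes a merge), the commits are assumed to be ordered oldest first by reversing the log output, and the "is this build bad?" check is a stand-in for building, booting, and bringing the NIC up. No commit is being named as the cause here.

```python
# Commit IDs from the `git log --oneline` output above, reversed so the
# oldest comes first. Treating them as a linear sequence is an assumption.
COMMITS = [
    "a76053707dbf",
    "3852e54e6736",
    "f3ccfda19319",
    "57f780f1c433",
    "29ce8f970107",
]


def first_bad(commits, is_bad):
    """Binary-search for the first commit where is_bad(commit) is True.

    Assumes the last commit is bad and that every commit after the
    first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1
    tested = []
    while lo < hi:
        mid = (lo + hi) // 2
        tested.append(commits[mid])
        if is_bad(commits[mid]):
            hi = mid        # first bad commit is at mid or earlier
        else:
            lo = mid + 1    # first bad commit is after mid
    return commits[lo], tested


# Try every possible culprit in turn (purely hypothetical) and record the
# largest number of test builds the search needed.
worst = 0
for pretend_culprit in range(len(COMMITS)):
    found, tested = first_bad(
        COMMITS, lambda c: COMMITS.index(c) >= pretend_culprit
    )
    assert found == COMMITS[pretend_culprit]
    worst = max(worst, len(tested))

print("at most", worst, "test builds for", len(COMMITS), "commits")
```

With only five candidates the search never needs more than three test builds, which is why bisecting is cheap even when each test means a full kernel build and reboot.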
Comment 4 Igor Russkikh 2021-11-10 08:29:58 UTC
After looking into the disassembly, I see the failure is AFTER the end of the function, past its final jmp:

    272d:       b8 01 00 00 00          mov    $0x1,%eax
    2732:       d3 e0                   shl    %cl,%eax
    2734:       41 09 85 f8 05 00 00    or     %eax,0x5f8(%r13)
    273b:       eb 84                   jmp    26c1 <aq_nic_start+0x301>
>>  273d:       0f 0b                   ud2    
    273f:       90                      nop

That is, on an undefined instruction (ud2). We had a similar report in the past from another user, but were unable to find out how execution could reach that point.

I suspect this is something related to compiler optimisation or some other coincidence.

What kernel build are you using? Ubuntu or something else?

Igor
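The offsets in the disassembly above can be cross-checked against the `RIP: aq_nic_start+0x37d/0x380` line of the oops. The following Python sketch is an illustration rather than part of the original analysis; it assumes the listing comes from an object file with the same function layout as the crashing kernel, and all constants are copied from the two excerpts.

```python
# Values copied from the oops and from the disassembly listing above.
FAULT_OFFSET = 0x37D       # "RIP: aq_nic_start+0x37d/0x380"
FUNC_SIZE = 0x380          # size part of the same RIP line
JMP_TARGET = 0x26C1        # "jmp 26c1 <aq_nic_start+0x301>"
JMP_TARGET_OFFSET = 0x301  # offset shown for that jump target
UD2_ADDR = 0x273D          # address of the ud2 line in the listing
UD2_LEN = 2                # ud2 encodes as the two bytes 0f 0b

# The jump annotation gives one (address, offset) pair, which fixes
# where aq_nic_start begins in the listing.
func_base = JMP_TARGET - JMP_TARGET_OFFSET
fault_addr = func_base + FAULT_OFFSET
func_end = func_base + FUNC_SIZE  # first address past the symbol

print(hex(func_base), hex(fault_addr), hex(func_end))
print("fault is the ud2:", fault_addr == UD2_ADDR)
print("bytes left in symbol after ud2:", func_end - (UD2_ADDR + UD2_LEN))
```

Under that assumption the faulting offset lands exactly on the `ud2` at 0x273d, directly after the unconditional `jmp`, with only the one-byte `nop` left before the symbol ends.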
Comment 5 Jeremy Soller 2021-11-10 14:56:13 UTC
Thank you for replying, Igor. I will try to bisect the error. The kernel is built from this branch using Debian packaging: https://github.com/pop-os/linux/tree/5.15.1_impish
Comment 6 Jeremy Soller 2021-11-10 22:59:01 UTC
Progress so far: v5.14-rc7 does not appear to have the bug, while v5.14 does. We are now bisecting between these two tags.
Comment 7 Miles.Rumppe 2021-11-21 19:08:31 UTC
Created attachment 299663
attachment-11457-0.html

Little update here.

I rebuilt my server from the ground up to Z690/12700K this week and have
been pushing through some quirks I've discovered along the way.

Anyway, I was finally able to update beyond RC7, to 5.15.4-051504-lowlatency.

All the NIC ports are up and running

No more hang / panic observed.

It's a bit strange, though, to be stuck at 5.14-rc7 and then be able to upgrade again after swapping out the motherboard and CPU.

System:    Host: server Kernel: 5.15.4-051504-lowlatency x86_64 bits: 64 Console: tty pts/13
           Distro: Ubuntu 21.10 (Impish Indri)
Machine:   Type: Desktop Mobo: ASRock model: Z690 Steel Legend serial: UEFI: American Megatrends LLC. v: 2.02
           date: 10/01/2021
CPU:       Info: 10-Core model: 12th Gen Intel Core i7-12700K bits: 64 type: MT MCP cache: L2: 25 MiB
           Speed: 600 MHz min/max: 800/6300 MHz Core speeds (MHz): 1: 600 2: 903 3: 600 4: 600 5: 600 6: 601 7: 668 8: 600
           9: 800 10: 800 11: 801 12: 801 13: 600 14: 600 15: 600 16: 2691 17: 1190 18: 4856 19: 4810 20: 4900
Graphics:  Device-1: Intel AlderLake-S GT1 driver: N/A
           Device-2: Conexant Systems CX23887/8 PCIe Broadcast Audio and Video Decoder with 3D Comb driver: cx23885 v: 0.0.4
           Device-3: Conexant Systems CX23887/8 PCIe Broadcast Audio and Video Decoder with 3D Comb driver: cx23885 v: 0.0.4
           Display: server: X.org 1.20.13 driver: loaded: fbdev unloaded: modesetting,vesa tty: 202x56
           Message: Advanced graphics data unavailable in console for root.
Audio:     Device-1: Intel driver: snd_hda_intel
           Device-2: Conexant Systems CX23887/8 PCIe Broadcast Audio and Video Decoder with 3D Comb driver: cx23885
           Device-3: Conexant Systems CX23887/8 PCIe Broadcast Audio and Video Decoder with 3D Comb driver: cx23885
           Sound Server-1: ALSA v: k5.15.4-051504-lowlatency running: yes
           Sound Server-2: PulseAudio v: 15.0 running: yes
           Sound Server-3: PipeWire v: 0.3.32 running: yes
Network:   Device-1: Realtek RTL8125 2.5GbE driver: r8169
           IF: enp5s0 state: up speed: 1000 Mbps duplex: full mac: 0a:22:5f:3e:7c:65
           Device-2: Intel Wi-Fi 6 AX210/AX211/AX411 160MHz driver: iwlwifi
           IF: wlp6s0 state: down mac: d8:f8:83:d8:8e:c0
           Device-3: Aquantia AQC111 NBase-T/IEEE 802.3bz Ethernet [AQtion] driver: atlantic
           IF: enp10s0 state: up speed: 2500 Mbps duplex: full mac: 24:5e:be:4d:c4:53
           Device-4: Aquantia AQC111 NBase-T/IEEE 802.3bz Ethernet [AQtion] driver: atlantic
           IF: enp11s0 state: down mac: 24:5e:be:4d:c4:54
           Device-5: Aquantia AQC111 NBase-T/IEEE 802.3bz Ethernet [AQtion] driver: atlantic
           IF: enp13s0 state: up speed: 1000 Mbps duplex: full mac: 0a:22:5f:3e:7c:65
           Device-6: Aquantia AQC111 NBase-T/IEEE 802.3bz Ethernet [AQtion] driver: atlantic
           IF: enp14s0 state: up speed: 1000 Mbps duplex: full mac: 0a:22:5f:3e:7c:65
           IF-ID-1: bo0 state: up speed: 2000 Mbps duplex: full mac: 0a:22:5f:3e:7c:65
           IF-ID-2: bonding_masters state: N/A speed: N/A duplex: N/A mac: N/A
           IF-ID-3: br0 state: up speed: 2500 Mbps duplex: unknown mac: f6:78:d2:2e:6e:c9
           IF-ID-4: nordlynx state: unknown speed: N/A duplex: N/A mac: N/A
Bluetooth: Device-1: Intel type: USB driver: btusb
           Report: hciconfig ID: hci0 state: up address: D8:F8:83:D8:8E:C4 bt-v: 3.0
RAID:      Device-1: md0 type: mdraid level: raid-10 status: active size: 18.19 TiB report: 5/5 UUUUU
           Components: Online: 1: sdc1 2: sdb1 3: sdd1 4: sda1 5: sde1
Drives:    Local Storage: total: raw: 36.62 TiB usable: 18.43 TiB used: 7.68 TiB (41.7%)
           ID-1: /dev/sda vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB
           ID-2: /dev/sdb vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB
           ID-3: /dev/sdc vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB
           ID-4: /dev/sdd vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB
           ID-5: /dev/sde vendor: Western Digital model: WD80EZAZ-11TDBA0 size: 7.28 TiB
           ID-6: /dev/sdf type: USB vendor: SanDisk model: Extreme Pro size: 238.5 GiB
Partition: ID-1: / size: 233.71 GiB used: 97.16 GiB (41.6%) fs: ext4 dev: /dev/sdf2
           ID-2: /boot/efi size: 48.2 MiB used: 3.3 MiB (6.8%) fs: vfat dev: /dev/sdf1
Swap:      Alert: No swap data was found.
Sensors:   System Temperatures: cpu: 33.0 C mobo: 34.5 C
           Fan Speeds (RPM): fan-1: 658 fan-2: 831 fan-3: 0 fan-4: 678 fan-5: 0 fan-6: 0 fan-7: 632
Info:      Processes: 440 Uptime: 6m Memory: 15.39 GiB used: 2.23 GiB (14.5%) Init: systemd runlevel: 5 Shell: Bash
           inxi: 3.3.06
Comment 8 Miles.Rumppe 2021-11-23 23:58:39 UTC
Created attachment 299693
attachment-22663-0.html

I just tested 5.16-rc2 and got the same hang behavior. I guess it's a moving target as to what works and what doesn't in each release.

In the past, this sort of breakage has tended to resolve itself within a release or two after the defective one. I'll just keep an eye on things.
Comment 9 Jeremy Soller 2021-11-24 14:39:57 UTC
Our issue went away with kernel 5.15.4. We have not tested 5.16-rc2 yet.
