Bug 215798

Summary: Aquantia / Atlantic driver crashes on hibernation entry
Product: Drivers
Component: Network
Reporter: Manuel Ullmann (labre)
Assignee: drivers_network (drivers_network)
Status: RESOLVED INVALID
Severity: normal
Priority: P1
CC: andrew, christoph.n.stich
Hardware: All
OS: Linux
Kernel Version: 5.17.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  Picture of the stack trace screen
  Script working around the crash
  Minimal script working around the crash

Description Manuel Ullmann 2022-04-04 21:36:26 UTC
Created attachment 300692
Picture of the stack trace screen

==========================
Summary
==========================

If an interface driven by the atlantic module has its link up when the system enters hibernation, the driver crashes with the attached trace. Setting the link down or unloading the module is a valid workaround, but logind and/or NetworkManager will reload the module (regardless of blacklisting) and restore the link state, so the workaround breaks with common userspace.
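
The link-down workaround described above could look roughly like the following. This is a minimal sketch and not the attached script: the function names, the DRY_RUN switch and the interface name eth0 are assumptions made for illustration, and it relies only on `ip link` and the /sys/power files from the steps to reproduce.

```shell
#!/bin/sh
# Sketch only: take the link down, hibernate, bring the link back up.
# DRY_RUN and all function names are hypothetical, not from the attachments.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

write_sysfs() {
    # write_sysfs VALUE PATH: same effect as `echo VALUE >PATH`.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "echo $1 >$2"
    else
        echo "$1" >"$2"
    fi
}

hibernate_with_link_down() {
    iface=$1
    run ip link set "$iface" down || return 1
    write_sysfs platform /sys/power/disk || return 1
    write_sysfs disk /sys/power/state
    status=$?
    # Execution continues here after resume (or after a failed attempt).
    run ip link set "$iface" up
    return "$status"
}
```

Running `DRY_RUN=1 hibernate_with_link_down eth0` only prints the four commands, which makes it possible to review the sequence before running it as root without DRY_RUN.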

I’ll provide a working hibernation script using solely the kernel interface. The crash did not occur with any of the /sys/power/pm_test modes (such as core) that I tried.
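
For readers unfamiliar with pm_test, a test run over the documented levels might be scripted as below. This is a sketch under assumptions: the report does not say which levels were actually tried, the kernel needs CONFIG_PM_DEBUG for /sys/power/pm_test to exist, and POWER_DIR and pm_test_cycle are hypothetical names (POWER_DIR exists only so the loop can be tried against a scratch directory).

```shell
#!/bin/sh
# Sketch only: run one hibernation test cycle per pm_test level.
# POWER_DIR is a hypothetical override; the real files live in /sys/power.
POWER_DIR=${POWER_DIR:-/sys/power}

pm_test_cycle() {
    # Levels from shallowest to deepest, as listed in the kernel's
    # basic PM debugging documentation.
    for level in freezer devices platform processors core; do
        echo "testing level: $level"
        echo "$level" >"$POWER_DIR/pm_test" || return 1
        # With pm_test set, this write stops at the chosen level,
        # waits a few seconds and resumes instead of powering off.
        echo disk >"$POWER_DIR/state" || return 1
    done
    # Switch test mode off again so the next write hibernates for real.
    echo none >"$POWER_DIR/pm_test"
}
```

Against the real /sys/power this has to run as root, and each level suspends the machine for a few seconds.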

==========================
Steps to reproduce
==========================
1. modprobe atlantic
2. ip link set <iface> up
# possibly a connection has to be established first
3. echo platform >/sys/power/disk
# provided a swap device is available
4. echo disk >/sys/power/state

==========================
Actual behaviour
==========================
The atlantic module will crash with a trace, leaving the system in a semi-hibernated state. Sysrq is still possible.

==========================
Expected behaviour
==========================
The module should happily go to sleep, cuddling with his best friends.

==========================
Additional information
==========================
Stack trace is attached. Sorry, OS can’t do screenshots in this state.
The device is an AQC107 integrated in an ASUS ROG Zenith Ⅱ Extreme Alpha mainboard.

44:00.0 Ethernet controller: Aquantia Corp. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion] (rev 02)
        Subsystem: ASUSTeK Computer Inc. AQC107 NBase-T/IEEE 802.3bz Ethernet Controller [AQtion]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 81
        IOMMU group: 50
        Region 0: Memory at e1040000 (64-bit, non-prefetchable) [size=64K]
        Region 2: Memory at e1050000 (64-bit, non-prefetchable) [size=4K]
        Region 4: Memory at e0c00000 (64-bit, non-prefetchable) [size=4M]
        Expansion ROM at e1000000 [disabled] [size=256K]
        Capabilities: [40] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s unlimited, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x2 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink+ Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [90] MSI-X: Enable+ Count=32 Masked-
                Vector table: BAR=2 offset=00000000
                PBA: BAR=2 offset=00000200
        Capabilities: [a0] MSI: Enable- Count=1/32 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [c0] Vital Product Data
                Product Name: Atlantic
                Read-only fields:
                        [PN] Part number: 3290495095
                        [EC] Engineering changes: 0
                        [FG] Unknown: 61 62 63
                        [LC] Unknown: 64 65 66
                        [MN] Manufacture ID: AFDSWEWEBSFD
                        [PG] Unknown: 49 49 49
                        [SN] Serial number: CPL5938TLKMY
                        [V0] Vendor specific: wfewfe
                        [V1] Vendor specific: fwewfe
                        [V2] Vendor specific: SDFWI
                        [RV] Reserved: checksum good, 0 byte(s) reserved
                Read/write fields:
                        [YA] Asset tag: 9495829
                        [V0] Vendor specific: f34ge4rsg
                        [V1] Vendor specific: ger35g5rthghgsa3
                        [Y0] System specific: bsdfvbxcz
                        [Y1] System specific: fwefewwfe
                        [RW] Read-write area: 11 byte(s) free
                End
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [150 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [180 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: atlantic
        Kernel modules: atlantic
Comment 1 Manuel Ullmann 2022-04-04 21:39:17 UTC
Created attachment 300693
Script working around the crash
Comment 2 Manuel Ullmann 2022-04-04 21:47:23 UTC
Created attachment 300694
Minimal script working around the crash

The previous script referenced my logind quirks, so it was not helpful for the report. This one should be, though.
Comment 3 Manuel Ullmann 2022-04-07 17:48:54 UTC
I ignored the bug reporting documentation. I’m currently collecting all the relevant information and will report back properly once that is done. Please excuse the noise.
Comment 4 Andrew M 2022-05-06 15:24:47 UTC
Hi, the change associated with this appears to have caused a regression. See https://bugzilla.kernel.org/show_bug.cgi?id=215949
Comment 5 Manuel Ullmann 2022-05-06 17:35:52 UTC
> --- Comment #4 from Andrew M (andrew@m6l.net) ---
> Hi, the change associated with this appears to have caused a regression. See
> https://bugzilla.kernel.org/show_bug.cgi?id=215949
Thanks, this is handled in
https://patchwork.kernel.org/project/netdevbpf/patch/8735hniqcm.fsf@posteo.de/

This is a partial revert and has been successfully tested by two other
reporters. If need be, you can apply it to a custom kernel until it
reaches stable. Shouldn’t take too long.

Manuel
Comment 6 Andrew M 2022-05-06 18:45:02 UTC
Hi, thank you for the patch. I can confirm that applying that patch (instead of a revert) onto 5.15.36 remedies the regression I saw. I look forward to seeing it merged into stable.
Comment 7 Christoph Stich 2022-05-19 23:08:49 UTC
For what it's worth, the patch also fixes the regression for me. Thanks.
Comment 8 Manuel Ullmann 2022-05-19 23:45:48 UTC
It’s included in v5.17.9, v5.15.41, v5.10.117 and mainline.
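
The release list above can be turned into a quick check of whether a given kernel version is at or past the listed release for its series. This is only a sketch: includes_fix is a hypothetical helper, it assumes upstream-style version strings such as 5.15.41 or 5.15.41-gentoo (distribution-specific numbering will give misleading answers), and it reports any series not named above, including newer mainline releases, as unknown.

```shell
#!/bin/sh
# includes_fix VERSION: exit 0 if VERSION is at or past the stable release
# listed above for its series, 1 if it is older, 2 if the series is unknown.
includes_fix() {
    ver=${1%%-*}                        # drop a local suffix such as "-gentoo"
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    patch=${rest#*.}
    [ "$patch" = "$rest" ] && patch=0   # "5.17" has no patch level
    patch=${patch%%[!0-9]*}             # keep only the leading digits
    [ -n "$patch" ] || patch=0
    case "$major.$minor" in
        5.17) min=9 ;;
        5.15) min=41 ;;
        5.10) min=117 ;;
        *) return 2 ;;
    esac
    [ "$patch" -ge "$min" ]
}
```

For example, `includes_fix "$(uname -r)"` checks the running kernel, provided its version string follows the upstream scheme.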