217659 – [Intel I225-V] Access causes instant hard lockup

Status:	NEW

Alias:	None

Product:	Drivers
Classification:	Unclassified
Component:	Network
Hardware:	Intel Linux

Importance:	P3 high
Assignee:	drivers_network@kernel-bugs.osdl.org

URL:
Keywords:

Depends on:
Blocks:

Reported:	2023-07-11 20:45 UTC by Andreas Reis
Modified:	2023-08-07 10:10 UTC
CC List:	1 user

See Also:
Kernel Version:
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
"dmesg -w -T | grep -i igc" & call traces (9.60 KB, text/plain) 2023-07-11 20:45 UTC, Andreas Reis
dmesg call traces with vanilla 6.4.3.arch1-1 (12.91 KB, text/plain) 2023-07-12 16:16 UTC, Andreas Reis

Description Andreas Reis 2023-07-11 20:45:58 UTC

Created attachment 304611 [details]
"dmesg -w -T | grep -i igc" & call traces

ASUS B660-G (BIOS 2404; most recent)
Kernel: 6.4.3-zen (Arch Linux; booted with pcie_aspm=force)

Any attempt to even just get the status of my board's Intel I225-V chip causes a hard lockup.

As in: Just running 'ethtool -i enp5s0' or 'dhcpcd enp5s0' is enough to cause the bug.

For that reason, it's not connected to anything.

Trying to boot with 'pcie_port_pm=off' as suggested for other boards has no effect (not that I'd like to waste energy in this economy anyway :).

I've read the reports about the chip, in particular…
https://old.reddit.com/r/buildapc/comments/xypn1m/network_card_intel_ethernet_controller_i225v_igc/
… incl. a report of the same board, same bug — so I know that unless Asus gets its act together, the chip is unlikely to function.

But it would be nice to at least get it to not cause lockups.

In particular, these access attempts also occur during Arch's standard kernel updates (during both the device-mapper and polkit reloads), so currently an update plays out in one of two ways:
* at best, the lockup occurs after the updated kernel has been written, forcing me to reset the computer.
* at worst, the write is interrupted, leaving me without a usable kernel (i.e., forced to boot into an installation medium).

Please see the attached text for dmesg excerpts / call traces.

Comment 1 Bagas Sanjaya 2023-07-12 09:05:39 UTC

(In reply to Andreas Reis from comment #0)

Do you have this issue on vanilla v6.1 and v6.4?

Comment 2 Andreas Reis 2023-07-12 16:16:00 UTC

Created attachment 304618 [details]
dmesg call traces with vanilla 6.4.3.arch1-1

Same behavior when trying to run ethtool & dhcpcd with vanilla 6.4.3.arch1-1 booted with pcie_port_pm=off, yes (see attachment).

I don't have v6.1 around anymore but the bug has been present ever since I bought this board last December.

Comment 3 Andreas Reis 2023-07-21 15:12:01 UTC

Seeing a relevant commit landed recently in stable…

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v6.4.4&id=c82c1a4dbe2dd43efef85290bee42ce658a79c02 (07-19)

… I tried again with "dhcpcd enp5s0":

・ first after cold boot: Same hang-crash as before.
・ after a suspend & resume: Ethernet works properly.

So I tried "rmmod igc && modprobe igc" after cold boot and that works, too.

I know neither…
・ whether it's due to this commit (I didn't test a module reload before, but I did notice that occasionally the kernel update's device-mapper phase would *not* cause a hang, which I couldn't make sense of)
・ nor how stable this connection is (seeing there are reports of the chip bugging out after an hour)
… but anyway, I was able to download 1GB at my full ISP speed.

Seems like there's some minor workaround applied during module reload that ought to also be applied during its initial load.
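
For anyone wanting to try the same thing, here is a minimal sketch of the reload sequence described in this comment, wrapped in a helper that could be run once after a cold boot, before dhcpcd or ethtool touch the interface. The function name `reload_igc` and the `RUN` dry-run variable are made up for illustration and are not part of the driver or any existing tool; whether the reload reliably avoids the lockup is untested beyond what is reported above.

```shell
#!/bin/sh
# Hypothetical helper (not from the igc driver or any distro package).
# It only wraps the sequence reported above: rmmod igc && modprobe igc.
#
# RUN is an assumption of this sketch: leave it empty to really run the
# commands (needs root), or set RUN=echo to just print them.
RUN=${RUN:-}

reload_igc() {
    # Unload the driver, then load it again only if the unload succeeded.
    $RUN rmmod igc && $RUN modprobe igc
}
```

Calling `RUN=echo; reload_igc` prints the two commands without touching the module, which is a safe way to check the script before running it as root.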

Comment 4 Andreas Reis 2023-07-24 14:10:39 UTC

6.4.5, which includes several commits to igc, did not solve this.

Given the apparent solution is already somewhere in the module unload (& [re?]load) code, could somebody please add Intel devs concerned with the latter to this bug's CC?

E.g., from the recent relevant commit
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c080fe262f9e73a00934b70c16b1479cf40cd2bd :

Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tony Nguyen <anthony.l.nguyen@intel.com>

Comment 5 Andreas Reis 2023-07-31 09:40:05 UTC

Anyone alive? It would be really neat if this was forwarded to the relevant Intel devs…

Comment 6 Andreas Reis 2023-08-07 10:10:13 UTC

If apparently this isn't the place for Intel igc bug reports, could somebody please point me to the proper channel?
