Bug 219124 - [iwlegacy] kernel oops iwl4965
Summary: [iwlegacy] kernel oops iwl4965
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless-intel
Hardware: i386 Linux
Importance: P3 normal
Assignee: Default virtual assignee for network-wireless-intel
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-08-04 15:53 UTC by Martin-Éric Racine
Modified: 2024-12-04 07:42 UTC

See Also:
Kernel Version: 6.5
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
kernel oops 6.10.3 (7.15 KB, text/plain)
2024-08-05 16:30 UTC, Martin-Éric Racine
possible fix for iwlegacy (3.37 KB, patch)
2024-09-01 19:43 UTC, Ben Hutchings
Trace from patched 6.11.0-rc6 kernel (7.06 KB, text/plain)
2024-09-07 02:04 UTC, Brandon Nielsen
dmesg with above patch (72.83 KB, text/plain)
2024-09-10 12:25 UTC, Martin-Éric Racine
dmesg 6.10.11-686-pae (72.27 KB, text/plain)
2024-09-23 11:38 UTC, Martin-Éric Racine
dmesg 6.10.12 (70.68 KB, text/plain)
2024-10-02 11:39 UTC, Martin-Éric Racine
dmesg 6.11.3 (72.36 KB, text/plain)
2024-10-15 06:02 UTC, Martin-Éric Racine

Description Martin-Éric Racine 2024-08-04 15:53:23 UTC
Greetings,

As reported a while back at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1062421 against kernel 6.5 (and still present in kernel 6.9.12), iwlegacy oopses on iwl4965 hardware.

The bug report contains a lot of auto-collected information. Please ping me if anything else is needed.

Thanks!
Martin-Éric
Comment 1 Artem S. Tashkinov 2024-08-05 07:54:09 UTC
Is this reproducible with 6.10.3 or 6.6.44?
Comment 2 Martin-Éric Racine 2024-08-05 08:16:20 UTC
This regression was introduced around 6.5 and remains present in releases up to 6.9.12.

6.10.3 is being built on Debian as we speak. I'll know more in a few hours.
Comment 3 Martin-Éric Racine 2024-08-05 16:30:54 UTC
Created attachment 306668 [details]
kernel oops 6.10.3

Apparently, yes, it still applies to 6.10.3, as per attachment.
Comment 4 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-08-06 16:39:20 UTC
I fear no developer will look into this unless you find the change that broke things using a git bisection. Could you perform one? https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
Comment 5 Ben Hutchings 2024-09-01 19:43:01 UTC
Created attachment 306803 [details]
possible fix for iwlegacy

Hi Martin-Éric, could you please test whether the attached patch fixes this?

You should be able to do that by following the instructions at:
https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#id-1.6.6.4
(although that is hosted on Salsa which is having an outage right now).
Comment 6 Brandon Nielsen 2024-09-07 02:04:07 UTC
Created attachment 306826 [details]
Trace from patched 6.11.0-rc6 kernel

Jumping in as someone with an HP Compaq 8510w with an Intel 4965 wireless adapter. My testing seems to show the patch doesn't completely fix the issue when applied to kernel-6.11.0-0.rc6.
Comment 7 Ben Hutchings 2024-09-07 16:12:17 UTC
> Jumping in as someone with an HP Compaq 8510w with an Intel 4965 wireless
> adapter. My testing seems to show the patch doesn't completely fix the issue
> when applied to kernel-6.11.0-0.rc6.

That error message shows that the target of the memcpy() is still "&out_cmd->cmd.payload" and not "payload", so the patch was not actually applied in your build.
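
For readers following along, here is a rough user-space sketch of why the destination expression matters. It assumes the message comes from the kernel's bounds-checked memcpy(), which compares the copy length against the declared size of the destination field. The structure names, sizes, and helpers below (demo_cmd, demo_slot, demo_room, demo_fill) are invented for illustration; they are not the real iwlegacy definitions and not the contents of the attached patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout, NOT the real iwlegacy structures. */
struct demo_cmd {
    unsigned char hdr[4];
    unsigned char payload[8];  /* declared size a field-aware memcpy() check would see */
};

struct demo_slot {
    struct demo_cmd cmd;
    unsigned char extra[24];   /* storage that really follows the command */
};

/* Byte offset of cmd.payload from the start of the slot. */
static size_t demo_payload_offset(void)
{
    return offsetof(struct demo_slot, cmd) + offsetof(struct demo_cmd, payload);
}

/* Bytes actually available from the start of payload to the end of the slot. */
static size_t demo_room(void)
{
    return sizeof(struct demo_slot) - demo_payload_offset();
}

/*
 * Copy through a plain byte pointer derived from the whole slot, with an
 * explicit check against the real room. A copy aimed at
 * &slot->cmd.payload instead would be judged against the 8-byte field.
 * Returns 0 on success, -1 if len does not fit.
 */
static int demo_fill(struct demo_slot *slot, const void *src, size_t len)
{
    unsigned char *payload = (unsigned char *)slot + demo_payload_offset();

    if (len > demo_room())
        return -1;
    memcpy(payload, src, len);
    return 0;
}
```

In this sketch a 20-byte copy succeeds even though the payload field is declared as 8 bytes, because the bound being checked is the real buffer size rather than the field size.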
Comment 8 Brandon Nielsen 2024-09-08 02:29:23 UTC
(In reply to Ben Hutchings from comment #7)
> > Jumping in as someone with an HP Compaq 8510w with an Intel 4965 wireless
> > adapter. My testing seems to show the patch doesn't completely fix the issue
> > when applied to kernel-6.11.0-0.rc6.
> 
> That error message shows that the target of the memcpy() is still
> "&out_cmd->cmd.payload" and not "payload", so the patch was not actually
> applied in your build.

So sorry about that! Just got around to testing with the patch _actually_ applied.

Not seeing any traces or errors on the journal with the patch applied.

I do still see occasional disassociation <-> reassociation events ("Reason: 2=PREV_AUTH_NOT_VALID"). It was reproducible for 3 speedtests in a row, and then suddenly got better. I see the same behavior without the patch applied, so perhaps it's not related.
Comment 9 Martin-Éric Racine 2024-09-09 08:19:01 UTC
(In reply to Ben Hutchings from comment #5)
> Created attachment 306803 [details]
> possible fix for iwlegacy
> 
> Hi Martin-Éric, could you please test whether the attached patch fixes this?
> 
> You should be able to do that by following the instructions at:
> https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#id-1.6.6.4
> (although that is hosted on Salsa which is having an outage right now).

The 'test-patch' command barfed. It tried configuring for flavor 'pae' instead of the expected '686-pae' flavor. Bug filed against 'devscripts' at Debian.
Comment 10 Martin-Éric Racine 2024-09-10 12:25:01 UTC
Created attachment 306849 [details]
dmesg with above patch

The patch indeed seems to quiet down the iwl4965 messages, but it introduced the following:

WARNING: CPU: 0 PID: 1 at arch/x86/mm/pti.c:394 pti_clone_pgtable+0x2a1/0x2dc

The trace for this appears in the above dmesg output.
Comment 11 Ben Hutchings 2024-09-10 17:53:51 UTC
(In reply to Martin-Éric Racine from comment #10)
> Created attachment 306849 [details]
> dmesg with above patch
> 
> The patch indeed seems to quiet down the iwl4965 messages, but it introduced
> the following:
> 
> WARNING: CPU: 0 PID: 1 at arch/x86/mm/pti.c:394 pti_clone_pgtable+0x2a1/0x2dc
> 
> The trace for this appears in the above dmesg output.

It can't have introduced that warning, because that was emitted before the driver was even loaded.
Comment 12 Martin-Éric Racine 2024-09-11 05:26:42 UTC
(In reply to Ben Hutchings from comment #11)
> (In reply to Martin-Éric Racine from comment #10)
> > Created attachment 306849 [details]
> > dmesg with above patch
> > 
> > The patch indeed seems to quiet down the iwl4965 messages, but it introduced
> > the following:
> > 
> > WARNING: CPU: 0 PID: 1 at arch/x86/mm/pti.c:394 pti_clone_pgtable+0x2a1/0x2dc
> > 
> > The trace for this appears in the above dmesg output.
> 
> It can't have introduced that warning, because that was emitted before the
> driver was even loaded.

It's indeed extremely unlikely to have introduced it.

Anyhow, at this point, with two people having tested it on real hardware, I think that the patch effectively fixes the iwl4965 kernel oops.

As to what causes the above PTI warnings, that's a separate issue.
Comment 13 Martin-Éric Racine 2024-09-15 07:56:58 UTC
PS: I welcome pointers on which module should get the bug report about the above PTI oops.
Comment 14 Ben Hutchings 2024-09-15 16:11:24 UTC
(In reply to Martin-Éric Racine from comment #13)
> PS: I welcome pointers on which module should get the bug report about the
> above PTI oops.

That was already reported in <https://lore.kernel.org/all/e541b49b-9cc2-47bb-b283-2de70ae3a359@roeck-us.net/>. The fix went into 6.11-rc3 but is not in 6.10-stable yet.
Comment 15 Ben Hutchings 2024-09-15 16:25:53 UTC
(In reply to Ben Hutchings from comment #14)
> (In reply to Martin-Éric Racine from comment #13)
> > PS: I welcome pointers on which module should get the bug report about the
> > above PTI oops.
> 
> That was already reported in
> <https://lore.kernel.org/all/e541b49b-9cc2-47bb-b283-2de70ae3a359@roeck-us.net/>.
> The fix went into 6.11-rc3 but is not in 6.10-stable yet.

Actually it's in 6.10.10.
Comment 16 Martin-Éric Racine 2024-09-23 11:36:45 UTC
The PTI issue seems to be fixed in Debian 6.10.11-1, but the iwl4965 issue isn't.
Comment 17 Martin-Éric Racine 2024-09-23 11:38:39 UTC
Created attachment 306913 [details]
dmesg 6.10.11-686-pae
Comment 18 Martin-Éric Racine 2024-10-02 11:39:41 UTC
Created attachment 306946 [details]
dmesg 6.10.12

Still not fixed as of Debian 6.10.12-1.
Comment 19 Martin-Éric Racine 2024-10-15 06:02:01 UTC
Created attachment 307008 [details]
dmesg 6.11.3

Not fixed as of 6.11.3.
Comment 20 Martin-Éric Racine 2024-10-17 08:32:24 UTC
I really have to wonder whether the fix ever got merged at all. I still get the same kernel oops with 6.10.12 and 6.11.3 as before.
Comment 21 Brandon Nielsen 2024-11-03 20:11:18 UTC
(In reply to Martin-Éric Racine from comment #20)
> I really have to wonder whether the fix ever got merged at all. I still get
> the same kernel oops with 6.10.12 and 6.11.3 as before.

I don't believe it did.

I still see the issue with 6.11.4.
Comment 22 Ben Hutchings 2024-11-03 20:30:39 UTC
The fix has been merged and will be included in v6.12-rc6.

It's also under review for inclusion on the 6.1, 6.6, and 6.11 stable branches.
Comment 23 Martin-Éric Racine 2024-11-10 07:45:59 UTC
Merged into 6.11.7. Kernel oops is gone. Thanks.

However, 6.11.7 also merged plenty of related "fixes" as a result of which the connection is now somewhat slow and unstable, but that's for a different bug report.

It however should be noted that this driver worked fine until about kernel 6.4, when someone decided that refactoring a bunch of code took precedence over the old adage "if it ain't broken, don't fix it." Given this, I really have to urge caution when pondering the necessity of backporting changes to kernel 6.1, which is the only one currently giving tried and proven, reliable WiFi on this chipset.
Comment 24 Martin-Éric Racine 2024-12-04 07:42:52 UTC
Unless I'm mistaken, this has now been merged into every possible stable release. We can probably close this?
