Bug 217635 - iwlwifi driver broken on Intel 3165 network card
Summary: iwlwifi driver broken on Intel 3165 network card
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless-intel
Hardware: Intel Linux
Importance: P3 high
Assignee: Default virtual assignee for network-wireless-intel
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-07-05 16:26 UTC by joey.joey586
Modified: 2023-07-29 02:24 UTC

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg logs (49.75 KB, application/zip)
2023-07-05 16:26 UTC, joey.joey586
git bisect log (5.80 KB, text/plain)
2023-07-09 19:15 UTC, joey.joey586
second try of git bisect (5.19 KB, text/plain)
2023-07-11 16:48 UTC, joey.joey586

Description joey.joey586 2023-07-05 16:26:00 UTC
Created attachment 304552 [details]
dmesg logs

Distro: Arch Linux
Kernel version: 6.4.1.arch1-1
Happens on mainline kernel? : YES (linux-mainline 6.4-1)
Note: linux-mainline 6.4.1 is not available at time of this writing

Arch linux bug:
https://bugs.archlinux.org/task/78984

Summary:
No network access even after connecting to wifi. Websites don't load, ping doesn't work.
This didn't happen on kernel 6.3.x (specifically 6.3.9, the last 6.3 kernel provided by Arch).

Bug happens on both Arch-provided kernel and mainline kernel

Steps to reproduce:
1) On fresh boot, connect to a wifi network
   a) Make sure wifi password is not saved beforehand
2) Ping a URL from a terminal, or open a website in a browser
3) Ping fails to work / website doesn't load
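
A minimal sketch of the check in step 2, for anyone repeating the test after each kernel build. It is not part of the original report: `check_connectivity` is a hypothetical helper name, and the `-c`/`-W` options assume the iputils version of `ping`.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): report whether a host
# answers ICMP echo requests.
# Usage: check_connectivity <host>
check_connectivity() {
    host=$1
    # -c 3: send three probes; -W 2: wait up to 2 seconds for each reply
    # (option meanings as in iputils ping; other ping variants may differ).
    if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
        echo "OK $host"
    else
        echo "FAIL $host"
        return 1
    fi
}
```

Running it first against the router's IP address and then against a hostname would help tell a dead link apart from a DNS problem.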
Comment 1 Bagas Sanjaya 2023-07-06 03:00:27 UTC
(In reply to joey.joey586 from comment #0)

Can you perform bisection between v6.3 and v6.4?
Comment 2 Bagas Sanjaya 2023-07-06 03:02:24 UTC
I guess this is related to similar issue reported on LKML [1].

[1]: https://lore.kernel.org/lkml/CAAJw_ZueYAHQtM++4259TXcxQ_btcRQKiX93u85WEs2b2p19wA@mail.gmail.com/
Comment 3 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-06 11:15:28 UTC
(In reply to Bagas Sanjaya from comment #2)
> I guess this is related to similar issue reported on LKML [1].

FWIW, that is about mainline, this is about 6.4 -- and the problem looks different as well. So I doubt somewhat that these are the same problems. 

A bisection would be really helpful.
Comment 4 joey.joey586 2023-07-06 16:33:54 UTC
Bisecting now, might take a while
Comment 5 joey.joey586 2023-07-06 18:14:17 UTC
(In reply to joey.joey586 from comment #4)
> Bisecting now, might take a while

Sorry, I'm unable to bisect. After running 'git bisect bad' once, the kernel fails to build with error:

make[5]: *** No rule to make target 'zip.h', needed by '/home/poweruser/Downloads/linux-git/src/linux-torvalds/tools/bpf/resolve_btfids/libbpf/staticobjs/libbpf.o'.  Stop.
make[4]: *** [Makefile:157: /home/poweruser/Downloads/linux-git/src/linux-torvalds/tools/bpf/resolve_btfids/libbpf/staticobjs/libbpf-in.o] Error 2
make[3]: *** [Makefile:63: /home/poweruser/Downloads/linux-git/src/linux-torvalds/tools/bpf/resolve_btfids//libbpf/libbpf.a] Error 2
make[2]: *** [Makefile:76: bpf/resolve_btfids] Error 2
make[1]: *** [Makefile:1440: tools/bpf/resolve_btfids] Error 2
make[1]: *** Waiting for unfinished jobs....
  CALL    scripts/checksyscalls.sh
make: *** [Makefile:358: __build_one_by_one] Error 2
==> ERROR: A failure occurred in build().
    Aborting...
makepkg -efs  7.42s user 2.72s system 114% cpu 8.843 total

Same error happens on every git bisect bad
I have zero experience with kernel development/building, so I have no idea what to do.
Comment 6 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-07 08:52:54 UTC
(In reply to joey.joey586 from comment #5)
> (In reply to joey.joey586 from comment #4)
>
> Sorry, I'm unable to bisect. After running 'git bisect bad' once, the kernel
> fails to build

That's not the kernel, that's the kernel tools; you don't need those to run a kernel.

> with error:
> 
> make[5]: *** No rule to make target 'zip.h', needed by

You likely need a package called libzip-devel (or something like that -- whatever provides zip.h on your distro).
Comment 7 joey.joey586 2023-07-08 17:12:36 UTC
libzip provides zip.h, so why is it complaining about this error?

Here's my terminal output:
joey@joey ~ % pacman -Qo /usr/include/zip.h 
/usr/include/zip.h is owned by libzip 1.10.0-1
Comment 8 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-09 06:17:16 UTC
From a quick search it seems there was a bug: https://lore.kernel.org/all/ZFJ39HKzBUg64QPO@kernel.org/

But again: you don't need to build the tools, just build the kernel
Comment 9 joey.joey586 2023-07-09 07:56:34 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #8)
> From a quick search it seems there was a bug:
> https://lore.kernel.org/all/ZFJ39HKzBUg64QPO@kernel.org/
> 
> But again: you don't need to build the tools, just build the kernel

How do I do that? I tried replacing 'make all' with 'make vmlinux' (https://www.kernel.org/doc/makehelp.txt), but it still complains about zip.h

I'm using the Arch PKGBUILD for linux-git here:
https://aur.archlinux.org/packages/linux-git
Comment 10 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-09 08:24:02 UTC
(In reply to joey.joey586 from comment #9)
>  but it still complains about zip.h

Guess that tool then is needed during build. Apologies. 

Did a quick look, sadly could not find a fix for this. Try "git bisect skip", with a bit of luck it will avoid the problematic area.
Comment 11 joey.joey586 2023-07-09 15:59:01 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #10)
> (In reply to joey.joey586 from comment #9)
> >  but it still complains about zip.h
> Did a quick look, sadly could not find a fix for this. Try "git bisect
> skip", with a bit of luck it will avoid the problematic area

"git bisect skip" works, thanks!

And I think I found the bad commit:
[bd54f3c29077f23dad92ef82a78061b40be30c65] wifi: mac80211: generate EMA beacons in AP mode

Here's my terminal log:
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[master] % git bisect start
status: waiting for both good and bad commits
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[master|bisect] % git bisect good v6.3
status: waiting for bad commit, 1 good commit known
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[master|bisect] % git bisect bad v6.4
Bisecting: 8012 revisions left to test after this (roughly 13 steps)
[d42b1c47570eb2ed818dc3fe94b2678124af109d] Merge tag 'devicetree-for-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
git bisect bad v6.4  4.53s user 1.23s system 99% cpu 5.801 total
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[v6.4-rc1~128|bisect] % git bisect skip
Bisecting: 8012 revisions left to test after this (roughly 13 steps)
[1423885c84a5b3a53b79bcf241b18124d0d7cba6] cxl/hdm: Use 4-byte reads to retrieve HDM decoder base+limit
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[v6.4-rc1~68^2~2^2~3|bisect] % git bisect skip
Bisecting: 8012 revisions left to test after this (roughly 13 steps)
[2124f79de6a909630d1a62b01ecc32db9f967181] mm: shrinkers: fix debugfs file permissions
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[v6.4-rc1~103^2~12|bisect] % git bisect skip
Bisecting: 8012 revisions left to test after this (roughly 13 steps)
[c9fa320b00cff04980b8514d497068e59a8ee131] xfrm: copy_to_user_state fetch offloaded SA packets/bytes statistics
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[v6.4-rc1~132^2~231^2~4|bisect] % git bisect skip
Bisecting: 8012 revisions left to test after this (roughly 13 steps)
[bd54f3c29077f23dad92ef82a78061b40be30c65] wifi: mac80211: generate EMA beacons in AP mode
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[v6.4-rc1~132^2~151^2~82|bisect] %

After that last "git bisect skip" the kernel compiled successfully and the wifi stopped crashing.
Comment 12 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-09 16:25:36 UTC
I'm missing something here; you referred to the last "git bisect skip" which afaics is

> rc1~132^2~231^2~4|bisect] % git bisect skip
> Bisecting: 8012 revisions left to test after this (roughly 13 steps)
> [bd54f3c29077f23dad92ef82a78061b40be30c65] wifi: mac80211: generate EMA
> beacons in AP mode

Which sounds like you need to mark bd54f3c29077f23dad92ef82a78061b40be30c65 as bad and continue.
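
For readers following along: "git bisect skip" is meant only for commits that cannot be built or tested. Every commit that did build and boot has to be marked good or bad, otherwise the range never shrinks (the log above stays at 8012 revisions). The sketch below is not from the thread. `bisect_verdict` is a hypothetical helper that maps a build result and a test result to the exit codes `git bisect run` accepts (0 = good, 125 = skip, other codes from 1 to 127 = bad).

```shell
#!/bin/sh
# Hypothetical helper, not from the thread: translate the outcome of one
# bisect step into the exit code that `git bisect run` understands.
#   $1 = exit status of the kernel build (0 = built)
#   $2 = exit status of the wifi test    (0 = network works)
bisect_verdict() {
    build_status=$1
    test_status=$2
    if [ "$build_status" -ne 0 ]; then
        echo 125   # cannot build (e.g. the zip.h error): like "git bisect skip"
    elif [ "$test_status" -ne 0 ]; then
        echo 1     # built, but wifi is broken: like "git bisect bad"
    else
        echo 0     # built and wifi works: like "git bisect good"
    fi
}

# Manual equivalent, as used in this report:
#   git bisect start
#   git bisect good v6.3
#   git bisect bad v6.4
#   ...build, boot, test, then exactly one of:
#   git bisect good | git bisect bad | git bisect skip
```

A kernel bisection needs a reboot per step, so it usually cannot be handed to `git bisect run` as-is. The mapping is the useful part: the same three-way decision applies when the commands are typed by hand.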
Comment 13 joey.joey586 2023-07-09 19:15:11 UTC
Created attachment 304570 [details]
git bisect log

log of the git bisect
Comment 14 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-10 10:07:06 UTC
(In reply to joey.joey586 from comment #13)
> log of the git bisect

thx for this, sorry, looked a bit odd earlier from here.

Forwarded the report to the developers:
https://lore.kernel.org/all/6f8715af-95c2-8333-2b32-206a143ebb52@leemhuis.info/
Comment 15 joey.joey586 2023-07-10 16:36:01 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #14)
> Forwarded the report to the developers:
> https://lore.kernel.org/all/6f8715af-95c2-8333-2b32-206a143ebb52@leemhuis.
> info/

Thanks, I appreciate it.
Comment 16 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-10 16:49:11 UTC
Could you please recheck your bisection? Johannes doubts it was correct:

https://lore.kernel.org/all/047c7bdc8057175f2bb78981a5f1a1aa6b493153.camel@sipsolutions.net/
Comment 17 joey.joey586 2023-07-11 03:58:43 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #16)
> Could you please recheck your bisection? Johannes doubts it was correct:

Alright, I'll redo the bisect.
Comment 18 joey.joey586 2023-07-11 16:46:06 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #16)
> Could you please recheck your bisection? Johannes doubts it was correct:

Redid the bisection, got a different result:
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[v6.4-rc1~132^2~254|bisect] % git bisect good
5fc3f6c90cca19e4b13433621d9c2dcae875f4d7 is the first bad commit
commit 5fc3f6c90cca19e4b13433621d9c2dcae875f4d7
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Sat Mar 18 22:50:10 2023 +0100

    r8169: consolidate disabling ASPM before EPHY access

    Now that rtl_hw_aspm_clkreq_enable() is a no-op for chip versions < 32,
    we can consolidate disabling ASPM before EPHY access in rtl_hw_start().

    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

 drivers/net/ethernet/realtek/r8169_main.c | 42 +++----------------------------
 1 file changed, 3 insertions(+), 39 deletions(-)
joey@joey ~/Desktop/linux-git/src/linux-torvalds (git)-[bisect/good-c3892e8c51d27f73341eab042afa147a7ca2b966|bisect] %
Comment 19 joey.joey586 2023-07-11 16:48:21 UTC
Created attachment 304610 [details]
second try of git bisect

full 'git bisect' log
Comment 20 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-11 17:11:17 UTC
r8169? That's somewhat odd as well, but who knows. Could you try to revert it on top of a kernel version you know is affected (e.g. 6.4 or 6.4.1) to verify this result? And a shot in the dark: does blacklisting the driver change anything?
Comment 21 joey.joey586 2023-07-11 20:24:16 UTC
How can I do that? Sorry, I'm not familiar with git.
Comment 22 joey.joey586 2023-07-11 20:28:45 UTC
and I use localmodconfig to build the kernel, so r8169 driver might not even exist in the kernel
Comment 23 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-12 05:05:33 UTC
(In reply to joey.joey586 from comment #21)
> How can I do that? Sorry, I'm not familiar with git.

git checkout --detach v6.4
git revert 5fc3f6c90cca19e4b13433621d9c2dcae875f4d7 --no-edit
[build again]

(In reply to joey.joey586 from comment #22)
> and I use localmodconfig to build the kernel, so r8169 driver might not even
> exist in the kernel

In an earlier dmesg it was loaded, iirc.
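
localmodconfig builds its configuration from the modules that are loaded at the time, so a quick way to settle whether r8169 is part of the build is to look for it in the loaded-module list. A small sketch, not taken from the thread: `module_loaded` is a hypothetical helper that reads /proc/modules (the same data `lsmod` prints).

```shell
#!/bin/sh
# Hypothetical helper, not from the thread.
# Usage: module_loaded <name> [modules-file]
# Succeeds (exit 0) if <name> appears as a loaded module.
module_loaded() {
    name=$1
    file=${2:-/proc/modules}
    # The first field of each line in /proc/modules is the module name.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

# Example:
#   module_loaded r8169 && echo "r8169 is loaded"
```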
Comment 24 joey.joey586 2023-07-12 08:07:38 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #20)
> And a shot in the dark: does blacklisting the driver change anything?

Blacklisting r8169 fixes the issue.


(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #23)
> In an earlier dmesg it was loaded, iirc.

You're right, it is loaded. I didn't realize my ethernet is a realtek. I apologize.

I'll rebuild the kernel later tonight.
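
The thread does not say how the driver was blacklisted. One common way is a modprobe.d snippet, sketched below under that assumption. `blacklist_conf` is a hypothetical helper that only prints the snippet, and the file name in the usage comment is likewise just an example.

```shell
#!/bin/sh
# Hypothetical helper, not from the thread: print a modprobe.d snippet
# that stops a module from being loaded automatically.
# Usage: blacklist_conf <module>
blacklist_conf() {
    printf '# Prevent automatic loading of %s\n' "$1"
    printf 'blacklist %s\n' "$1"
}

# Example (file name is an assumption; needs root, takes effect on next boot):
#   blacklist_conf r8169 | sudo tee /etc/modprobe.d/blacklist-r8169.conf
```

Note that with r8169 blacklisted the wired Realtek port stops working, so this is a diagnostic step rather than a fix.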
Comment 25 joey.joey586 2023-07-12 11:10:57 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #23)
> (In reply to joey.joey586 from comment #21)
> > How can I do that? Sorry, I'm not familiar with git.
> 
> git checkout --detach v6.4
> git revert 5fc3f6c90cca19e4b13433621d9c2dcae875f4d7 --no-edit
> [build again]

Reverting 5fc3f6c90cca19e4b13433621d9c2dcae875f4d7 fixes the issue!
Comment 26 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-12 11:25:08 UTC
Thx for confirming; told the relevant people by mail (see link above).
Comment 27 Heiner Kallweit 2023-07-13 05:41:03 UTC
Please test whether the following fixes the issue:

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 9445f04f8..2b3aa6b45 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -2747,6 +2747,13 @@ static void rtl_hw_aspm_clkreq_enable(struct rtl8169_private *tp, bool enable)
 		return;
 
 	if (enable) {
+		/* On these chip versions ASPM can harm even other
+		 * PCI devices.
+		 */
+		if (tp->mac_version == RTL_GIGA_MAC_VER_42 ||
+		    tp->mac_version == RTL_GIGA_MAC_VER_43)
+			return;
+
 		rtl_mod_config5(tp, 0, ASPM_en);
 		rtl_mod_config2(tp, 0, ClkReqEn);
 
-- 
2.41.0
Comment 28 joey.joey586 2023-07-13 09:00:33 UTC
(In reply to Heiner Kallweit from comment #27)
> Please test whether the following fixes the issue:
Thanks, the patch fixes the issue. :)
Comment 29 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-13 10:06:47 UTC
(In reply to Heiner Kallweit from comment #27)
> Please test whether the following fixes the issue:

Thx for this. 

> +             /* On these chip versions ASPM can harm even other
> +              * PCI devices.

The comment makes me wonder: might this also fix or somehow be related to other ASPM-related regression reports with r8169 that as of now are unfixed afaik? I mean these:

https://lore.kernel.org/all/9ebb43ee-52a1-c77d-d609-ca447a32f3e6@posteo.at/
https://lore.kernel.org/all/c3465166-f04d-fcf5-d284-57357abb3f99@freenet.de/
Comment 30 Heiner Kallweit 2023-07-13 10:27:52 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #29)
> (In reply to Heiner Kallweit from comment #27)
> > Please test whether the following fixes the issue:
> 
> Thx for this. 
> 
> > +             /* On these chip versions ASPM can harm even other
> > +              * PCI devices.
> 
> The comment makes me wonder: might this also fix or somehow be related to
> other ASPM-related regression reports with r8169 that as of now are unfixed
> afaik? I mean these:
> 
> https://lore.kernel.org/all/9ebb43ee-52a1-c77d-d609-ca447a32f3e6@posteo.at/
> https://lore.kernel.org/all/c3465166-f04d-fcf5-d284-57357abb3f99@freenet.de/

It's unrelated IMO. Chip versions 42 + 43 have the same MAC, and letting this MAC version trigger a transition to a deeper ASPM state apparently can disturb the root port in a way that even communication with other PCIe devices is affected.
The logic in the fix here was there before and was accidentally removed by "r8169: consolidate disabling ASPM before EPHY access".

The other reports refer to chip version 49 (RTL8168h). This chip version runs fine with ASPM up to L1.1. Interestingly, these reports so far are only about systems where the BIOS instructs the OS not to touch ASPM settings.
This needs some more analysis.
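
For anyone wanting to see which ASPM policy their kernel is using while looking into reports like these, the active policy can be read from sysfs. The sketch below is an addition, not something from the thread. `active_aspm_policy` is a hypothetical helper, and it assumes the kernel exposes /sys/module/pcie_aspm/parameters/policy, where the active entry is shown in square brackets.

```shell
#!/bin/sh
# Hypothetical helper, not from the thread.
# Usage: active_aspm_policy [policy-file]
# Prints the ASPM policy currently selected by the kernel.
active_aspm_policy() {
    file=${1:-/sys/module/pcie_aspm/parameters/policy}
    # The file lists all policies on one line and brackets the active
    # one, e.g. "[default] performance powersave powersupersave".
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}
```

The per-device state is visible separately in the LnkCtl lines of `lspci -vv` (run as root for full output).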
Comment 31 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-13 10:35:48 UTC
(In reply to Heiner Kallweit from comment #30)
> It's unrelated IMO. […]

Many thx for the assessment, much appreciated.
Comment 32 joey.joey586 2023-07-23 15:47:29 UTC
Is the fix already in the mainline kernel? I can't find it in the changelogs.
Comment 33 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-23 16:07:23 UTC
(In reply to joey.joey586 from comment #32)
> Is the fix already in the mainline kernel? I can't find it in the changelogs.

Here it is:
https://git.kernel.org/torvalds/c/162d626f3013215b82b6514ca14f20932c7ccce5
Comment 34 joey.joey586 2023-07-24 04:26:12 UTC
Thanks. I probably should clarify a bit. By 'mainline' I mean the kernel.org website.
Comment 35 joey.joey586 2023-07-24 07:41:56 UTC
What I'm trying to say is: the fix is not present in the 6.4.5 kernel, based on the changelog for that kernel on the kernel.org website. When will it be added to 6.4.x?
Comment 36 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-07-24 07:54:06 UTC
(In reply to joey.joey586 from comment #34)
> By 'mainline' I mean the kernel.org website.

The term "mainline" normally means Linus' git tree.

(In reply to joey.joey586 from comment #35)
> When will it be added to 6.4.x ?

Just checked, it is now queued for the next release of that series.
Comment 37 joey.joey586 2023-07-24 13:50:22 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #36)
> The term "mainline" normally means "Linus git" tree.
Ah, I see.
> Just checked, it is now queued for the next release of that series.
and thanks for the info.
Comment 38 joey.joey586 2023-07-28 15:30:19 UTC
Fixed with kernel v6.4.7
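
To check whether a running kernel from the 6.4 series is at or past the release named above, a version comparison along these lines could be used. This is only a sketch: `version_ge` is a hypothetical helper, it relies on `sort -V` (GNU coreutils), and distribution kernels may carry the fix under a different version number.

```shell
#!/bin/sh
# Hypothetical helper, not from the thread.
# Usage: version_ge <a> <b>
# Succeeds (exit 0) if version <a> is greater than or equal to <b>.
version_ge() {
    # sort -V orders version strings numerically per component; if the
    # smaller of the two is <b>, then <a> is at least <b>.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: strip a distro suffix such as "-arch1-1" before comparing.
#   running=$(uname -r | cut -d- -f1)
#   version_ge "$running" 6.4.7 && echo "6.4.7 or later"
```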
