Bug 217710 - rtw89_8852ae connection drops since kernel 6.3.8
Summary: rtw89_8852ae connection drops since kernel 6.3.8
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P3 normal
Assignee: networking_wireless@kernel-bugs.osdl.org
Reported: 2023-07-26 10:57 UTC by Damian B
Modified: 2023-08-24 15:03 UTC
Kernel Version: v6.1-rc1
Regression: Yes
Bisected commit-id: a0d99ebb3ecd3cc39d1680e6edb4757013f535fd


Attachments
dmesg (112.95 KB, text/plain), 2023-07-26 10:57 UTC, Damian B
bisect log (1.69 KB, text/plain), 2023-07-26 11:04 UTC, Damian B
dmesg-6.1.42 (169.45 KB, text/plain), 2023-07-30 19:39 UTC, Damian B
Patch with potential fix (497 bytes, patch), 2023-08-03 18:37 UTC, Damian B
Draft fix v1 (1.32 KB, patch), 2023-08-04 01:50 UTC, Ping-Ke Shih

Description Damian B 2023-07-26 10:57:33 UTC
Created attachment 304699
dmesg

Hi,

I have an RTL8852AE WiFi card in my laptop running Fedora 38.

03:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8852AE 802.11ax PCIe Wireless Network Adapter

Since I upgraded to kernel 6.3.8, I've been facing connection drops after a couple of minutes of uptime. 6.3.7 is the latest kernel that works for me without issues. The problem still appears on kernel 6.4.6.

I ran git bisect, and the problematic commit seems to be the following:

c009ddc88325ec508f3fd3da646d0f19ebf1bbb1 is the first bad commit
commit c009ddc88325ec508f3fd3da646d0f19ebf1bbb1
Author: Ping-Ke Shih <pkshih@realtek.com>
Date:   Sat May 27 16:29:38 2023 +0800

    wifi: rtw89: correct PS calculation for SUPPORTS_DYNAMIC_PS


Please let me know if I can provide anything more to help investigate this issue.
Comment 1 Damian B 2023-07-26 11:04:15 UTC
Created attachment 304700
bisect log
Comment 2 Ping-Ke Shih 2023-07-27 01:16:55 UTC
According to the commit message of "wifi: rtw89: correct PS calculation for SUPPORTS_DYNAMIC_PS", PS has been broken since kernel 5.20, and this commit fixes it. So the connection problem should also occur on kernel 5.19 and earlier. Can you bisect the cause?


As a workaround for 6.3.8, please try:
1. Turn off deep PS by loading the module (insmod) with the parameter
   disable_ps_mode=y
2. If option 1 doesn't work, disable power save in the NetworkManager configuration file
   /etc/NetworkManager/conf.d/default-wifi-powersave-on.conf
   wifi.powersave = 2

At runtime, these two workarounds are equivalent to rolling back to 6.3.7.
Comment 3 Ping-Ke Shih 2023-07-27 01:48:38 UTC
FYI. 

I ran 8852AE on kernel 6.5-rc2 for 20+ minutes, and it worked well.
Please also try another band (5GHz -> 2GHz, or opposite) or another AP.
Comment 4 Damian B 2023-07-27 15:15:54 UTC
(In reply to Ping-Ke Shih from comment #2)
> According to the commit message of "wifi: rtw89: correct PS calculation for
> SUPPORTS_DYNAMIC_PS", PS has been broken since kernel 5.20, and this commit
> fixes it. So the connection problem should also occur on kernel 5.19 and
> earlier. Can you bisect the cause?

Thank you for your response.

I tried the following kernels:
5.17.5 - works fine (Fedora 36 on live USB)
5.19.17 - works fine (compiled myself on Fedora 38)

I couldn't compile and test a 5.18 version for some reason. However, that might be pointless, since the card works on both 5.17 and 5.19. If you think it's still worth trying, I'll do my best to test on this kernel version.


(In reply to Ping-Ke Shih from comment #3)
> Please also try another band (5GHz -> 2GHz, or opposite) or another AP.

Changing the band from 5GHz to 2.4GHz solves the issue, and I no longer see any errors in the dmesg logs either. So it seems the issue only happens on 5GHz connections (at least in my case).
Could you try to reproduce this issue on your side using 5GHz, please?
Comment 5 Ping-Ke Shih 2023-07-28 00:20:04 UTC
> If you think it's still worth trying, I'll do my best to test on this kernel
> version.

Could you please try 6.1.42? Then I can narrow down the problem more easily.

> Could you try to reproduce this issue on your side using 5GHz, please?

I did that yesterday. Maybe this is an interoperability problem, so you can also try another AP to see if the symptom disappears.
Comment 6 Damian B 2023-07-30 19:39:03 UTC
(In reply to Ping-Ke Shih from comment #5)
> Could you please try 6.1.42? Then I can narrow down the problem more easily.

Yes, I've tried it, and it has a very similar issue, although it appears a little more stable and the system reconnects by itself after a connection drop. I'm attaching dmesg logs.

I also did some experimenting: I compiled kernels 6.0 and 6.0.19 with your patch (wifi: rtw89: correct PS calculation for SUPPORTS_DYNAMIC_PS), and both work perfectly and do NOT log any errors in dmesg. Starting from 6.1.0-rc1 I can see driver errors in dmesg. Do you think it's worth bisecting to find out which commit started to pollute the logs with errors?

BTW, did you check the dmesg logs I attached in the first comment and here? There are some errors logged by the rtw89 driver, so maybe they can help identify the issue.

> I did that yesterday. Maybe this is an interoperability problem, so you can
> also try another AP to see if the symptom disappears.

I don't have that possibility currently, but I will try to think of something.
Comment 7 Damian B 2023-07-30 19:39:43 UTC
Created attachment 304736
dmesg-6.1.42
Comment 8 Ping-Ke Shih 2023-08-01 00:39:37 UTC
> did you check the dmesg logs I attached in the first comment and here?

It shows that the hardware enters an abnormal state, and the firmware raises an event to the driver to recover the state.

In the case below, recovery appears to complete, but the connection drops about 480 (5930-5450) seconds later:

[ 5450.609433] rtw89_8852ae 0000:03:00.0: SER catches error: 0x1002
[ 5450.660514] rtw89_8852ae 0000:03:00.0: c2h class 1 func 3 not support
[ 5930.218515] wlp3s0: deauthenticated from 78:d2:94:76:30:ea (Reason: 3=DEAUTH_LEAVING)
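
The interval quoted above can be checked directly from the log timestamps. The following is a minimal sketch, assuming each line starts with the usual "[ seconds.microseconds]" dmesg prefix; the helper names dmesg_timestamp() and dmesg_elapsed() are hypothetical and not part of any existing tool.

```c
#include <stdio.h>

/* Parse the leading "[ seconds.micros]" timestamp of a dmesg line.
 * Returns the timestamp in seconds, or -1.0 if the prefix is missing. */
double dmesg_timestamp(const char *line)
{
	double seconds;

	/* "[" matches the literal bracket; %lf skips the padding spaces. */
	if (sscanf(line, "[%lf]", &seconds) != 1)
		return -1.0;
	return seconds;
}

/* Seconds elapsed between two dmesg lines, or -1.0 if either line
 * does not carry a parsable timestamp. */
double dmesg_elapsed(const char *earlier, const char *later)
{
	double t0 = dmesg_timestamp(earlier);
	double t1 = dmesg_timestamp(later);

	if (t0 < 0.0 || t1 < 0.0)
		return -1.0;
	return t1 - t0;
}
```

Passing the "SER catches error" line and the "deauthenticated" line to dmesg_elapsed() gives roughly 479.6 seconds, which matches the "about 480" figure.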



Summary of current experiments:
1. 6.0 and 6.0.19 + PS patch: work perfectly without any error log
2. 6.1.42 (with PS patch): SER
3. 6.3.8 (with PS patch): connection drops (please help check whether the kernel log shows errors as in experiment 1)

> Do you think it's worth bisecting to find out which commit started to
> pollute the logs with errors?

I checked the difference between v6.0.19 and v6.1-rc1, which contains a lot of changes, so it is hard to pinpoint the problem by eye. Bisecting would be helpful. But please collect the log of kernel 6.3.8 first to see whether we have only a single problem.
Comment 9 Ping-Ke Shih 2023-08-01 00:53:23 UTC
> Starting from 6.1.0-rc1 I can see driver errors in dmesg.

You mentioned the above, but the attached log is named dmesg-6.1.42. Please correct my summary if anything is wrong.
Comment 10 Damian B 2023-08-01 18:58:00 UTC
(In reply to Ping-Ke Shih from comment #8)
> Summary of current experiments:
> 1. 6.0 and 6.0.19 + PS patch: work perfectly without any error log
> 2. 6.1.42 (with PS patch): SER
> 3. 6.3.8 (with PS patch): connection drops (please help check whether the
> kernel log shows errors as in experiment 1)

1. correct
2. correct
3. It's the same as 2. I attached logs in the first attachment, called 'dmesg'. Although they are from the 6.4.6 kernel, the behavior is pretty much the same as in 6.3.8 (SER errors logged and connection drops).

> I checked the difference between v6.0.19 and v6.1-rc1, which contains a lot
> of changes, so it is hard to pinpoint the problem by eye. Bisecting would
> be helpful.

I bisected, and it looks like the problematic commit is this one:

commit a0d99ebb3ecd3cc39d1680e6edb4757013f535fd
Author: Ping-Ke Shih <pkshih@realtek.com>
Date:   Fri Sep 16 11:38:05 2022 +0800

    wifi: rtw89: initialize DMA of CMAC

I checked the changes in this commit, and to me it looks like the following condition is wrong:

+	if (chip_id != RTL8852A && chip_id != RTL8852B)
+		return 0;

This condition will return 0 for RTL8852C and execute the code that follows for RTL8852A and RTL8852B, which seems wrong to me, as the change description says it is for RTL8852C. I might not have understood this change, though, so please don't take my word for it; check it yourself. In the meantime I'm compiling 6.3.8 without a0d99ebb3ecd3cc39d1680e6edb4757013f535fd, and I will let you know how it behaves.
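
The reading of that guard can be checked with a small standalone sketch (this is not driver code). The enum below is a hypothetical stand-in for the driver's chip identifiers, whose real definition is not shown in this thread, and guard_returns_early() is an illustrative helper that mirrors only the quoted condition.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver's chip identifiers. */
enum chip_id { RTL8852A, RTL8852B, RTL8852C };

/* Mirrors the quoted guard: true means the function would hit
 * "return 0" and skip the rest of the CMAC DMA initialization. */
bool guard_returns_early(enum chip_id chip_id)
{
	return chip_id != RTL8852A && chip_id != RTL8852B;
}
```

Under these assumptions the helper is false for RTL8852A and RTL8852B and true for RTL8852C, so the initialization code after the guard runs on the two chips the commit description does not mention, including the 8852AE in this report.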
Comment 11 Damian B 2023-08-01 19:59:34 UTC
I can confirm that reverting a0d99ebb3ecd3cc39d1680e6edb4757013f535fd from 6.3.8 fixes the issue for me. I don't see any connection drops after 40 minutes of uptime, and there are no driver errors in dmesg.
Comment 12 Ping-Ke Shih 2023-08-02 00:39:12 UTC
Thanks for the bisection. I will check with my colleague to diagnose this problem.
Comment 13 Damian B 2023-08-03 18:37:45 UTC
Created attachment 304768
Patch with potential fix

Hi, any news about a fix?

I'm attaching a patch with my own fix. I've been running it for two days already, and it fixes the issue for me.
Feel free to use it if it looks good to you.
Comment 14 Ping-Ke Shih 2023-08-04 01:50:19 UTC
Created attachment 304769
Draft fix v1

The commit message of ("wifi: rtw89: initialize DMA of CMAC") was wrong to mention 8852C; the function cmac_dma_init() should be called for 8852B only.

Please check whether this draft patch works in your environment (I think it will, because the logic for 8852AE is the same as yours). Thank you.
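
The attached draft patch is not reproduced in this thread, so the following is only a sketch of the behavior the comment describes: the CMAC DMA initialization applies to 8852B and is skipped for every other chip. The enum and the helper name cmac_dma_init_applies() are hypothetical, chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver's chip identifiers. */
enum chip_id { RTL8852A, RTL8852B, RTL8852C };

/* true when the CMAC DMA initialization should do its work for this
 * chip; per the comment above, that is 8852B only. */
bool cmac_dma_init_applies(enum chip_id chip_id)
{
	return chip_id == RTL8852B;
}
```

With this predicate the 8852A path is no longer touched, which is consistent with the reporter's own fix also working on the 8852AE.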
Comment 15 Damian B 2023-08-04 09:46:45 UTC
(In reply to Ping-Ke Shih from comment #14)
> Created attachment 304769
> Draft fix v1
> 
> The commit message of ("wifi: rtw89: initialize DMA of CMAC") was wrong to
> mention 8852C; the function cmac_dma_init() should be called for 8852B only.
> 
> Please check whether this draft patch works in your environment (I think it
> will, because the logic for 8852AE is the same as yours). Thank you.

Thank you for looking into the issue and providing a fix.

I'm trying your patch now, and it works perfectly. After 4 hours of usage there are no connection drops and not a single error in the dmesg log. From my perspective it's good to go.
Comment 16 Ping-Ke Shih 2023-08-04 11:24:24 UTC
I have sent out the patch [1]. It would be nice to have your Tested-by tag. 
Thank you.


[1] https://lore.kernel.org/linux-wireless/20230804105002.5781-1-pkshih@realtek.com/T/#u
Comment 17 Damian B 2023-08-04 13:13:34 UTC
(In reply to Ping-Ke Shih from comment #16)
> It would be nice to have your Tested-by tag. 

Not sure how to do this... I'm not familiar with the process here. Any tips? Ignore me if it's not strictly needed.
Thank you
Comment 18 Ping-Ke Shih 2023-08-06 02:38:15 UTC
I have Cc'd the patch to your Gmail address, so just reply-all to that mail in plain text mode (no top posting) and add

   Tested-by: Damian B <bronecki.damian@gmail.com>

This helps the maintainer know that you have confirmed that this patch works.


An example is [1], where I added my Reviewed-by.

[1] https://lore.kernel.org/linux-wireless/88026c906c8c4b12a7524c93bc25c850@realtek.com/
Comment 19 Damian B 2023-08-21 06:46:41 UTC
Resolving this bug, since it's fixed and has worked fine for me since kernel 6.4.11.

Thanks