Bug 215131

Summary: Realtek 8821CE makes the system freeze after 9e2fd29864c5 ("rtw88: add napi support")
Product: Drivers Reporter: Kai-Heng Feng (kai.heng.feng)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: NEW ---    
Severity: normal CC: jhp, pkshih, rick.chen, yihunglin
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: mainline, linux-next Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg with error flood
Workaround patch

Description Kai-Heng Feng 2021-11-24 23:21:21 UTC
After 9e2fd29864c5 ("rtw88: add napi support"), using a system with 8821ce can freeze the system.

Before the system is frozen, we can see lots of warning on "pci bus timeout, check dma status". When that happens, total_pkt_size starts over from 0 and becomes out of sync with rtwpci->rx_tag.

rtw88_pci.disable_aspm=1 seems to workaround the issue, but other methods to prevent the CPU from reaching deeper power saving state can also workaround the issue. 

So the issue seems to depend on how fast CPU can read all data out of RX ring. Use CPU QoS to speed up CPU while doing RX reading can alleviate the problem, but the issue is not completely gone.

disable_aspm=1 is not a viable workaround because it increases power consumption significantly.
Comment 1 Kai-Heng Feng 2021-11-24 23:31:15 UTC
Created attachment 299705 [details]
dmesg with error flood
Comment 2 Ping-Ke Shih 2021-11-26 02:00:19 UTC
If we are focusing on "pci bus timeout, check dma status" instead of "system is frozen". We can try below patch to poll rtwpci->rx_tag to see if driver can process RX packets properly. But, I'm still not sure if these two things are related or not; we can have further discussion after you try the patch.

Indeed, "disable_aspm=1" is a workaround. However, we don't have a platform on hand to analyze. Do you have contacted our FAE to help this?


diff --git a/pci.c b/pci.c
index 3b367c90..18ac5915 100644
--- a/pci.c
+++ b/pci.c
@@ -720,14 +720,19 @@ static void rtw_pci_dma_check(struct rtw_dev *rtwdev,
        struct rtw_pci_rx_buffer_desc *buf_desc;
        u32 desc_sz = chip->rx_buf_desc_sz;
        u16 total_pkt_size;
+       int try = 1000;

        buf_desc = (struct rtw_pci_rx_buffer_desc *)(rx_ring->r.head +
                                                     idx * desc_sz);
-       total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size);

-       /* rx tag mismatch, throw a warning */
-       if (total_pkt_size != rtwpci->rx_tag)
-               rtw_warn(rtwdev, "pci bus timeout, check dma status\n");
+       do {
+               total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size);
+       } while (total_pkt_size != rtwpci->rx_tag && ({udelay(1); try--;}));
+
+       if (try != 1000) {
+               /* rx tag mismatch, throw a warning */
+               rtw_warn(rtwdev, "pci bus timeout, check dma status try=%d\n", try);
+       }

        rtwpci->rx_tag = (rtwpci->rx_tag + 1) % RX_TAG_MAX;
 }
Comment 3 Ping-Ke Shih 2021-11-26 02:02:37 UTC
(In reply to Kai-Heng Feng from comment #0)
> After 9e2fd29864c5 ("rtw88: add napi support"), using a system with 8821ce
> can freeze the system.
> 

Just want to confirm that it works well before this commit?
Comment 4 Kai-Heng Feng 2021-11-26 04:48:32 UTC
(In reply to Ping-Ke Shih from comment #2)
> If we are focusing on "pci bus timeout, check dma status" instead of "system
> is frozen". We can try below patch to poll rtwpci->rx_tag to see if driver
> can process RX packets properly. But, I'm still not sure if these two things
> are related or not; we can have further discussion after you try the patch.

No matter total_pkt_size starts over from 0, so no matter how long the wait is, rx_tag will continue to be different total_pkt_size.
The only way to quench the warning is doing "rtwpci->rx_tag = total_pkt_size".

> 
> Indeed, "disable_aspm=1" is a workaround. However, we don't have a platform
> on hand to analyze. Do you have contacted our FAE to help this?

Yes, but let me check if our FAE has contacted Realtek yet.

> 
> 
> diff --git a/pci.c b/pci.c
> index 3b367c90..18ac5915 100644
> --- a/pci.c
> +++ b/pci.c
> @@ -720,14 +720,19 @@ static void rtw_pci_dma_check(struct rtw_dev *rtwdev,
>         struct rtw_pci_rx_buffer_desc *buf_desc;
>         u32 desc_sz = chip->rx_buf_desc_sz;
>         u16 total_pkt_size;
> +       int try = 1000;
> 
>         buf_desc = (struct rtw_pci_rx_buffer_desc *)(rx_ring->r.head +
>                                                      idx * desc_sz);
> -       total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size);
> 
> -       /* rx tag mismatch, throw a warning */
> -       if (total_pkt_size != rtwpci->rx_tag)
> -               rtw_warn(rtwdev, "pci bus timeout, check dma status\n");
> +       do {
> +               total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size);
> +       } while (total_pkt_size != rtwpci->rx_tag && ({udelay(1); try--;}));
> +
> +       if (try != 1000) {
> +               /* rx tag mismatch, throw a warning */
> +               rtw_warn(rtwdev, "pci bus timeout, check dma status
> try=%d\n", try);
> +       }
> 
>         rtwpci->rx_tag = (rtwpci->rx_tag + 1) % RX_TAG_MAX;
>  }
Comment 5 Kai-Heng Feng 2021-11-26 04:49:50 UTC
(In reply to Ping-Ke Shih from comment #3)
> (In reply to Kai-Heng Feng from comment #0)
> > After 9e2fd29864c5 ("rtw88: add napi support"), using a system with 8821ce
> > can freeze the system.
> > 
> 
> Just want to confirm that it works well before this commit?

Yes. I don't think that commit is the culprit though, it only exposes another underlying bug.
Comment 6 Ping-Ke Shih 2021-11-26 05:30:02 UTC
(In reply to Kai-Heng Feng from comment #4)
> (In reply to Ping-Ke Shih from comment #2)
> > If we are focusing on "pci bus timeout, check dma status" instead of
> "system
> > is frozen". We can try below patch to poll rtwpci->rx_tag to see if driver
> > can process RX packets properly. But, I'm still not sure if these two
> things
> > are related or not; we can have further discussion after you try the patch.
> 
> No matter total_pkt_size starts over from 0, so no matter how long the wait
> is, rx_tag will continue to be different total_pkt_size.
> The only way to quench the warning is doing "rtwpci->rx_tag =
> total_pkt_size".
> 

I wonder if compiler optimize the reading operation in the loop:

	do {
		total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size);
	} while (total_pkt_size != rtwpci->rx_tag && ({udelay(1); try--;}));

So, please add 'volatile' to the access pointer:

	struct rtw_pci_rx_buffer_desc volatile *buf_desc;
Comment 7 Ping-Ke Shih 2021-11-26 05:36:02 UTC
(In reply to Kai-Heng Feng from comment #5)
> (In reply to Ping-Ke Shih from comment #3)
> > (In reply to Kai-Heng Feng from comment #0)
> > > After 9e2fd29864c5 ("rtw88: add napi support"), using a system with
> 8821ce
> > > can freeze the system.
> > > 
> > 
> > Just want to confirm that it works well before this commit?
> 
> Yes. I don't think that commit is the culprit though, it only exposes
> another underlying bug.

If you have confirmed it works well without this commit, I think this is a good clue.
Comment 8 Ping-Ke Shih 2021-11-26 05:41:38 UTC
For comment #6, please also try mdelay() instead of udelay() to wait longer.
Comment 9 Kai-Heng Feng 2021-11-26 05:42:57 UTC
(In reply to Ping-Ke Shih from comment #6)

> I wonder if compiler optimize the reading operation in the loop:
> 
>       do {
>               total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size);
>       } while (total_pkt_size != rtwpci->rx_tag && ({udelay(1); try--;}));
> 
> So, please add 'volatile' to the access pointer:
> 
>       struct rtw_pci_rx_buffer_desc volatile *buf_desc;

I already tried mb() before reading and it doesn't work. But yes I can still try volatile.

The only workaround I found is to disable link PS while doing RX.
Comment 10 Ping-Ke Shih 2021-11-26 05:52:45 UTC
> 
> The only workaround I found is to disable link PS while doing RX.

Could you share this change? Then, I can discuss with my colleagues internally.
Comment 11 Kai-Heng Feng 2021-11-26 15:24:22 UTC
Created attachment 299735 [details]
Workaround patch
Comment 12 Ping-Ke Shih 2021-12-01 01:09:31 UTC
(In reply to Kai-Heng Feng from comment #5)
> (In reply to Ping-Ke Shih from comment #3)
> > (In reply to Kai-Heng Feng from comment #0)
> > > After 9e2fd29864c5 ("rtw88: add napi support"), using a system with
> 8821ce
> > > can freeze the system.
> > > 
> > 
> > Just want to confirm that it works well before this commit?
> 
> Yes. I don't think that commit is the culprit though, it only exposes
> another underlying bug.

In our site, the result like yours, and we don't think that commit is cause too.
We will continue to dig the root cause.
Comment 13 Kai-Heng Feng 2021-12-09 01:26:23 UTC
(In reply to Ping-Ke Shih from comment #12)
> (In reply to Kai-Heng Feng from comment #5)
> > (In reply to Ping-Ke Shih from comment #3)
> > > (In reply to Kai-Heng Feng from comment #0)
> > > > After 9e2fd29864c5 ("rtw88: add napi support"), using a system with
> > 8821ce
> > > > can freeze the system.
> > > > 
> > > 
> > > Just want to confirm that it works well before this commit?
> > 
> > Yes. I don't think that commit is the culprit though, it only exposes
> > another underlying bug.
> 
> In our site, the result like yours, and we don't think that commit is cause
> too.
> We will continue to dig the root cause.

May I know that's the progress on this issue? Otherwise I'll propose the short term workaround to the mailing list.
Comment 14 Ping-Ke Shih 2021-12-09 02:20:03 UTC
(In reply to Kai-Heng Feng from comment #13)
> 
> May I know that's the progress on this issue? Otherwise I'll propose the
> short term workaround to the mailing list.

We test your workaround, and it also work in our side, so I think we can take your workaround.
Please add a quirk to this workaround in certain platform.

I suppose you are clear how to add a quirk, but I still give you some information.
The existing quirks of rtw88 is rtw88_pci_quirks in pci.c
You can run dmidecode in the platform to get DMI_SYS_VENDOR and DMI_PRODUCT_NAME to fill the table.
Comment 15 Ping-Ke Shih 2021-12-09 02:40:57 UTC
I forget one thing.
Please check ref_cnt rtw_pci_link_ps() of pci.c; instead of rtw_hci_link_ps() of hci.h.
Because this is PCI specific issue.
Comment 16 Kai-Heng Feng 2021-12-09 03:27:25 UTC
There are numerous similar bug reports on 8821CE, so can we apply the workaround to 8821CE unconditionally?
Comment 17 Ping-Ke Shih 2021-12-09 03:35:53 UTC
Have you confirmed this can fix the issue on other platforms?

More, as I know, an AMD platform doesn't have this issue, right?

If power consumption is an important thing, it would be better to enumerate the platforms with this issue in quirk.
Comment 18 Jian-Hong Pan 2021-12-13 07:52:42 UTC
*** Bug 215239 has been marked as a duplicate of this bug. ***