After 9e2fd29864c5 ("rtw88: add napi support"), using a system with 8821ce can freeze the system. Before the system is frozen, we can see lots of warning on "pci bus timeout, check dma status". When that happens, total_pkt_size starts over from 0 and becomes out of sync with rtwpci->rx_tag. rtw88_pci.disable_aspm=1 seems to workaround the issue, but other methods to prevent the CPU from reaching deeper power saving state can also workaround the issue. So the issue seems to depend on how fast CPU can read all data out of RX ring. Use CPU QoS to speed up CPU while doing RX reading can alleviate the problem, but the issue is not completely gone. disable_aspm=1 is not a viable workaround because it increases power consumption significantly.
Created attachment 299705 [details] dmesg with error flood
If we are focusing on "pci bus timeout, check dma status" instead of "system is frozen". We can try below patch to poll rtwpci->rx_tag to see if driver can process RX packets properly. But, I'm still not sure if these two things are related or not; we can have further discussion after you try the patch. Indeed, "disable_aspm=1" is a workaround. However, we don't have a platform on hand to analyze. Do you have contacted our FAE to help this? diff --git a/pci.c b/pci.c index 3b367c90..18ac5915 100644 --- a/pci.c +++ b/pci.c @@ -720,14 +720,19 @@ static void rtw_pci_dma_check(struct rtw_dev *rtwdev, struct rtw_pci_rx_buffer_desc *buf_desc; u32 desc_sz = chip->rx_buf_desc_sz; u16 total_pkt_size; + int try = 1000; buf_desc = (struct rtw_pci_rx_buffer_desc *)(rx_ring->r.head + idx * desc_sz); - total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size); - /* rx tag mismatch, throw a warning */ - if (total_pkt_size != rtwpci->rx_tag) - rtw_warn(rtwdev, "pci bus timeout, check dma status\n"); + do { + total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size); + } while (total_pkt_size != rtwpci->rx_tag && ({udelay(1); try--;})); + + if (try != 1000) { + /* rx tag mismatch, throw a warning */ + rtw_warn(rtwdev, "pci bus timeout, check dma status try=%d\n", try); + } rtwpci->rx_tag = (rtwpci->rx_tag + 1) % RX_TAG_MAX; }
(In reply to Kai-Heng Feng from comment #0) > After 9e2fd29864c5 ("rtw88: add napi support"), using a system with 8821ce > can freeze the system. > Just want to confirm that it works well before this commit?
(In reply to Ping-Ke Shih from comment #2) > If we are focusing on "pci bus timeout, check dma status" instead of "system > is frozen". We can try below patch to poll rtwpci->rx_tag to see if driver > can process RX packets properly. But, I'm still not sure if these two things > are related or not; we can have further discussion after you try the patch. No matter total_pkt_size starts over from 0, so no matter how long the wait is, rx_tag will continue to be different total_pkt_size. The only way to quench the warning is doing "rtwpci->rx_tag = total_pkt_size". > > Indeed, "disable_aspm=1" is a workaround. However, we don't have a platform > on hand to analyze. Do you have contacted our FAE to help this? Yes, but let me check if our FAE has contacted Realtek yet. > > > diff --git a/pci.c b/pci.c > index 3b367c90..18ac5915 100644 > --- a/pci.c > +++ b/pci.c > @@ -720,14 +720,19 @@ static void rtw_pci_dma_check(struct rtw_dev *rtwdev, > struct rtw_pci_rx_buffer_desc *buf_desc; > u32 desc_sz = chip->rx_buf_desc_sz; > u16 total_pkt_size; > + int try = 1000; > > buf_desc = (struct rtw_pci_rx_buffer_desc *)(rx_ring->r.head + > idx * desc_sz); > - total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size); > > - /* rx tag mismatch, throw a warning */ > - if (total_pkt_size != rtwpci->rx_tag) > - rtw_warn(rtwdev, "pci bus timeout, check dma status\n"); > + do { > + total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size); > + } while (total_pkt_size != rtwpci->rx_tag && ({udelay(1); try--;})); > + > + if (try != 1000) { > + /* rx tag mismatch, throw a warning */ > + rtw_warn(rtwdev, "pci bus timeout, check dma status > try=%d\n", try); > + } > > rtwpci->rx_tag = (rtwpci->rx_tag + 1) % RX_TAG_MAX; > }
(In reply to Ping-Ke Shih from comment #3) > (In reply to Kai-Heng Feng from comment #0) > > After 9e2fd29864c5 ("rtw88: add napi support"), using a system with 8821ce > > can freeze the system. > > > > Just want to confirm that it works well before this commit? Yes. I don't think that commit is the culprit though, it only exposes another underlying bug.
(In reply to Kai-Heng Feng from comment #4) > (In reply to Ping-Ke Shih from comment #2) > > If we are focusing on "pci bus timeout, check dma status" instead of > "system > > is frozen". We can try below patch to poll rtwpci->rx_tag to see if driver > > can process RX packets properly. But, I'm still not sure if these two > things > > are related or not; we can have further discussion after you try the patch. > > No matter total_pkt_size starts over from 0, so no matter how long the wait > is, rx_tag will continue to be different total_pkt_size. > The only way to quench the warning is doing "rtwpci->rx_tag = > total_pkt_size". > I wonder if compiler optimize the reading operation in the loop: do { total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size); } while (total_pkt_size != rtwpci->rx_tag && ({udelay(1); try--;})); So, please add 'volatile' to the access pointer: struct rtw_pci_rx_buffer_desc volatile *buf_desc;
(In reply to Kai-Heng Feng from comment #5) > (In reply to Ping-Ke Shih from comment #3) > > (In reply to Kai-Heng Feng from comment #0) > > > After 9e2fd29864c5 ("rtw88: add napi support"), using a system with > 8821ce > > > can freeze the system. > > > > > > > Just want to confirm that it works well before this commit? > > Yes. I don't think that commit is the culprit though, it only exposes > another underlying bug. If you have confirmed it works well without this commit, I think this is a good clue.
For comment #6, please also try mdelay() instead of udelay() to wait longer.
(In reply to Ping-Ke Shih from comment #6) > I wonder if compiler optimize the reading operation in the loop: > > do { > total_pkt_size = le16_to_cpu(buf_desc->total_pkt_size); > } while (total_pkt_size != rtwpci->rx_tag && ({udelay(1); try--;})); > > So, please add 'volatile' to the access pointer: > > struct rtw_pci_rx_buffer_desc volatile *buf_desc; I already tried mb() before reading and it doesn't work. But yes I can still try volatile. The only workaround I found is to disable link PS while doing RX.
> > The only workaround I found is to disable link PS while doing RX. Could you share this change? Then, I can discuss with my colleagues internally.
Created attachment 299735 [details] Workaround patch
(In reply to Kai-Heng Feng from comment #5) > (In reply to Ping-Ke Shih from comment #3) > > (In reply to Kai-Heng Feng from comment #0) > > > After 9e2fd29864c5 ("rtw88: add napi support"), using a system with > 8821ce > > > can freeze the system. > > > > > > > Just want to confirm that it works well before this commit? > > Yes. I don't think that commit is the culprit though, it only exposes > another underlying bug. In our site, the result like yours, and we don't think that commit is cause too. We will continue to dig the root cause.
(In reply to Ping-Ke Shih from comment #12) > (In reply to Kai-Heng Feng from comment #5) > > (In reply to Ping-Ke Shih from comment #3) > > > (In reply to Kai-Heng Feng from comment #0) > > > > After 9e2fd29864c5 ("rtw88: add napi support"), using a system with > > 8821ce > > > > can freeze the system. > > > > > > > > > > Just want to confirm that it works well before this commit? > > > > Yes. I don't think that commit is the culprit though, it only exposes > > another underlying bug. > > In our site, the result like yours, and we don't think that commit is cause > too. > We will continue to dig the root cause. May I know that's the progress on this issue? Otherwise I'll propose the short term workaround to the mailing list.
(In reply to Kai-Heng Feng from comment #13) > > May I know that's the progress on this issue? Otherwise I'll propose the > short term workaround to the mailing list. We test your workaround, and it also work in our side, so I think we can take your workaround. Please add a quirk to this workaround in certain platform. I suppose you are clear how to add a quirk, but I still give you some information. The existing quirks of rtw88 is rtw88_pci_quirks in pci.c You can run dmidecode in the platform to get DMI_SYS_VENDOR and DMI_PRODUCT_NAME to fill the table.
I forget one thing. Please check ref_cnt rtw_pci_link_ps() of pci.c; instead of rtw_hci_link_ps() of hci.h. Because this is PCI specific issue.
There are numerous similar bug reports on 8821CE, so can we apply the workaround to 8821CE unconditionally?
Have you confirmed this can fix the issue on other platforms? More, as I know, an AMD platform doesn't have this issue, right? If power consumption is an important thing, it would be better to enumerate the platforms with this issue in quirk.
*** Bug 215239 has been marked as a duplicate of this bug. ***