Bug 215433 - data connection lost on Qualcomm SDX55 PCIe device
Summary: data connection lost on Qualcomm SDX55 PCIe device
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-12-28 11:26 UTC by slark_xiao
Modified: 2022-01-19 10:26 UTC
CC List: 1 user

See Also:
Kernel Version: 5.13.0, 5.15.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
host dmesg and related log (238.00 KB, text/plain)
2021-12-30 02:45 UTC, slark_xiao
host dmesg with debug mask open (224.07 KB, text/plain)
2022-01-03 07:48 UTC, slark_xiao
Seems only have that same warning with previous (87.87 KB, text/plain)
2022-01-18 03:24 UTC, slark_xiao

Description slark_xiao 2021-12-28 11:26:02 UTC
Qualcomm-based modem devices with a PCIe interface, using the MHI driver on kernel 5.13.0, lose their data connection. Details below:

    Phenomenon:
        1) Cannot connect to the Internet
        2) /dev/wwan0p1QCDM, /dev/wwan0p2MBIM and /dev/wwan0p3AT do not respond

    Steps to reproduce
        1) Connect to the Internet via cellular
        2) Run a speed test (https://www.speedtest.net/) or download a 1 GB file (ipv4.download.thinkbroadband.com/1GB.zip)

    Recovery method
        Can only be recovered by rebooting the host

    Others
        It cannot be reproduced on Windows 10/11.
Comment 1 slark_xiao 2021-12-30 02:45:34 UTC
Created attachment 300187 [details]
host dmesg and related log

  The default debug mechanism does not report any error message.
  We tried adding debug messages to each related function, still with no findings.
  This is likely a common issue.
Comment 2 mani 2022-01-01 04:27:45 UTC
Can you please enable the MHI debug logs and share the dmesg?

diff --git a/drivers/bus/mhi/Makefile b/drivers/bus/mhi/Makefile
index 0a2d778d6fb4..ae4063866f73 100644
--- a/drivers/bus/mhi/Makefile
+++ b/drivers/bus/mhi/Makefile
@@ -1,3 +1,5 @@
+subdir-ccflags-y := -DDEBUG
+
 # core layer
 obj-y += core/


Also, did you see any failure log in the firmware console?
Comment 3 slark_xiao 2022-01-03 07:48:30 UTC
Created attachment 300210 [details]
host dmesg with debug mask open

Hi Mani,
  I just reproduced it with the new Makefile settings.
  As we can see, no warning or error can be found.
  
  On the firmware side, Qualcomm said the direct cause is that the IPA gets stuck, but they said it is the host (driver) that makes it stuck.
Comment 4 mani 2022-01-05 09:22:59 UTC
I can see one relevant warning in dmesg:

[   47.768439] mhi_net mhi0_IP_HW0_MBIM mhi_mbim0: Fragmented packets received, fix MTU?

This means that the IPA has split the packet into multiple pieces due to a low MTU/MRU on the host.

Can you try adding a default MRU in "pci_generic.c" and see if that fixes the issue?

diff --git a/drivers/bus/mhi/pci_generic.c b/drivers/bus/mhi/pci_generic.c
index 29607f7bc8da..c7257154dfd2 100644
--- a/drivers/bus/mhi/pci_generic.c
+++ b/drivers/bus/mhi/pci_generic.c
@@ -367,6 +367,7 @@ static const struct mhi_pci_dev_info mhi_foxconn_sdx55_info = {
        .bar_num = MHI_PCI_DEFAULT_BAR_NUM,
        .dma_data_width = 32,
        .sideband_wake = false,
+       .mru_default = 32768,
 };
 
 static const struct mhi_channel_config mhi_mv31_channels[] = {
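
For reference, here is roughly where mru_default ends up (a simplified sketch based on my reading of the 5.15-era sources, not verbatim code): mhi_pci_probe() copies it into the controller as mhi_cntrl->mru, and mhi_net then sizes its RX buffers from that value, falling back to the netdev MTU when it is zero:

/* Simplified sketch of the mhi_net RX refill path, not verbatim. */
static void mhi_net_rx_refill_work(struct work_struct *work)
{
	struct mhi_net_dev *mhi_netdev = container_of(work, struct mhi_net_dev,
						      rx_refill.work);
	struct net_device *ndev = mhi_netdev->ndev;
	/* The MRU bounds the largest packet a single RX buffer can hold */
	size_t size = mhi_netdev->mru ? mhi_netdev->mru : READ_ONCE(ndev->mtu);
	struct sk_buff *skb;

	while (!mhi_queue_is_full(mhi_netdev->mdev, DMA_FROM_DEVICE)) {
		skb = netdev_alloc_skb(ndev, size);
		if (unlikely(!skb))
			break;

		/* A buffer smaller than what the device sends forces the
		 * device to split packets across transfers, which the host
		 * then sees as -EOVERFLOW completions. */
		if (mhi_queue_skb(mhi_netdev->mdev, DMA_FROM_DEVICE, skb,
				  size, MHI_EOT)) {
			kfree_skb(skb);
			break;
		}
	}
}

The idea of the patch is to make the host-side RX buffers at least as large as the aggregated packets the device produces, so nothing has to be split.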
Comment 5 slark_xiao 2022-01-06 01:55:45 UTC
Hi Mani,
  I applied your patch on version 5.15.12 and reproduced this issue again.
  Actually, we added some debug prints in the function mhi_net_dl_callback(), and we can see that mhi_res->transaction_status would always be -EOVERFLOW in version 5.13 (with your patch). Just like the code comment says: "Since this is not optimal, print a warning (once)".
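
For reference, the path being described looks roughly like this (paraphrased from the 5.13-era drivers/net/mhi_net.c; helper names are approximate, not an exact quote):

/* Paraphrased sketch of the MHI download-completion handler. */
static void mhi_net_dl_callback(struct mhi_device *mhi_dev,
				struct mhi_result *mhi_res)
{
	struct mhi_net_dev *mhi_netdev = dev_get_drvdata(&mhi_dev->dev);
	struct sk_buff *skb = mhi_res->buf_addr;

	if (mhi_res->transaction_status == -EOVERFLOW) {
		/* The packet did not fit in one MHI buffer and was split over
		 * multiple transfers; re-aggregate it. That usually means the
		 * device-side MTU is larger than the host-side MTU/MRU. Since
		 * this is not optimal, print a warning (once).
		 */
		netdev_warn_once(mhi_netdev->ndev,
				 "Fragmented packets received, fix MTU?\n");
		skb_put(skb, mhi_res->bytes_xferd);
		mhi_net_skb_agg(mhi_netdev, skb); /* hold until final fragment */
		return;
	}
	/* ... normal completion: push the (re-assembled) skb up the stack */
}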
Comment 6 slark_xiao 2022-01-06 03:53:29 UTC
Correction to comment 5:
  would always be -EOVERFLOW in version 5.13 (with your patch).
=> would always be -EOVERFLOW in version 5.13 (without your patch).
Comment 7 mani 2022-01-10 15:35:57 UTC
> => would always be -EOVERFLOW in version 5.13 (without your patch).

Okay, so you didn't see the overflow error with my patch on 5.13, but the connection still drops?
Comment 8 slark_xiao 2022-01-12 08:06:39 UTC
Hi Mani,
  May I know why we should increase the MRU size to 32K? Is there any reference other than the commit https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/bus/mhi/pci_generic.c?id=5c2c85315948c42c6c0258cf9bad596acaa79043 ?
Comment 10 slark_xiao 2022-01-18 02:47:32 UTC
The issue is gone on our local side, but it is still reproducible in Europe.
Comment 11 slark_xiao 2022-01-18 03:24:21 UTC
Created attachment 300284 [details]
Seems only have that same warning with previous

Test log from the Europe side. It's much easier to reproduce there (about 20 s) than on our side (more than 13 hours) with the above patches.
Comment 12 slark_xiao 2022-01-18 03:25:39 UTC
(In reply to slark_xiao from comment #11)
> Created attachment 300284 [details]
> Seems only have that same warning with previous
> 
> Test log from the Europe side. It's much easier to reproduce there (about
> 20 s) than on our side (more than 13 hours) with the above patches.

on our side (more than 13 hours and the result still passed) with the above patches.
Comment 13 mani 2022-01-18 09:23:46 UTC
(In reply to slark_xiao from comment #12)

> on our side (more than 13 hours and the result still passed) with the above patches.

Does this mean the issue is fixed with the mru_default patch?
Comment 14 slark_xiao 2022-01-18 09:32:01 UTC
Hi Mani,
  In my opinion, the mru_default patch lowered the reproducibility on our side. For example, without any changes we could reproduce the issue within 3 hours (the average was about 1 hour on a simulated network). After applying the MRU patch, we could not reproduce it for over 13 hours (we thought we had fixed it, but actually had not).
  So I think the MRU patch is still helpful.
Comment 15 slark_xiao 2022-01-19 10:26:06 UTC
(In reply to slark_xiao from comment #10)
> The issue is gone on our local side, but it is still reproducible in Europe.

Good news! After checking, the previous patches do work in Europe. My summary in comment 10 can be ignored, because that result was caused by a driver file mismatch.

Today we connected to the Europe setup remotely for further checking and found that 'mistake'.
So the conclusion is that the patches work!

Sorry again for my previous wrong summary!
