Bug 204385 - Failed to resume from S3 with thunderbolt daisy chain
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
Reported: 2019-07-31 02:58 UTC by AceLan Kao
Modified: 2019-12-10 03:13 UTC

See Also:
Kernel Version: 5.3.0-rc2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg.log (134.69 KB, text/plain)
2019-07-31 02:58 UTC, AceLan Kao
sudo lspci -vv (100.97 KB, text/plain)
2019-08-01 02:13 UTC, AceLan Kao
The message after resumed (3.08 MB, image/jpeg)
2019-08-08 01:48 UTC, AceLan Kao
Screenshot when failed (3.95 MB, image/jpeg)
2019-08-23 03:07 UTC, AceLan Kao
attachment-24680-0.html (3.37 KB, text/html)
2019-08-31 13:13 UTC, Prestige Wang
Unbind 0000:00:17.0, 0000:06:00.0, and 0000:09:00.0 (93.87 KB, text/plain)
2019-09-05 08:34 UTC, AceLan Kao

Description AceLan Kao 2019-07-31 02:58:33 UTC
Created attachment 284047
dmesg.log

The system fails to resume from S3 when a Thunderbolt (TBT) storage device and a TBT monitor are daisy-chained on the Thunderbolt port.

Dell Precision 5540 --> Dell WD19TB Thunderbolt Dock --> ASUS-Display PA27AC --> HP P800 Thunderbolt Storage

Reproduce steps:
1. Connect the Dell WD19TB Thunderbolt Dock to the system (Thunderbolt port).
2. Connect the ASUS-Display PA27AC (Thunderbolt port) to the WD19TB (Thunderbolt port).
3. Connect the HP P800 Thunderbolt storage to the PA27AC display (Thunderbolt port).
4. Mount and access files on the HP P800.
5. Enter S3 (echo deep | sudo tee /sys/power/mem_sleep && systemctl suspend) and resume.
6. The system hangs.
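
Step 5 can be wrapped in a small script. This is only a sketch, assuming a systemd-based system where /sys/power/mem_sleep lists the available suspend modes. The function names supports_deep and suspend_s3 are hypothetical helpers introduced here for illustration; the commands inside suspend_s3 are the ones from step 5.

```shell
# Hypothetical helper: succeed if FILE (default /sys/power/mem_sleep)
# lists "deep" (S3) as an available suspend mode, selected or not.
supports_deep() {
    file="${1:-/sys/power/mem_sleep}"
    grep -qw 'deep' "$file"
}

# Hypothetical helper: select S3 and suspend.
# Nothing is written or suspended until this function is called.
suspend_s3() {
    if ! supports_deep; then
        echo "deep (S3) is not offered on this system" >&2
        return 1
    fi
    echo deep | sudo tee /sys/power/mem_sleep >/dev/null && systemctl suspend
}
```

Calling suspend_s3 after mounting the storage reproduces step 5; the check up front avoids suspending into s2idle by accident on firmware that does not offer S3.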

Resume works if the laptop is connected to:
1. the HP P800 directly
2. the ASUS monitor directly
3. the ASUS monitor -> HP P800
4. the WD19TB -> HP P800
5. the WD19TB -> ASUS monitor
Comment 1 AceLan Kao 2019-08-01 02:13:56 UTC
Created attachment 284065
sudo lspci -vv
Comment 2 Mika Westerberg 2019-08-06 09:53:55 UTC
Can you add "no_console_suspend" to the kernel command line and reproduce? It should log a bit more to the console during suspend/resume. Of course, this assumes that you can actually see anything on the console.
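
Before reproducing, it may be worth confirming that the option actually reached the running kernel. A minimal sketch, assuming the booted command line is readable from /proc/cmdline; has_cmdline_flag is a hypothetical helper name, and how the option gets added in the first place depends on the bootloader and is not shown.

```shell
# Hypothetical helper: succeed if FLAG appears as a whole word in the
# kernel command line (FILE defaults to /proc/cmdline).
has_cmdline_flag() {
    flag="$1"
    file="${2:-/proc/cmdline}"
    # One parameter per line, then require an exact line match.
    tr ' ' '\n' < "$file" | grep -qx -- "$flag"
}
```

For example, `has_cmdline_flag no_console_suspend && echo present` prints "present" only when the option is active for the current boot.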
Comment 3 AceLan Kao 2019-08-08 01:48:38 UTC
Created attachment 284263
The message after resumed

No further messages appear in dmesg after the hang, but some messages are shown on the screen while the system is trying to resume.
Comment 4 Mika Westerberg 2019-08-09 08:44:53 UTC
It looks like the hardware gets hot removed upon resume for some reason. Since it is using native PCIe hotplug I wonder if the following patch helps to solve the hang part?

https://lkml.org/lkml/2019/6/18/489

Can you give it a try?
Comment 5 AceLan Kao 2019-08-14 03:23:28 UTC
Hi Mika,

The patch series doesn't help; I got the same messages on the screen.
I also tried unplugging the Thunderbolt cable from the laptop while it was suspended (S3), and it still hangs after waking up.
Comment 6 Mika Westerberg 2019-08-14 09:29:26 UTC
Is that system by default using s2idle (check contents of /sys/power/mem_sleep before you write there)? And does it work if you only enter s2idle?
Comment 7 AceLan Kao 2019-08-15 02:21:48 UTC
It uses s2idle by default; I have to echo deep > /sys/power/mem_sleep every time after boot.

The storage works under s2idle, but I don't think it's a BIOS issue. The storage also works after S3 if we remove one of the tbt devices between laptop and storage.
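
The mem_sleep check discussed above can be scripted. A minimal sketch, assuming the usual format of /sys/power/mem_sleep, in which the currently selected mode is the entry in square brackets (for example "s2idle [deep]"); active_sleep_mode is a hypothetical helper name.

```shell
# Hypothetical helper: print the currently selected suspend mode, i.e. the
# entry shown in [brackets] in FILE (default /sys/power/mem_sleep).
active_sleep_mode() {
    file="${1:-/sys/power/mem_sleep}"
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}
```

Running active_sleep_mode right after boot and again after writing "deep" shows whether the selection took effect.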
Comment 8 Mika Westerberg 2019-08-15 08:27:17 UTC
AFAIK the BIOS is responsible for making the device chain available after S3. Do you have any other device that you could connect in place of the storage (so that it is the 3rd device in the chain)?

Also, can you try to unbind the driver from the storage device's host controller before suspend and see if it survives?
Comment 9 AceLan Kao 2019-08-23 03:07:20 UTC
Created attachment 284567
Screenshot when failed

I got more logs.

The behavior is somewhat different this time, with the same machine and the same kernel.
1. S3 without any Thunderbolt devices connected: it works
2. S3 with only the TBT docking station connected: it works
3. S3 with TBT dock <-> TBT monitor: it hangs with the oops message on the screen
4. S3 with TBT dock <-> TBT monitor <-> TBT storage: same as above
Comment 10 Mika Westerberg 2019-08-23 07:09:16 UTC
Is the oops the same in both cases where it hangs?

Added Mathias because this happens in xhci_endpoint_reset().

Also do you have any USB devices connected to those TBT devices when you do the experiment and if yes, can you try to disconnect them first?
Comment 11 AceLan Kao 2019-08-26 02:44:29 UTC
When resuming, it dumps those messages and hangs.
There are no USB devices connected to the laptop, the dock, or the TBT monitor.
All of those devices are connected through TBT ports.
Comment 12 Mika Westerberg 2019-08-26 10:28:32 UTC
Can you unbind xHCI driver from all those devices behind TBT and see if it works better wrt. resume?
Comment 13 AceLan Kao 2019-08-27 02:38:30 UTC
This time I got the same error messages as in comment #3, so I unbound 0000:06:00.0 and 0000:09:00.0 from xhci_hcd (under /sys/bus/pci/drivers/xhci_hcd). After resume I then got a blank screen with a cursor blinking in the top-left corner.

So the xhci driver does not appear to be the cause: I unbound all USB devices, and it still hangs.
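
For reference, the unbind described above can be sketched as a small function. It relies on the standard sysfs layout, where each PCI device directory has a driver link with an unbind file; unbind_pci is a hypothetical helper name, and the optional second argument exists only so the function can be tried against a copy of the tree instead of the real /sys.

```shell
# Hypothetical helper: unbind whatever driver is bound to PCI device DEV
# (e.g. 0000:06:00.0). ROOT defaults to /sys. Writing to the real sysfs
# requires root privileges.
unbind_pci() {
    dev="$1"
    root="${2:-/sys}"
    drv="$root/bus/pci/devices/$dev/driver"
    if [ ! -e "$drv" ]; then
        echo "$dev: no driver bound" >&2
        return 1
    fi
    # The kernel expects the device address written to the driver's unbind file.
    printf '%s\n' "$dev" > "$drv/unbind"
}
```

From a root shell, `unbind_pci 0000:06:00.0` followed by `unbind_pci 0000:09:00.0` would cover the two xHCI controllers named in this comment.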
Comment 14 Mika Westerberg 2019-08-27 08:19:58 UTC
Can you also unbind ahci driver from the storage device behind TBT?
Comment 15 AceLan Kao 2019-08-29 02:55:02 UTC
Yes, it works.
Unbinding 0000:00:17.0, 0000:06:00.0, and 0000:09:00.0 fixes the issue. Unbinding only ahci (0000:00:17.0) doesn't work; all three devices have to be unbound.
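
To double-check which driver each of these addresses belongs to, the driver link in sysfs can be read. A minimal sketch, assuming the standard sysfs layout; pci_driver_of is a hypothetical helper name, and the optional second argument only allows testing against a copy of the tree.

```shell
# Hypothetical helper: print the name of the driver bound to PCI device DEV,
# or return non-zero if no driver is bound. ROOT defaults to /sys.
pci_driver_of() {
    dev="$1"
    root="${2:-/sys}"
    link="$root/bus/pci/devices/$dev/driver"
    [ -L "$link" ] || return 1
    # The link target ends in the driver name, e.g. .../drivers/ahci
    basename "$(readlink "$link")"
}
```

A loop such as `for d in 0000:00:17.0 0000:06:00.0 0000:09:00.0; do pci_driver_of "$d"; done` lists the drivers involved before anything is unbound.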
Comment 16 Mika Westerberg 2019-08-29 04:29:34 UTC
17.0 is the PCH SATA controller so that should not matter. What if you unbind only from 06:00.0 and 09:00.0?

Please attach full dmesg and output of 'sudo lspci -vv' after resume.
Comment 17 Prestige Wang 2019-08-29 09:21:01 UTC
I asked the ODM to do more isolation, with these results:

1. HDD -> (TBT) -> Monitor -> (TBT) -> System  >> Reproduced
2. HDD -> (TBT) -> Salomon Docking -> (TBT) -> System  >> Reproduced
3. HDD -> (TBT 15w) -> Monitor -> (TBT 90w) -> WD19TB -> (TBT) -> System  >> Reproduced
4. Monitor -> (TBT) -> Salomon Docking -> (TBT) -> System  >> Can’t reproduce
5. HDD -> (TBT) -> System  >> Can’t reproduce

So the issue appears to come from the TBT HDD in a daisy chain. If there is no HDD, or the HDD is connected to the system directly, the issue can’t be reproduced.
Comment 18 Mika Westerberg 2019-08-31 13:13:28 UTC
Thanks for sharing the results. I'm suspecting that the PCIe link to the AHCI controller is not up (or something like that) and that causes ahci driver to block the system resume. I would like to get dmesg when the ahci driver is unbound so that we can see what is happening during the suspend/resume cycle.
Comment 19 Prestige Wang 2019-08-31 13:13:42 UTC
Created attachment 284723
attachment-24680-0.html

Comment 20 AceLan Kao 2019-09-05 08:34:26 UTC
Created attachment 284851
Unbind 0000:00:17.0, 0000:06:00.0, and 0000:09:00.0

Unbinding only ahci doesn't work. Here is the dmesg with ahci (0000:00:17.0) and the two xhci_hcd devices (0000:06:00.0 and 0000:09:00.0) unbound.
Comment 21 Mika Westerberg 2019-09-05 09:14:06 UTC
Hmm, do you have CONFIG_PCI_DEBUG=y in your .config? I don't see much happening over suspend/resume cycle.
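
One way to answer the CONFIG_PCI_DEBUG question is to grep the kernel config. This is a sketch under an assumption: that the distribution installs the config as /boot/config-$(uname -r), which is common but not universal. For a self-built kernel, pass the path to its .config instead. config_enabled is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if OPTION is set to "y" in the kernel config
# FILE. The default path is an assumption about the distribution.
config_enabled() {
    option="$1"
    file="${2:-/boot/config-$(uname -r)}"
    grep -qx "${option}=y" "$file"
}
```

For example, `config_enabled CONFIG_PCI_DEBUG || echo "PCI debug is off"`.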

Also are you saying that you need to unbind them all as opposed to unbind only for the 06:00.0 and 09:00.0? Does the lspci look the same before and after suspend?
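
The before/after lspci comparison could be done roughly as follows. It uses the same `sudo lspci -vv` invocation as the earlier attachment; snapshot_lspci and compare_snapshots are hypothetical helper names, and the file names in the usage note are placeholders.

```shell
# Hypothetical helper: save one lspci snapshot to FILE.
# Call it once before suspend and once after resume.
snapshot_lspci() {
    sudo lspci -vv > "$1"
}

# Hypothetical helper: print a unified diff of two snapshots.
# Exit status 0 means the snapshots are identical.
compare_snapshots() {
    diff -u "$1" "$2"
}
```

A typical sequence would be `snapshot_lspci before.txt`, suspend and resume, `snapshot_lspci after.txt`, then `compare_snapshots before.txt after.txt`.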
Comment 22 Prestige Wang 2019-09-10 10:23:55 UTC
Hi Acelan,

Is it VP in the latest kernel? 
Or what is the newest kernel you have checked?
Comment 23 AceLan Kao 2019-12-10 03:13:29 UTC
This issue has been fixed by this commit
https://patchwork.ozlabs.org/patch/1145748/
