Created attachment 284047 [details]
The system can't resume when a TBT storage device and a TBT monitor are daisy-chained over the Thunderbolt port
Dell Precision 5540 --> Dell WD19TB Thunderbolt Dock --> ASUS-Display PA27AC --> HP P800 Thunderbolt Storage
1. Connect the Dell WD19TB Thunderbolt Dock to the system (Thunderbolt port)
2. Connect the ASUS-Display PA27AC (Thunderbolt port) to the WD19TB (Thunderbolt port)
3. Connect the HP P800 Thunderbolt storage to the PA27AC display (Thunderbolt port)
4. Mount and access files on the HP P800
5. Enter S3 (echo deep | sudo tee /sys/power/mem_sleep && systemctl suspend) and resume
6. The system hangs
It works well if the laptop
1. connects to the HP P800 directly
2. or connects to the ASUS monitor directly
3. or connects to the ASUS monitor -> HP P800
4. or connects to the WD19TB -> HP P800
5. or connects to the WD19TB -> ASUS monitor
Created attachment 284065 [details]
sudo lspci -vv
Can you add "no_console_suspend" to the kernel command line and reproduce? It should log a bit more to the console during suspend/resume. Of course this assumes that you can actually see anything on the console.
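How the parameter gets onto the command line depends on the bootloader (on GRUB-based distributions it is commonly appended to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub; that this applies to the reporter's setup is an assumption). After rebooting, a quick check that the parameter took effect might look like the sketch below. `has_kernel_param` is a hypothetical helper name, and its optional second argument exists only so it can be tried against a saved copy of /proc/cmdline.

```shell
#!/bin/sh
# Hypothetical helper: succeed if PARAM appears as a whole word on the
# kernel command line. The file argument defaults to /proc/cmdline.
has_kernel_param() {
    param="$1"
    file="${2:-/proc/cmdline}"
    grep -qwF -- "$param" "$file"
}

# Example (on the machine under test):
#   has_kernel_param no_console_suspend && echo "no_console_suspend is active"
```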
Created attachment 284263 [details]
The messages after resume
No further messages are shown in dmesg after it hangs, but something is shown on the screen while the system is trying to resume.
It looks like the hardware gets hot removed upon resume for some reason. Since it is using native PCIe hotplug I wonder if the following patch helps to solve the hang part?
Can you give it a try?
The patch series doesn't help; I got the same messages on the screen.
I also tried removing the Thunderbolt cable from the laptop while it was suspended (S3), and it still hangs after waking up.
Is that system by default using s2idle (check contents of /sys/power/mem_sleep before you write there)? And does it work if you only enter s2idle?
It uses s2idle by default; I have to echo deep > /sys/power/mem_sleep after every boot.
The storage works under s2idle, but I don't think it's a BIOS issue. The storage also works after S3 if we remove one of the tbt devices between laptop and storage.
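For reference, /sys/power/mem_sleep lists the available modes and marks the selected one with square brackets (for example "[s2idle] deep"). A minimal sketch for reading the selected mode follows; `active_mem_sleep` is a hypothetical helper name, and the optional file argument is there only so it can be tried on a copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: print the suspend mode the kernel has selected,
# i.e. the entry shown in [brackets] in /sys/power/mem_sleep.
active_mem_sleep() {
    file="${1:-/sys/power/mem_sleep}"
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}

# Example (on the machine under test):
#   active_mem_sleep      # prints s2idle or deep
```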
AFAIK the BIOS is responsible for making the device chain available after S3. Do you have any other device that you could connect in place of the storage (so that it is the 3rd device in the chain)?
Also, can you try to unbind the driver from that storage device's host controller before suspend and see if it survives?
Created attachment 284567 [details]
Screenshot when failed
I got more logs.
The behavior differs between setups with the same machine and the same kernel.
1. S3 without any Thunderbolt devices connected: it works
2. S3 with only the TBT docking station connected: it works
3. S3 with TBT docking <-> TBT monitor: it hangs with the oops message on the screen
4. S3 with TBT docking <-> TBT monitor <-> TBT storage: same as above
In both cases where it hangs, the oops is the same?
Added Mathias because this happens in xhci_endpoint_reset().
Also do you have any USB devices connected to those TBT devices when you do the experiment and if yes, can you try to disconnect them first?
When resuming, it dumps those messages and hangs.
There are no USB devices connected to the laptop, the dock, or the TBT monitor.
All those devices are connected through TBT ports.
Can you unbind xHCI driver from all those devices behind TBT and see if it works better wrt. resume?
I got the same error messages as in comment #3 this time, so I unbound 0000:06:00.0 and 0000:09:00.0 under /sys/bus/pci/drivers/xhci_hcd, and after resume I got a blank screen with a cursor blinking in the top-left corner.
It looks like the xhci driver is not what causes the issue: I've tried unbinding all USB devices, but it still hangs.
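The unbind step discussed above can be sketched as follows. Each PCI device in sysfs has a `driver` link pointing at the driver it is bound to, and that driver directory has an `unbind` file that accepts the device address. `pci_unbind` is a hypothetical helper name; it must run as root, and its second argument (an alternative sysfs root) exists only for dry runs.

```shell
#!/bin/sh
# Hypothetical helper: unbind a PCI device (e.g. 0000:06:00.0) from
# whatever driver currently holds it. Run as root on real hardware.
pci_unbind() {
    bdf="$1"
    sysfs_pci="${2:-/sys/bus/pci}"
    drv="$sysfs_pci/devices/$bdf/driver"
    if [ ! -e "$drv/unbind" ]; then
        echo "$bdf: no driver bound" >&2
        return 1
    fi
    # Writing the address to the driver's unbind file detaches the device.
    echo "$bdf" > "$drv/unbind"
}

# Example (as root, before suspending):
#   pci_unbind 0000:06:00.0
#   pci_unbind 0000:09:00.0
```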
Can you also unbind ahci driver from the storage device behind TBT?
Yes, it works.
Unbinding 0000:00:17.0, 0000:06:00.0, and 0000:09:00.0 fixes the issue. Unbinding only ahci (0000:00:17.0) doesn't work; I have to unbind all 3 devices.
17.0 is the PCH SATA controller so that should not matter. What if you unbind only from 06:00.0 and 09:00.0?
Please attach full dmesg and output of 'sudo lspci -vv' after resume.
I had the ODM do more isolation, and these are the results:
1. HDD -> (TBT) -> Monitor -> (TBT) -> System >> Reproduced
2. HDD -> (TBT) -> Salomon Docking -> (TBT) -> System >> Reproduced
3. HDD -> (TBT 15w) -> Monitor -> (TBT 90w) -> WD19TB -> (TBT) -> System >> Reproduced
4. Monitor -> (TBT) -> Salomon Docking -> (TBT) -> System >> Can’t reproduce
5. HDD -> (TBT) -> System >> Can’t reproduce
So the issue appears to come from the TBT HDD when it is daisy-chained. If there is no HDD, or the HDD is connected to the system directly, the issue can’t be reproduced.
Thanks for sharing the results. I'm suspecting that the PCIe link to the AHCI controller is not up (or something like that) and that causes ahci driver to block the system resume. I would like to get dmesg when the ahci driver is unbound so that we can see what is happening during the suspend/resume cycle.
Created attachment 284723 [details]
Created attachment 284851 [details]
Unbind 0000:00:17.0, 0000:06:00.0, and 0000:09:00.0
Unbinding only ahci doesn't work. Here is the dmesg with ahci (0000:00:17.0) and the 2 xhci_hcd devices (0000:06:00.0 and 0000:09:00.0) unbound.
Hmm, do you have CONFIG_PCI_DEBUG=y in your .config? I don't see much happening over suspend/resume cycle.
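One way to answer this on a distribution kernel is to look at the installed config file. The sketch below assumes the config is available as /boot/config-$(uname -r), which many distributions provide but which is not guaranteed; `kconfig_enabled` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if OPTION is built in (=y) in the given
# kernel config. The default path is an assumption about the distribution.
kconfig_enabled() {
    option="$1"
    config="${2:-/boot/config-$(uname -r)}"
    grep -qx -- "${option}=y" "$config"
}

# Example (on the machine under test):
#   kconfig_enabled CONFIG_PCI_DEBUG && echo "PCI debug messages are enabled"
```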
Also, are you saying that you need to unbind all of them, as opposed to unbinding only 06:00.0 and 09:00.0? Does the lspci output look the same before and after suspend?
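The before/after comparison asked for here could be done along these lines: save `lspci -vv` before suspending and again after resume, then diff the two files. `compare_pci_snapshots` is a hypothetical helper name, and the file names in the example are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved `lspci -vv` dumps.
# Prints "unchanged" when they match, otherwise a unified diff.
compare_pci_snapshots() {
    if diff -u "$1" "$2"; then
        echo "unchanged"
    fi
}

# Example (as root, on the machine under test):
#   lspci -vv > lspci-before.txt
#   systemctl suspend              # wait for the system to resume
#   lspci -vv > lspci-after.txt
#   compare_pci_snapshots lspci-before.txt lspci-after.txt
```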
Is it VP in the latest kernel?
Or what is the newest kernel you have checked?
This issue has been fixed by this commit