When I resume my x86_64 system after "echo mem >/sys/power/state", USB 3 storage devices that there present and in use before the suspend disconnect directly after the resume, leaving any device mappers or mounts in a defect state (yielding only I/O errors). At the same time, the USB 3 storage devices are accesible through a new block device that has been created by the kernel after the resume. Relevant excerpts from the dlog - before the suspend: xhci_hcd 0000:03:00.0: xHCI Host Controller xhci_hcd 0000:03:00.0: new USB bus registered, assigned bus number 8 xhci_hcd 0000:03:00.0: irq 17, io mem 0xf7ffe000 xhci_hcd 0000:03:00.0: irq 43 for MSI/MSI-X xhci_hcd 0000:03:00.0: irq 44 for MSI/MSI-X xhci_hcd 0000:03:00.0: irq 45 for MSI/MSI-X xhci_hcd 0000:03:00.0: irq 46 for MSI/MSI-X xhci_hcd 0000:03:00.0: irq 47 for MSI/MSI-X xHCI xhci_add_endpoint called for root hub xHCI xhci_check_bandwidth called for root hub hub 8-0:1.0: USB hub found hub 8-0:1.0: 4 ports detected xhci_hcd 0000:03:00.0: xHCI Host Controller xhci_hcd 0000:03:00.0: new USB bus registered, assigned bus number 9 xHCI xhci_add_endpoint called for root hub xHCI xhci_check_bandwidth called for root hub hub 9-0:1.0: USB hub found hub 9-0:1.0: 4 ports detected ... usb 9-2: new SuperSpeed USB device number 2 using xhci_hcd usb 9-2: Parent hub missing LPM exit latency info. Power management will be impacted. ... Initializing USB Mass Storage driver... scsi8 : usb-storage 9-2:1.0 usbcore: registered new interface driver usb-storage USB Mass Storage support registered. scsi 8:0:0:0: Direct-Access TOSHIBA External USB 3.0 0001 PQ: 0 ANSI: 6 sd 8:0:0:0: Attached scsi generic sg6 type 0 sd 8:0:0:0: [sde] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB) sd 8:0:0:0: [sde] Write Protect is off sd 8:0:0:0: [sde] Mode Sense: 33 00 00 08 sd 8:0:0:0: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA sde: sde1 sd 8:0:0:0: [sde] Attached SCSI disk ---- At this point, the USB 3 device runs just fine. Now relevant excerpts just before and just after the suspend-to-RAM: PM: Syncing filesystems ... done. Freezing user space processes ... (elapsed 0.01 seconds) done. Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done. Suspending console(s) (use no_console_suspend to debug) sd 6:0:0:0: [sdd] Synchronizing SCSI cache xhci_hcd 0000:03:00.0: WARN Event TRB for slot 1 ep 3 with no TDs queued? sd 5:0:0:0: [sdc] Synchronizing SCSI cache sd 2:0:0:0: [sdb] Synchronizing SCSI cache sd 5:0:0:0: [sdc] Stopping disk xhci_hcd 0000:03:00.0: WARN Event TRB for slot 1 ep 2 with no TDs queued? sd 0:0:0:0: [sda] Synchronizing SCSI cache sd 2:0:0:0: [sdb] Stopping disk sd 0:0:0:0: [sda] Stopping disk xhci_hcd 0000:03:00.0: WARN Event TRB for slot 1 ep 0 with no TDs queued? sd 6:0:0:0: [sdd] Stopping disk ... PM: suspend of devices complete after 907.568 msecs PM: late suspend of devices complete after 0.331 msecs ... PM: noirq suspend of devices complete after 44.728 msecs ACPI: Preparing to enter system sleep state S3 PM: Saving platform NVS memory Disabling non-boot CPUs ... --- here the system sleeps. Notice how [sde] was not mentioned regarding "Stopping disk" ACPI: Low-level resume complete ... PM: noirq resume of devices complete after 66.067 msecs PM: early resume of devices complete after 0.079 msecs usb usb8: root hub lost power or was reset usb usb9: root hub lost power or was reset ... sd 2:0:0:0: [sdb] Starting disk ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata1.00: configured for UDMA/133 sd 0:0:0:0: [sda] Starting disk PM: resume of devices complete after 7718.648 msecs usb 9-2: USB disconnect, device number 2 Restarting tasks ... done. xHCI xhci_drop_endpoint called with unaddressed device xHCI xhci_drop_endpoint called with unaddressed device xHCI xhci_check_bandwidth called with unaddressed device xHCI xhci_free_dev called with unaddressed device usb 9-2: new SuperSpeed USB device number 3 using xhci_hcd usb 9-2: Parent hub missing LPM exit latency info. Power management will be impacted. scsi9 : usb-storage 9-2:1.0 scsi 9:0:0:0: Direct-Access TOSHIBA External USB 3.0 0001 PQ: 0 ANSI: 6 sd 9:0:0:0: Attached scsi generic sg6 type 0 sd 9:0:0:0: [sdf] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB) sd 9:0:0:0: [sdf] Write Protect is off sd 9:0:0:0: [sdf] Mode Sense: 33 00 00 08 sd 9:0:0:0: [sdf] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA sdf: sdf1 sd 9:0:0:0: [sdf] Attached SCSI disk So as you can see, [sde] died, and was reborn as [sdf], but of course device mapper and mounts won't know about this and try to access a failed [sde]. I find especially interesting: PM: resume of devices complete after 7718.648 msecs usb 9-2: USB disconnect, device number 2 This seems to indicate that the kernel disconnects the USB 3 device _after_ it assumes to have resumed all devices. This symptom is 100% reproduceable for me, on both an Asus UX32VD Ultrabook (using an Intel xhci controller) and on a Phenom XII desktop computer using a Buffalo IFC-PCIE4U3S controller (that in turn uses the Renesas µPD720201 chip). I have read through a lot of seeminly similar symptoms people describe, but in the end they are all different - most report that systems won't go to sleep or the controller won't work after resume, but both is not the case for me, it's just the discontinuity of the device being re-attached under a different device name that makes things ugly. Also, I have read proposals for work-arounds which essentially say to unload xhci_hcd before suspending, but that obviously would also not allow to keep any filesystem mounted on such a device for use after resume. Any ideas?
I just verified that the same bug exists in linux-3.8-rc6. (I had read about recent xhci power management related changes, but those seem to address other issued.)
Please recompile the latest 3.8 kernel with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and post the full dmesg. I'm wondering in particular if your host controller fails to restore its registers on S3 resume. If it does, we have to tell the USB core that the host controller lost power, disconnect all devices, and re-enumerate them. Software can't do anything about it if the host loses power across S3.
I recompiled & installed the latest 3.8 kernel with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING on - which happenes to write a ridiculous amount of messages to dmesg - but since you asked for the full dmesg outpupt, I can only hope the buffer was large enough to capture all the relevant lines. Find the full dmesg output attached. I should have mentioned before that this is a regression in comparison to linux-3.6.6: Using Linux-3.6.6, the external Toshiba drive is powered down during S3 sleep (the LED on the external drive goes off and the disk spins down). After S3 sleep, the drive is readily available under its existing device, existing mount etc. Using Linux-3.8, the external drive stays powered up during S3 sleep (LED stays lit and disk remains spinning), but after S3 sleep, the external drive gets a new device assigned, and existing mounts or device mappers are broken.
Created attachment 93791 [details] dmesg output under linux-3.8 with debugging options enabled
Comment on attachment 93791 [details] dmesg output under linux-3.8 with debugging options enabled This dmesg output was created while using the following xhci controller: Intel Corporation Panther Point USB xHCI Host Controller (rev 04)
if you are still wondering, see this: http://marc.info/?l=linux-usb&m=137714769606183&w=2 Why does no one update bugs these days?
devsk: That patch was an RFC that was not merged into Linus' tree because my analysis of the situation was wrong. The real fix was merged in commit 8b3d45705e54075cfb9d4212dbca9ea82c85c4b8 "usb: Fix xHCI host issues on remote wakeup." https://git.kernel.org/cgit/linux/kernel/git/sarah/xhci.git/commit/?id=8b3d45705e54075cfb9d4212dbca9ea82c85c4b8 Further, that patch is not relevant to this bug report because mass storage devices don't signal remote wakup.
Sarah: you mean this bug is not fixed even after the commit 8b3d45705e54075cfb9d4212dbca9ea82c85c4b8? In that case, do we have a handle on this bug?
After a hint from Linus on the subsurface mailing list I retried with a recent 3.12 kernel and - at least for the Intel xhci controller - the block device now seems to survive the resume from S3. I intended to check on two other machines (with different xhci controllers from NEC / Renesas) if resuming works there, as well, but this is currently made impossible by the nouveau driver crashing the kernel upon resume...
Lutz: do you happen to have a link for that conversation handy?
I ran into the same disconnect and reappearance as a different device ID with 3.12.4 after 4 suspend-to-RAM cycles. So, the issue is definitely not fixed in 3.12.4. Lutz: please update us on the Linus conversation because I am very interested in getting this bug fixed. Thanks!
@devsk: There was no more than this hint: http://article.gmane.org/gmane.sports.diving.subsurface/49
Ok. I don't know what 3.12 version he was talking about. Can you please ask on the list which git commit he was talking about? Because it looked like he was talking about current git.
@devsk: I don't like to write too much off-topic stuff to the subsurface mailing list, what I can tell for sure without asking is that in my above described experiment the xhci block device survived a resume with kernel 3.12.6 (as in: https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.12.6.tar.xz ).
I don't see any commit in 3.12.5 and 3.12.6 which is relevant to XHCI disconnects. So, I don't know. I will try 3.12.6 though!
So, updated to 3.12.6 and it is still the same issue.
@devsk: Please file a separate bug report, since Lutz reports that his bug is fixed by 3.12.6. Make sure to post dmesg with CONFIG_USB_DEBUG enabled. It's possible your bug is different from Lutz', so I'm closing this one.
Well, it looks like its fixed to him because it works some of the times. At best of the runs, I was able to suspend and resume 4 times in succession but then 5th try failed. This bug is so apparent on all my laptops and desktop machines that I find it hard to believe that it can't be reproduced by devs easily. All you have to do is to create an FS on a USB drive, plug it in and suspend. Resume will kill the FS and no IO will be possible to that FS.
I also found similar bug with my embedded USB host running 3.10 kernel and also with a desktop PC running 3.13.6-100.fc19. I have posted a RFC which resolves issue at my end. Please review the patch and test it at your end if possible. Patch is here: http://marc.info/?l=linux-usb&m=140542120320644&w=2
Thanks Pratyush! Looks like this is it. I have the patch applied to 3.15.5 and it has correctly resumed 7 times, the longest streak of success so far. I will update if I see the fail again as I continue to suspend and resume throughout the day to test it. Note that I changed the patch to do 50 retries instead of 20, because if it comes out before retries finish, we break the loop anyway. I am not entirely convinced that those timeouts can be so tightly controlled and sum upto 398ms and then getting approximated to 400ms is enough. What's the harm in going with 1second instead of 400ms? I definitely don't mind my laptop waking up 600ms later in the worst case as long as I avoid corrupt filesystem. Are there other side effects of retries > 20?
Yes, I gave a second thought to timeout calculation and I agree that this can not be a way to calculate such timeout. I have been reported that there are some devices which takes even 2S for link train. I got one sony usb stick (054C:05B8) which does not take 2S but certainly it takes more than 400 ms. I took the analyzer log with this device which tells me that device switch on the termination after long delay of host enabling the VBUS.With another usb disk I observed that device fails to negotiate link training in first attempt. However, succesfull link training happens in second attempt. So the question remains, what should be the timeout? I think we should go with 2S timeout to support all the buggy devices reported so far. For the good device overhead is almost none. While for the bad devices penalty could be the time which it take for link training. Only side effect of long timeout: If a device was connected before suspend, and was removed while system was asleep, then the penalty would be the timeout ie 2000 ms. I think we should go with wait loop as follows: static int wait_for_ss_port_enable(struct usb_device *udev, struct usb_hub *hub, int *port1, u16 *portchange, u16 *portstatus) { int status = 0, delay_ms = 0; while (delay_ms < 2000) { if (status || *portstatus & USB_PORT_STAT_CONNECTION) break; msleep(20); delay_ms += 20; status = hub_port_status(hub, *port1, portstatus, portchange); } return status; }
The patch with 1000ms wait is working great for me. 13 successful resumes without device renaming so far. My overall resume time is between 2-3 seconds, which is perfect for me. I don't believe in sub-second resume time, I have plenty of patience...:) 2-3 seconds of resume time and it helps with reducing writes to the USB drive when I hibernate to my swap file. What else could a geek ask for!
We should push this fix in stable releases. FS corruption is a severe bug in my opinion.
Still working after 21 suspend/resume cycles. Any plans of getting this merged in stable?
A V2 of patch was submitted with cc to stable mailing list and it was acked. Currently available in gregkh's usb-next branch.
Do you have a link to latest submit request?
http://www.spinics.net/lists/linux-usb/msg110762.html
Pratyush: can I ask you a favour? Would you be kind enough to update this bug when the patch does make it to a stable release? Thanks!
Patch has been applied to 3.10-stable, 3.14-stable and 3.16-stable.
Hi, This issue still exists in mainline kernel 4.8.0 also. The tried the below scenario, installed linux kernel in eMMC and playing a video file from USB3.0 kingston pendrive, after putting the system in to low power state (echo mem > /sys/power/state) and with keyboard trigger, video is not resumed back. there is FS corruption and IO error, Can any one help me find a fix for this issue?
On Mon, Nov 07, 2016 at 04:11:20PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: All USB bugs should be sent to the linux-usb@vger.kernel.org mailing list, and not entered into bugzilla. Please bring this issue up there, if it is still a problem in the latest kernel release.
I tried posting this issue to "linux-usb@vger.kernel.org" but it email bounced, is there any other way?
On Wed, Nov 09, 2016 at 02:28:31AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > I tried posting this issue to "linux-usb@vger.kernel.org" but it email > bounced, Turn off html email.