Bug 53211

Summary: USB 3 storage device disconnects after S3 resume, "xhci_drop_endpoint called with unaddressed device"
Product: Drivers Reporter: Lutz Vieweg (lvml)
Component: USBAssignee: Greg Kroah-Hartman (greg)
Status: RESOLVED CODE_FIX    
Severity: normal CC: funtoos, mvkiran82, pratyush.anand, sarah
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.7.5, 3.8-rc6, 3.8.0 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg output under linux-3.8 with debugging options enabled

Description Lutz Vieweg 2013-01-30 00:37:39 UTC
When I resume my x86_64 system after "echo mem >/sys/power/state", USB 3 storage devices that there present and in use before the suspend disconnect directly after the resume, leaving any device mappers or mounts in a defect state (yielding only I/O errors). At the same time, the USB 3 storage devices are accesible through a new block device that has been created by the kernel after the resume.

Relevant excerpts from the dlog - before the suspend:

xhci_hcd 0000:03:00.0: xHCI Host Controller
xhci_hcd 0000:03:00.0: new USB bus registered, assigned bus number 8
xhci_hcd 0000:03:00.0: irq 17, io mem 0xf7ffe000
xhci_hcd 0000:03:00.0: irq 43 for MSI/MSI-X
xhci_hcd 0000:03:00.0: irq 44 for MSI/MSI-X
xhci_hcd 0000:03:00.0: irq 45 for MSI/MSI-X
xhci_hcd 0000:03:00.0: irq 46 for MSI/MSI-X
xhci_hcd 0000:03:00.0: irq 47 for MSI/MSI-X
xHCI xhci_add_endpoint called for root hub
xHCI xhci_check_bandwidth called for root hub
hub 8-0:1.0: USB hub found
hub 8-0:1.0: 4 ports detected
xhci_hcd 0000:03:00.0: xHCI Host Controller
xhci_hcd 0000:03:00.0: new USB bus registered, assigned bus number 9
xHCI xhci_add_endpoint called for root hub
xHCI xhci_check_bandwidth called for root hub
hub 9-0:1.0: USB hub found
hub 9-0:1.0: 4 ports detected
...
usb 9-2: new SuperSpeed USB device number 2 using xhci_hcd
usb 9-2: Parent hub missing LPM exit latency info.  Power management will be impacted.
...
Initializing USB Mass Storage driver...
scsi8 : usb-storage 9-2:1.0
usbcore: registered new interface driver usb-storage
USB Mass Storage support registered.
scsi 8:0:0:0: Direct-Access     TOSHIBA  External USB 3.0 0001 PQ: 0 ANSI: 6
sd 8:0:0:0: Attached scsi generic sg6 type 0
sd 8:0:0:0: [sde] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
sd 8:0:0:0: [sde] Write Protect is off
sd 8:0:0:0: [sde] Mode Sense: 33 00 00 08
sd 8:0:0:0: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sde: sde1
sd 8:0:0:0: [sde] Attached SCSI disk

----
At this point, the USB 3 device runs just fine.

Now relevant excerpts just before and just after the suspend-to-RAM:

PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.01 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
Suspending console(s) (use no_console_suspend to debug)
sd 6:0:0:0: [sdd] Synchronizing SCSI cache
xhci_hcd 0000:03:00.0: WARN Event TRB for slot 1 ep 3 with no TDs queued?
sd 5:0:0:0: [sdc] Synchronizing SCSI cache
sd 2:0:0:0: [sdb] Synchronizing SCSI cache
sd 5:0:0:0: [sdc] Stopping disk
xhci_hcd 0000:03:00.0: WARN Event TRB for slot 1 ep 2 with no TDs queued?
sd 0:0:0:0: [sda] Synchronizing SCSI cache
sd 2:0:0:0: [sdb] Stopping disk
sd 0:0:0:0: [sda] Stopping disk
xhci_hcd 0000:03:00.0: WARN Event TRB for slot 1 ep 0 with no TDs queued?
sd 6:0:0:0: [sdd] Stopping disk
...
PM: suspend of devices complete after 907.568 msecs
PM: late suspend of devices complete after 0.331 msecs
...
PM: noirq suspend of devices complete after 44.728 msecs
ACPI: Preparing to enter system sleep state S3
PM: Saving platform NVS memory
Disabling non-boot CPUs ...

--- here the system sleeps. Notice how [sde] was not mentioned regarding "Stopping disk"

ACPI: Low-level resume complete
...
PM: noirq resume of devices complete after 66.067 msecs
PM: early resume of devices complete after 0.079 msecs
usb usb8: root hub lost power or was reset
usb usb9: root hub lost power or was reset
...
sd 2:0:0:0: [sdb] Starting disk
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
sd 0:0:0:0: [sda] Starting disk
PM: resume of devices complete after 7718.648 msecs
usb 9-2: USB disconnect, device number 2
Restarting tasks ... done.
xHCI xhci_drop_endpoint called with unaddressed device
xHCI xhci_drop_endpoint called with unaddressed device
xHCI xhci_check_bandwidth called with unaddressed device
xHCI xhci_free_dev called with unaddressed device
usb 9-2: new SuperSpeed USB device number 3 using xhci_hcd
usb 9-2: Parent hub missing LPM exit latency info.  Power management will be impacted.
scsi9 : usb-storage 9-2:1.0
scsi 9:0:0:0: Direct-Access     TOSHIBA  External USB 3.0 0001 PQ: 0 ANSI: 6
sd 9:0:0:0: Attached scsi generic sg6 type 0
sd 9:0:0:0: [sdf] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
sd 9:0:0:0: [sdf] Write Protect is off
sd 9:0:0:0: [sdf] Mode Sense: 33 00 00 08
sd 9:0:0:0: [sdf] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sdf: sdf1
sd 9:0:0:0: [sdf] Attached SCSI disk

So as you can see, [sde] died, and was reborn as [sdf], but of course device mapper and mounts won't know about this and try to access a failed [sde].

I find especially interesting:
 PM: resume of devices complete after 7718.648 msecs
 usb 9-2: USB disconnect, device number 2

This seems to indicate that the kernel disconnects the USB 3 device _after_ it assumes to have resumed all devices.

This symptom is 100% reproduceable for me, on both an Asus UX32VD Ultrabook (using an Intel xhci controller) and on a Phenom XII desktop computer using a Buffalo IFC-PCIE4U3S controller (that in turn uses the Renesas µPD720201 chip).

I have read through a lot of seeminly similar symptoms people describe, but in the end they are all different - most report that systems won't go to sleep or the controller won't work after resume, but both is not the case for me, it's just the discontinuity of the device being re-attached under a different device name that makes things ugly.
Also, I have read proposals for work-arounds which essentially say to unload xhci_hcd before suspending, but that obviously would also not allow to keep any filesystem mounted on such a device for use after resume.

Any ideas?
Comment 1 Lutz Vieweg 2013-02-03 18:51:22 UTC
I just verified that the same bug exists in linux-3.8-rc6.

(I had read about recent xhci power management related changes, but those seem to address other issued.)
Comment 2 Sarah Sharp 2013-02-20 19:40:07 UTC
Please recompile the latest 3.8 kernel with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and post the full dmesg.

I'm wondering in particular if your host controller fails to restore its registers on S3 resume.  If it does, we have to tell the USB core that the host controller lost power, disconnect all devices, and re-enumerate them.  Software can't do anything about it if the host loses power across S3.
Comment 3 Lutz Vieweg 2013-02-21 14:43:20 UTC
I recompiled & installed the latest 3.8 kernel with CONFIG_USB_DEBUG and
CONFIG_USB_XHCI_HCD_DEBUGGING on - which happenes to write a ridiculous amount of messages to dmesg - but since you asked for the full dmesg outpupt, I can only hope the buffer was large enough to capture all the relevant lines. Find the full dmesg output attached.

I should have mentioned before that this is a regression in comparison to linux-3.6.6: 

Using Linux-3.6.6, the external Toshiba drive is powered down during S3 sleep (the LED on the external drive goes off and the disk spins down). After S3 sleep, the drive is readily available under its existing device, existing mount etc.

Using Linux-3.8, the external drive stays powered up during S3 sleep (LED stays lit and disk remains spinning), but after S3 sleep, the external drive gets a new device assigned, and existing mounts or device mappers are broken.
Comment 4 Lutz Vieweg 2013-02-21 14:44:38 UTC
Created attachment 93791 [details]
dmesg output under linux-3.8 with debugging options enabled
Comment 5 Lutz Vieweg 2013-02-21 14:51:00 UTC
Comment on attachment 93791 [details]
dmesg output under linux-3.8 with debugging options enabled

This dmesg output was created while using the following xhci controller: Intel Corporation Panther Point USB xHCI Host Controller (rev 04)
Comment 6 devsk 2013-12-07 06:45:53 UTC
if you are still wondering, see this: http://marc.info/?l=linux-usb&m=137714769606183&w=2

Why does no one update bugs these days?
Comment 7 Sarah Sharp 2013-12-09 17:08:46 UTC
devsk: That patch was an RFC that was not merged into Linus' tree because my analysis of the situation was wrong.  The real fix was merged in commit 8b3d45705e54075cfb9d4212dbca9ea82c85c4b8 "usb: Fix xHCI host issues on remote wakeup."

https://git.kernel.org/cgit/linux/kernel/git/sarah/xhci.git/commit/?id=8b3d45705e54075cfb9d4212dbca9ea82c85c4b8


Further, that patch is not relevant to this bug report because mass storage devices don't signal remote wakup.
Comment 8 devsk 2013-12-09 17:38:59 UTC
Sarah: you mean this bug is not fixed even after the commit 8b3d45705e54075cfb9d4212dbca9ea82c85c4b8?

In that case, do we have a handle on this bug?
Comment 9 Lutz Vieweg 2013-12-21 00:42:59 UTC
After a hint from Linus on the subsurface mailing list I retried with a recent 3.12 kernel and - at least for the Intel xhci controller - the block device now seems to survive the resume from S3.

I intended to check on two other machines (with different xhci controllers from NEC / Renesas) if resuming works there, as well, but this is currently made impossible by the nouveau driver crashing the kernel upon resume...
Comment 10 devsk 2013-12-21 02:00:21 UTC
Lutz: do you happen to have a link for that conversation handy?
Comment 11 devsk 2013-12-21 04:36:44 UTC
I ran into the same disconnect and reappearance as a different device ID with 3.12.4 after 4 suspend-to-RAM cycles. So, the issue is definitely not fixed in 3.12.4.

Lutz: please update us on the Linus conversation because I am very interested in getting this bug fixed. Thanks!
Comment 12 Lutz Vieweg 2013-12-21 14:26:12 UTC
@devsk: There was no more than this hint: http://article.gmane.org/gmane.sports.diving.subsurface/49
Comment 13 devsk 2013-12-21 15:27:27 UTC
Ok. I don't know what 3.12 version he was talking about. Can you please ask on the list which git commit he was talking about? Because it looked like he was talking about current git.
Comment 14 Lutz Vieweg 2013-12-21 22:59:50 UTC
@devsk: I don't like to write too much off-topic stuff to the subsurface mailing list, what I can tell for sure without asking is that in my above described experiment the xhci block device survived a resume with kernel 3.12.6 (as in: https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.12.6.tar.xz ).
Comment 15 devsk 2013-12-22 08:21:56 UTC
I don't see any commit in 3.12.5 and 3.12.6 which is relevant to XHCI disconnects. So, I don't know. I will try 3.12.6 though!
Comment 16 devsk 2013-12-27 04:41:50 UTC
So, updated to 3.12.6 and it is still the same issue.
Comment 17 Sarah Sharp 2014-01-02 22:45:51 UTC
@devsk: Please file a separate bug report, since Lutz reports that his bug is fixed by 3.12.6.  Make sure to post dmesg with CONFIG_USB_DEBUG enabled.  It's possible your bug is different from Lutz', so I'm closing this one.
Comment 18 devsk 2014-01-03 00:51:42 UTC
Well, it looks like its fixed to him because it works some of the times. At best of the runs, I was able to suspend and resume 4 times in succession but then 5th try failed.

This bug is so apparent on all my laptops and desktop machines that I find it hard to believe that it can't be reproduced by devs easily. All you have to do is to create an FS on a USB drive, plug it in and suspend. Resume will kill the FS and no IO will be possible to that FS.
Comment 19 Pratyush Anand 2014-07-15 11:31:38 UTC
I also found similar bug with my embedded USB host running 3.10 kernel and also with a desktop PC running 3.13.6-100.fc19. I have posted a RFC which resolves issue at my end. Please review the patch and test it at your end if possible.

Patch is here:

http://marc.info/?l=linux-usb&m=140542120320644&w=2
Comment 20 devsk 2014-07-16 21:25:44 UTC
Thanks Pratyush! Looks like this is it. I have the patch applied to 3.15.5 and it has correctly resumed 7 times, the longest streak of success so far. I will update if I see the fail again as I continue to suspend and resume throughout the day to test it.

Note that I changed the patch to do 50 retries instead of 20, because if it comes out before retries finish, we break the loop anyway. I am not entirely convinced that those timeouts can be so tightly controlled and sum upto 398ms and then getting approximated to 400ms is enough. What's the harm in going with 1second instead of 400ms? I definitely don't mind my laptop waking up 600ms later in the worst case as long as I avoid corrupt filesystem. Are there other side effects of retries > 20?
Comment 21 Pratyush Anand 2014-07-17 09:10:40 UTC
Yes, I gave a second thought to timeout calculation and I agree that this can not be a way to calculate such timeout. I have been reported that there are some devices which takes even 2S for link train.

I got one sony usb stick (054C:05B8) which does not take 2S but certainly it takes more than 400 ms. I took the analyzer log with this device which tells me that device switch on the termination after long delay of host enabling the VBUS.With another usb disk I observed that device fails to negotiate link training in first attempt. However, succesfull link training happens in second attempt.

So the question remains, what should be the timeout? I think we should go with 2S timeout to support all the buggy devices reported so far.

For the good device overhead is almost none. While for the bad devices penalty could be the time which it take for link training.

Only side effect of long timeout: If a device was connected before suspend, and was removed while system was asleep, then the penalty would be the timeout ie
2000 ms.

I think we should go with wait loop as follows:

static int wait_for_ss_port_enable(struct usb_device *udev,
		struct usb_hub *hub, int *port1,
		u16 *portchange, u16 *portstatus)
{
	int status = 0, delay_ms = 0;

	while (delay_ms < 2000) {
		if (status || *portstatus & USB_PORT_STAT_CONNECTION)
			break;
		msleep(20);
		delay_ms += 20;
		status = hub_port_status(hub, *port1, portstatus, portchange);
	}
	return status;
}
Comment 22 devsk 2014-07-17 16:26:01 UTC
The patch with 1000ms wait is working great for me. 13 successful resumes without device renaming so far.

My overall resume time is between 2-3 seconds, which is perfect for me. I don't believe in sub-second resume time, I have plenty of patience...:) 2-3 seconds of resume time and it helps with reducing writes to the USB drive when I hibernate to my swap file. What else could a geek ask for!
Comment 23 devsk 2014-07-17 16:27:02 UTC
We should push this fix in stable releases. FS corruption is a severe bug in my opinion.
Comment 24 devsk 2014-07-21 17:56:33 UTC
Still working after 21 suspend/resume cycles. Any plans of getting this merged in stable?
Comment 25 Pratyush Anand 2014-07-21 23:57:54 UTC
A V2 of patch was submitted with cc to stable mailing list and it was acked. Currently available in gregkh's usb-next branch.
Comment 26 devsk 2014-07-21 23:59:02 UTC
Do you have a link to latest submit request?
Comment 27 Pratyush Anand 2014-07-22 00:04:16 UTC
http://www.spinics.net/lists/linux-usb/msg110762.html
Comment 28 devsk 2014-07-25 23:32:53 UTC
Pratyush: can I ask you a favour? Would you be kind enough to update this bug when the patch does make it to a stable release? Thanks!
Comment 29 Pratyush Anand 2014-08-28 03:29:48 UTC
Patch has been applied to 3.10-stable, 3.14-stable and 3.16-stable.
Comment 30 Mvkiran 2016-11-07 16:11:20 UTC
Hi,
   This issue still exists in mainline kernel 4.8.0 also.
 The tried the below scenario, installed linux kernel in eMMC and playing a video file from USB3.0 kingston pendrive, after putting the system in to low power state (echo mem > /sys/power/state) and with keyboard trigger, video is not resumed back. there is FS corruption and IO error, Can any one help me find a fix for this issue?
Comment 31 Greg Kroah-Hartman 2016-11-07 16:17:47 UTC
On Mon, Nov 07, 2016 at 04:11:20PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:


All USB bugs should be sent to the linux-usb@vger.kernel.org mailing
list, and not entered into bugzilla.  Please bring this issue up there,
if it is still a problem in the latest kernel release.
Comment 32 Mvkiran 2016-11-09 02:28:31 UTC
I tried posting this issue to "linux-usb@vger.kernel.org" but it email bounced, is there any other way?
Comment 33 Greg Kroah-Hartman 2016-11-09 06:02:18 UTC
On Wed, Nov 09, 2016 at 02:28:31AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> I tried posting this issue to "linux-usb@vger.kernel.org" but it email
> bounced,

Turn off html email.