Bug 53211
| Summary: | USB 3 storage device disconnects after S3 resume, "xhci_drop_endpoint called with unaddressed device" | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Lutz Vieweg (lvml) |
| Component: | USB | Assignee: | Greg Kroah-Hartman (greg) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | funtoos, mvkiran82, pratyush.anand, sarah |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.7.5, 3.8-rc6, 3.8.0 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Attachments: | dmesg output under linux-3.8 with debugging options enabled | | |

Description
Lutz Vieweg
2013-01-30 00:37:39 UTC
I just verified that the same bug exists in linux-3.8-rc6. (I had read about recent xhci power management related changes, but those seem to address other issues.)

Please recompile the latest 3.8 kernel with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING turned on, and post the full dmesg. I'm wondering in particular if your host controller fails to restore its registers on S3 resume. If it does, we have to tell the USB core that the host controller lost power, disconnect all devices, and re-enumerate them. Software can't do anything about it if the host loses power across S3.

I recompiled & installed the latest 3.8 kernel with CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING on, which happens to write a ridiculous amount of messages to dmesg. Since you asked for the full dmesg output, I can only hope the buffer was large enough to capture all the relevant lines. Find the full dmesg output attached.

I should have mentioned before that this is a regression in comparison to linux-3.6.6:

Using Linux-3.6.6, the external Toshiba drive is powered down during S3 sleep (the LED on the external drive goes off and the disk spins down). After S3 sleep, the drive is readily available under its existing device, existing mount etc.

Using Linux-3.8, the external drive stays powered up during S3 sleep (LED stays lit and disk remains spinning), but after S3 sleep, the external drive gets a new device assigned, and existing mounts or device mappers are broken.

Created attachment 93791 [details]
dmesg output under linux-3.8 with debugging options enabled
Comment on attachment 93791 [details]
dmesg output under linux-3.8 with debugging options enabled
This dmesg output was created while using the following xhci controller: Intel Corporation Panther Point USB xHCI Host Controller (rev 04)
If you are still wondering, see this: http://marc.info/?l=linux-usb&m=137714769606183&w=2 Why does no one update bugs these days?

devsk: That patch was an RFC that was not merged into Linus' tree because my analysis of the situation was wrong. The real fix was merged in commit 8b3d45705e54075cfb9d4212dbca9ea82c85c4b8 "usb: Fix xHCI host issues on remote wakeup." https://git.kernel.org/cgit/linux/kernel/git/sarah/xhci.git/commit/?id=8b3d45705e54075cfb9d4212dbca9ea82c85c4b8 Further, that patch is not relevant to this bug report because mass storage devices don't signal remote wakeup.

Sarah: you mean this bug is not fixed even after the commit 8b3d45705e54075cfb9d4212dbca9ea82c85c4b8? In that case, do we have a handle on this bug?

After a hint from Linus on the subsurface mailing list I retried with a recent 3.12 kernel and, at least for the Intel xhci controller, the block device now seems to survive the resume from S3. I intended to check on two other machines (with different xhci controllers from NEC / Renesas) whether resuming works there as well, but this is currently made impossible by the nouveau driver crashing the kernel upon resume...

Lutz: do you happen to have a link for that conversation handy?

I ran into the same disconnect and reappearance as a different device ID with 3.12.4 after 4 suspend-to-RAM cycles. So, the issue is definitely not fixed in 3.12.4. Lutz: please update us on the Linus conversation because I am very interested in getting this bug fixed. Thanks!

@devsk: There was no more than this hint: http://article.gmane.org/gmane.sports.diving.subsurface/49

Ok. I don't know what 3.12 version he was talking about. Can you please ask on the list which git commit he was talking about? Because it looked like he was talking about current git.

@devsk: I don't like to write too much off-topic stuff to the subsurface mailing list. What I can tell for sure without asking is that in the experiment described above, the xhci block device survived a resume with kernel 3.12.6 (as in: https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.12.6.tar.xz ).

I don't see any commit in 3.12.5 and 3.12.6 which is relevant to XHCI disconnects. So, I don't know. I will try 3.12.6 though!

So, updated to 3.12.6 and it is still the same issue.

@devsk: Please file a separate bug report, since Lutz reports that his bug is fixed by 3.12.6. Make sure to post dmesg with CONFIG_USB_DEBUG enabled. It's possible your bug is different from Lutz', so I'm closing this one.

Well, it looks like it's fixed for him because it works some of the time. In the best run, I was able to suspend and resume 4 times in succession, but then the 5th try failed. This bug is so apparent on all my laptops and desktop machines that I find it hard to believe that it can't be reproduced by devs easily. All you have to do is create an FS on a USB drive, plug it in and suspend. Resume will kill the FS and no IO will be possible to that FS.

I also found a similar bug with my embedded USB host running a 3.10 kernel and also with a desktop PC running 3.13.6-100.fc19. I have posted an RFC which resolves the issue at my end. Please review the patch and test it at your end if possible. Patch is here: http://marc.info/?l=linux-usb&m=140542120320644&w=2

Thanks Pratyush! Looks like this is it. I have the patch applied to 3.15.5 and it has correctly resumed 7 times, the longest streak of success so far. I will update if I see it fail again as I continue to suspend and resume throughout the day to test it. Note that I changed the patch to do 50 retries instead of 20, because if the device comes back before the retries finish, we break out of the loop anyway. I am not entirely convinced that those timeouts can be so tightly controlled, or that summing them up to 398 ms and then rounding to 400 ms is enough.
What's the harm in going with 1 second instead of 400 ms? I definitely don't mind my laptop waking up 600 ms later in the worst case as long as I avoid a corrupt filesystem. Are there other side effects of retries > 20?

Yes, I gave the timeout calculation a second thought and I agree that this cannot be the way to calculate such a timeout. It has been reported to me that some devices take as long as 2 s for link training. I got one Sony USB stick (054C:05B8) which does not take 2 s, but it certainly takes more than 400 ms. I took an analyzer log with this device, which tells me that the device switches on the termination a long time after the host enables VBUS. With another USB disk I observed that the device fails to negotiate link training in the first attempt. However, successful link training happens in the second attempt.

So the question remains, what should the timeout be? I think we should go with a 2 s timeout to support all the buggy devices reported so far. For a good device the overhead is almost none, while for a bad device the penalty is the time it takes for link training.

The only side effect of a long timeout: if a device was connected before suspend and was removed while the system was asleep, then the penalty would be the full timeout, i.e. 2000 ms.

I think we should go with a wait loop as follows:

    static int wait_for_ss_port_enable(struct usb_device *udev,
            struct usb_hub *hub, int *port1,
            u16 *portchange, u16 *portstatus)
    {
        int status = 0, delay_ms = 0;

        /*
         * Poll the port every 20 ms, for at most 2000 ms, until it
         * reports a connection or reading the port status fails.
         */
        while (delay_ms < 2000) {
            if (status || *portstatus & USB_PORT_STAT_CONNECTION)
                break;
            msleep(20);
            delay_ms += 20;
            status = hub_port_status(hub, *port1, portstatus, portchange);
        }
        return status;
    }

The patch with 1000 ms wait is working great for me. 13 successful resumes without device renaming so far. My overall resume time is between 2-3 seconds, which is perfect for me.
I don't believe in sub-second resume time, I have plenty of patience...:) 2-3 seconds of resume time and it helps with reducing writes to the USB drive when I hibernate to my swap file. What else could a geek ask for! We should push this fix in stable releases. FS corruption is a severe bug in my opinion.

Still working after 21 suspend/resume cycles. Any plans of getting this merged in stable?

A V2 of the patch was submitted with cc to the stable mailing list and it was acked. Currently available in gregkh's usb-next branch.

Do you have a link to the latest submit request?

Pratyush: can I ask you a favour? Would you be kind enough to update this bug when the patch does make it to a stable release? Thanks!

Patch has been applied to 3.10-stable, 3.14-stable and 3.16-stable.

Hi, this issue still exists in mainline kernel 4.8.0 also. I tried the scenario below: I installed the Linux kernel on eMMC and played a video file from a USB 3.0 Kingston pen drive. After putting the system into a low power state (echo mem > /sys/power/state) and waking it with a keyboard trigger, the video does not resume. There is FS corruption and an IO error. Can anyone help me find a fix for this issue?

On Mon, Nov 07, 2016 at 04:11:20PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:

All USB bugs should be sent to the linux-usb@vger.kernel.org mailing list, and not entered into bugzilla. Please bring this issue up there, if it is still a problem in the latest kernel release.

I tried posting this issue to "linux-usb@vger.kernel.org" but the email bounced. Is there any other way?

On Wed, Nov 09, 2016 at 02:28:31AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> I tried posting this issue to "linux-usb@vger.kernel.org" but the email
> bounced,

Turn off html email.
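
For readers who want to experiment with the timeout trade-off discussed in this thread without building a kernel, the following is a minimal user-space sketch of the same bounded polling pattern (poll every 20 ms, stop on connection, error, or timeout). It is not the kernel patch: `wait_for_connection`, `simulate_resume`, `struct sim_device` and the 450 ms link-up delay used below are hypothetical names and values chosen for illustration, and simulated time is advanced by a counter instead of calling `msleep()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POLL_INTERVAL_MS 20

/* Hypothetical callback: reports whether the port sees a connection.
 * Returns 0 on success, non-zero on error (mirrors the status check
 * in the loop quoted in this thread). */
typedef int (*port_status_fn)(void *ctx, bool *connected);

struct wait_result {
	int status;      /* last status returned by the callback */
	int waited_ms;   /* simulated time spent waiting */
	bool connected;  /* whether a connection was seen */
};

/* Poll until connected, an error occurs, or timeout_ms has elapsed. */
struct wait_result wait_for_connection(port_status_fn poll, void *ctx,
				       bool connected, int timeout_ms)
{
	struct wait_result r = { 0, 0, connected };

	assert(poll != NULL);
	while (r.waited_ms < timeout_ms) {
		if (r.status || r.connected)
			break;
		/* Kernel code would sleep here; the sketch only counts. */
		r.waited_ms += POLL_INTERVAL_MS;
		r.status = poll(ctx, &r.connected);
	}
	return r;
}

/* Hypothetical simulated device: the link comes up link_up_after_ms
 * after resume, or never if the value is negative (device removed). */
struct sim_device {
	int link_up_after_ms;
	int now_ms;
};

static int sim_port_status(void *ctx, bool *connected)
{
	struct sim_device *dev = ctx;

	dev->now_ms += POLL_INTERVAL_MS;
	*connected = dev->link_up_after_ms >= 0 &&
		     dev->now_ms >= dev->link_up_after_ms;
	return 0;
}

/* Run one simulated resume with the given link-up delay and timeout. */
struct wait_result simulate_resume(int link_up_after_ms, int timeout_ms)
{
	struct sim_device dev = { link_up_after_ms, 0 };

	return wait_for_connection(sim_port_status, &dev,
				   link_up_after_ms == 0, timeout_ms);
}
```

Under these assumptions, `simulate_resume(450, 400)` gives up without a connection, `simulate_resume(450, 2000)` succeeds once the simulated link is up, and `simulate_resume(-1, 2000)` pays the full 2000 ms, which matches the "device removed while asleep" penalty described above. A device that is already connected costs no wait at all.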