Bug 29902

Summary: System suspend failed after 1-2 successful resumes
Product: Drivers Reporter: Alex Terekhov (a.terekhov)
Component: USBAssignee: Greg Kroah-Hartman (greg)
Status: CLOSED CODE_FIX    
Severity: normal CC: florian, maciej.rutecki, rjw, stern, tj
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38-rc6 Subsystem:
Regression: No Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: Syslog
lspci -vvv
dmesg
dmesg with CONFIG_USB_DEBUG
usbmon trace
Use atomic flags instead of hcd->state
dmesg with uhci_hcd debug param = 2
dmesg for the first case
dmesg for the second case
Add delay during UHCI startup
Delay 50 microseconds
dmesg (suspend-resume for all USB ports)

Description Alex Terekhov 2011-02-25 22:24:29 UTC
Created attachment 49122 [details]
Syslog

After starting my system going to suspend, then resume the system. After that while trying to suspend my system again I get some warnings from libata and that's all.

Feb 26 00:10:06 immortal kernel: ------------[ cut here ]------------
Feb 26 00:10:06 immortal kernel: WARNING: at drivers/ata/libata-eh.c:3976 ata_scsi_error+0x14f/0x57d [libata]()
Feb 26 00:10:06 immortal kernel: Hardware name: VPCY11M1R
Feb 26 00:10:06 immortal kernel: Modules linked in: i915 drm_kms_helper drm i2c_algo_bit loop sha256_generic aes_i586 aes_generic cbc dm_crypt dm_mod arc4 ecb ath9k joydev snd_hda_codec_realtek mac80211 ath9k_common snd_hda_intel snd_hda_codec snd_hwdep snd_pcm ath9k_hw ath snd_seq uvcvideo snd_timer sony_laptop i2c_i801 cfg80211 snd_seq_device videodev psmouse battery snd ac rfkill evdev power_supply soundcore i2c_core snd_page_alloc video button processor ext3 jbd mbcache sd_mod uhci_hcd ahci libahci libata sdhci_pci sdhci scsi_mod ehci_hcd usbcore atl1c thermal mmc_core thermal_sys nls_base [last unloaded: scsi_wait_scan]
Feb 26 00:10:06 immortal kernel: Pid: 184, comm: scsi_eh_0 Not tainted 2.6.38-rc6 #15
Feb 26 00:10:06 immortal kernel: Call Trace:
Feb 26 00:10:06 immortal kernel: [<c10304a8>] ? warn_slowpath_common+0x6a/0x7b
Feb 26 00:10:06 immortal kernel: [<f88f1875>] ? ata_scsi_error+0x14f/0x57d [libata]
Feb 26 00:10:06 immortal kernel: [<c10304c6>] ? warn_slowpath_null+0xd/0x10
Feb 26 00:10:06 immortal kernel: [<f88f1875>] ? ata_scsi_error+0x14f/0x57d [libata]
Feb 26 00:10:06 immortal kernel: [<f88395f2>] ? scsi_error_handler+0xaa/0x36f [scsi_mod]
Feb 26 00:10:06 immortal kernel: [<c10266c9>] ? complete+0x28/0x36
Feb 26 00:10:06 immortal kernel: [<f8839548>] ? scsi_error_handler+0x0/0x36f [scsi_mod]
Feb 26 00:10:06 immortal kernel: [<c1042c61>] ? kthread+0x63/0x68
Feb 26 00:10:06 immortal kernel: [<c1042bfe>] ? kthread+0x0/0x68
Feb 26 00:10:06 immortal kernel: [<c100933e>] ? kernel_thread_helper+0x6/0x10
Feb 26 00:10:06 immortal kernel: ---[ end trace 19a7d3ffb6ba8aa9 ]---
Comment 1 Alex Terekhov 2011-02-25 22:25:07 UTC
Created attachment 49132 [details]
lspci -vvv
Comment 2 Rafael J. Wysocki 2011-02-25 22:40:45 UTC
Does it only fail to suspend, or does it also crash or something?

If it doesn't crash during the failing suspend, please attach full dmesg
output including the failing attempt to suspend.
Comment 3 Alex Terekhov 2011-02-25 23:56:51 UTC
Created attachment 49152 [details]
dmesg
Comment 4 Alex Terekhov 2011-02-25 23:58:11 UTC
No, it doesn't crash or something else, It just fails to suspend
Comment 5 Rafael J. Wysocki 2011-02-26 10:26:43 UTC
Here's the source of the problem:

uhci_hcd 0000:00:1d.2: Root hub is not suspended
pci_pm_suspend(): hcd_pci_suspend+0x0/0x19 [usbcore] returns -16
sdhci-pci 0000:03:00.0: PCI INT A disabled
pm_op(): pci_pm_suspend+0x0/0xbe returns -16
PM: Device 0000:00:1d.2 failed to suspend async: error -16

(reassigning to USB).

Is the problem reproducible after you've done

# echo 0 > /sys/power/pm_async

?
Comment 6 Rafael J. Wysocki 2011-02-26 10:27:59 UTC
BTW, the SATA messages are from the error code path that's not the nicest
one, so to speak, but they are harmless.
Comment 7 Tejun Heo 2011-02-26 11:12:41 UTC
Hmm.. the WARN_ON() indicates that ATA port was requested to resume without successfully completing suspend.  Does PM core calls in resume for devices which haven't successfully complete suspend on suspend failure?  Or is libata doing something wrong?
Comment 8 Alex Terekhov 2011-02-26 12:45:40 UTC
(In reply to comment #5)
 
> Is the problem reproducible after you've done
> 
> # echo 0 > /sys/power/pm_async
> 
> ?

Hi, thanks, the problem didn't encounter after that.
Comment 9 Alan Stern 2011-02-26 14:22:31 UTC
Is the USB problem reproducible?  If it is, please build a kernel with CONFIG_USB_DEBUG and CONFIG_USB_MON enabled and post the dmesg log.  During the same test, acquire a usbmon trace (see Documentation/usb/usbmon.txt) using the 0u file, and attach that to the bug report as well.
Comment 10 Alex Terekhov 2011-02-26 15:24:50 UTC
(In reply to comment #9)
> Is the USB problem reproducible?  If it is, please build a kernel with

After

# echo 0 > /sys/power/pm_async

USB problem isn't reproducible.

Do you want me to get a usbmon trace without previous command?
Comment 11 Alex Terekhov 2011-02-26 16:02:47 UTC
Created attachment 49242 [details]
dmesg with CONFIG_USB_DEBUG
Comment 12 Alex Terekhov 2011-02-26 16:03:27 UTC
Created attachment 49252 [details]
usbmon trace
Comment 13 Alan Stern 2011-02-26 19:42:23 UTC
I'm a little surprised that pm_async made any difference.

Anyway, looking through your dmesg log, the problem starts with these two errors:

usb usb6: usb resume
usb usb6: wakeup_rh
uhci_hcd 0000:00:1d.0: host controller halted, very bad!
uhci_hcd 0000:00:1d.0: HC died; cleaning up
usb usb7: usb resume
usb usb7: wakeup_rh
usb usb8: usb resume
usb usb8: wakeup_rh
uhci_hcd 0000:00:1d.2: host controller halted, very bad!
uhci_hcd 0000:00:1d.2: HC died; cleaning up

It's not clear why this happened.  You'll get a little more information if you do "echo 2 >/sys/module/uhci_hcd/parameters/debug" before starting the first suspend.  (And forget about the usbmon trace, it's not going to help.)

But knowing why it happened won't fix the bug.  I'll write up a patch in a few days.
Comment 14 Alan Stern 2011-02-28 20:46:46 UTC
Created attachment 49662 [details]
Use atomic flags instead of hcd->state

Here's a patch that ought to fix the problem.  But I'd still like to see the debugging output; your UHCI controllers should not have died like that.  Or maybe they didn't die, maybe they're just slow to get started.
Comment 15 Alex Terekhov 2011-03-01 20:49:15 UTC
Created attachment 49782 [details]
dmesg with uhci_hcd debug param = 2
Comment 16 Alex Terekhov 2011-03-01 23:07:10 UTC
Created attachment 49822 [details]
dmesg for the first case

Alan, looks like USB + suspend still not working properly.
1. unplug after suspend, plug in after resume or replug after suspend to another USB-socket
2. right it dies without any plugging/unplugging

Both logs attached
Comment 17 Alex Terekhov 2011-03-01 23:07:49 UTC
Created attachment 49832 [details]
dmesg for the second case

The second log
Comment 18 Alan Stern 2011-03-02 16:17:00 UTC
Created attachment 49902 [details]
Add delay during UHCI startup

It looks like the patch worked perfectly.  Even though the host controllers died, you were able to go ahead and suspend the system afterwards with no problem.

The next problem is why the controllers died.  I can't tell exactly, but it looks like there's a race with the hardware.  This new patch adds a short (10 microsecond) delay, which may give the hardware enough time to start up properly.  Try it along with the other patch and see what happens.  Also try changing the length of the delay; see if you can find the shortest value that will work.
Comment 19 Alex Terekhov 2011-03-02 20:39:28 UTC
Created attachment 49942 [details]
Delay 50 microseconds

Nothing changed, even when I set 50 microseconds delay.
Comment 20 Alan Stern 2011-03-02 22:10:03 UTC
I don't know what's causing this problem with your controllers, or how to fix it.  It may be a hardware issue.  Have you tried plugging the two affected devices (the Saitek Cyborg USB Stick and the Broadcom Bluetooth Device) into different ports, connected to different controllers?  For some reason the 0000:00:1a.0 - 0000:00:1a.2 controllers don't seem to have this problem.
Comment 21 Rafael J. Wysocki 2011-03-06 21:42:32 UTC
Dropping from the list of recent regressions.
Comment 22 Alan Stern 2011-03-07 15:02:19 UTC
There's one more thing you can try.  Before starting a suspend, do:

   echo -n 0000:00:1d.0 >/sys/bus/pci/drivers/uhci_hcd/unbind
   echo -n 0000:00:1d.2 >/sys/bus/pci/drivers/uhci_hcd/unbind

Then after the resume, do:

   echo -n 0000:00:1d.0 >/sys/bus/pci/drivers/uhci_hcd/bind
   echo -n 0000:00:1d.2 >/sys/bus/pci/drivers/uhci_hcd/bind

If you get the "controller died" or similar errors during the "bind" part, then I think there's nothing more that can be done.  Either way, the dmesg log will be informative.
Comment 23 Alex Terekhov 2011-03-11 16:44:26 UTC
Created attachment 50632 [details]
dmesg (suspend-resume for all USB ports)

Hi, Alan

Sorry for delay, looks like you're right

I've tested all USB ports and looks like one of them works fine and two other not.

Thanks for your last suggestions, I think making rmmod uhci_hcd before suspend and modprobe uhci_hcd after resume would be the easiest way, don't you think so?
Comment 24 Alan Stern 2011-03-11 18:00:57 UTC
Yes, using rmmod would be the easier than doing those "echo" commands, if uhci-hcd is built as a module in your kernel.  But until you try it, there's no way to know whether the problem will still occur.

Your experiment does show pretty clearly that two of the USB ports don't work right.  This definitely looks like a hardware bug.
Comment 25 Florian Mickler 2011-03-27 10:51:54 UTC
A commit referencing this bug has been merged for .39-rc1:

commit 9b37596a2e860404503a3f2a6513db60c296bfdc
Author: Alan Stern <stern@rowland.harvard.edu>
Date:   Mon Mar 7 11:11:52 2011 -0500

    USB: move usbcore away from hcd->state
Comment 26 Alan Stern 2011-03-27 14:48:17 UTC
It looks like this bug report can be closed.  The software part has been fixed, and the rest is just a hardware problem.
Comment 27 Alex Terekhov 2011-03-28 10:53:39 UTC
Thanks, Alan
Comment 28 Florian Mickler 2011-05-30 07:57:27 UTC
A patch referencing a commit referencing this bug report has been merged in v3.0-rc1:

commit 69fff59de4d844f8b4c2454c3c23d32b69dcbfd7
Author: Alan Stern <stern@rowland.harvard.edu>
Date:   Tue May 17 17:27:12 2011 -0400

    USB: remove remaining usages of hcd->state from usbcore and fix regression