Bug 10345

Summary: USB HID problems after resume
Product: Drivers Reporter: Rafael J. Wysocki (rjw)
Component: USBAssignee: Greg Kroah-Hartman (greg)
Status: CLOSED CODE_FIX    
Severity: normal CC: bunk, dbrownell, mlord, rjw, stern
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.25-rc7 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 9832    
Attachments: complete syslog from boot to suspend to resume, plus alt-sysrq-T at the end
Kernel .config file from latest incident, gzip'd.
Additional syslog entries + lsmod, after hung rmmod ehci_hcd
Complete syslog of incident.
.config file used with 2.6.25-rc8

Description Rafael J. Wysocki 2008-03-27 15:03:07 UTC
Subject    : 2.6.25-rc7:  Ugh. 
Submitter  : Mark Lord <lkml@rtr.ca>
Date       : 2008-03-27 11:29
References : http://lkml.org/lkml/2008/3/27/228
Handled-By : Greg KH <gregkh@suse.de>

This entry is being used for tracking a regression from 2.6.24.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Adrian Bunk 2008-03-28 05:26:05 UTC
Patch: http://lkml.org/lkml/2008/3/28/58
Comment 2 Rafael J. Wysocki 2008-03-28 15:35:44 UTC
Handled-By : Mark Lord <lkml@rtr.ca>
References : http://lkml.org/lkml/2008/3/28/346

Mark said:

"Also, for now it seems to be "resolved" via a seemingly-unrelated
change in the .config file on that machine.  If it dies again,
I'll re-post and investigate further."
Comment 3 Anonymous Emailer 2008-03-29 10:13:53 UTC
Reply-To: lkml@rtr.ca

bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10345
> 
> 
> rjw@sisk.pl changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |REJECTED
>          Resolution|                            |UNREPRODUCIBLE
> 
> 
> 
> 
> ------- Comment #2 from rjw@sisk.pl  2008-03-28 15:35 -------
> Handled-By : Mark Lord <lkml@rtr.ca>
> References : http://lkml.org/lkml/2008/3/28/346
> 
> Mark said:
> 
> "Also, for now it seems to be "resolved" via a seemingly-unrelated
> change in the .config file on that machine.  If it dies again,
> I'll re-post and investigate further."
..

Just happened again.  The machine resumed from RAM with no functioning USB.

WTF?
Comment 4 Anonymous Emailer 2008-03-29 10:15:43 UTC
Reply-To: lkml@rtr.ca

bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10345
>
>
> rjw@sisk.pl changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |REJECTED
>          Resolution|                            |UNREPRODUCIBLE
>
>
>
>
> ------- Comment #2 from rjw@sisk.pl  2008-03-28 15:35 -------
> Handled-By : Mark Lord <lkml@rtr.ca>
> References : http://lkml.org/lkml/2008/3/28/346
>
> Mark said:
>
> "Also, for now it seems to be "resolved" via a seemingly-unrelated
> change in the .config file on that machine.  If it dies again,
> I'll re-post and investigate further."
..

Just happened again.  The machine resumed from RAM with no functioning USB.
Still running the exact same 2.6.25-rc7 kernel as before, with the RTC conflict fixed.

WTF?
Comment 5 David Brownell 2008-03-29 12:00:48 UTC
One would *never* expect RTC activity to affect USB ...
Comment 6 Alan Stern 2008-03-29 15:02:06 UTC
On Sat, 29 Mar 2008, Mark Lord wrote:

> Just happened again.  The machine resumed from RAM with no functioning USB.
> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC
> conflict fixed.
> 
> WTF?

The first step in debugging this is to build a kernel with
CONFIG_USB_DEBUG and send the dmesg log following an unsuccessful
resume.  Include events from before the suspend; don't strip them out.  
If the symptoms are the same as before then you should also get a stack
dump for the ksuspend_usbd task.

Alan Stern
Comment 7 Anonymous Emailer 2008-03-30 02:30:43 UTC
Reply-To: oliver@neukum.org

Am Samstag, 29. März 2008 18:15:40 schrieb Mark Lord:
> bugme-daemon@bugzilla.kernel.org wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=10345
> >
> >
> > rjw@sisk.pl changed:
> >
> >            What    |Removed                     |Added
> >
> ----------------------------------------------------------------------------
> >              Status|NEW                         |REJECTED
> >          Resolution|                            |UNREPRODUCIBLE
> >
> >
> >
> >
> > ------- Comment #2 from rjw@sisk.pl  2008-03-28 15:35 -------
> > Handled-By : Mark Lord <lkml@rtr.ca>
> > References : http://lkml.org/lkml/2008/3/28/346
> >
> > Mark said:
> >
> > "Also, for now it seems to be "resolved" via a seemingly-unrelated
> > change in the .config file on that machine.  If it dies again,
> > I'll re-post and investigate further."
> ..
> 
> Just happened again.  The machine resumed from RAM with no functioning USB.
> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC
> conflict fixed.
> 
> WTF?

Is khubd stuck in D state?

	Regards
		Oliver
Comment 8 Anonymous Emailer 2008-03-30 04:48:01 UTC
Reply-To: pavel@ucw.cz

On Sat 2008-03-29 13:15:40, Mark Lord wrote:
> bugme-daemon@bugzilla.kernel.org wrote:
>> http://bugzilla.kernel.org/show_bug.cgi?id=10345
>>
>>
>> rjw@sisk.pl changed:
>>
>>            What    |Removed                     |Added
>> ----------------------------------------------------------------------------
>>              Status|NEW                         |REJECTED
>>          Resolution|                            |UNREPRODUCIBLE
>>
>>
>>
>>
>> ------- Comment #2 from rjw@sisk.pl  2008-03-28 15:35 -------
>> Handled-By : Mark Lord <lkml@rtr.ca>
>> References : http://lkml.org/lkml/2008/3/28/346
>>
>> Mark said:
>>
>> "Also, for now it seems to be "resolved" via a seemingly-unrelated
>> change in the .config file on that machine.  If it dies again,
>> I'll re-post and investigate further."
> ..
>
> Just happened again.  The machine resumed from RAM with no functioning USB.
> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC
> conflict fixed.
>
> WTF?

>3GB of RAM? Try iommu=soft.
									Pavel
Comment 9 Mark Lord 2008-03-30 10:35:26 UTC
Just got it again, after an overnight suspend.  It seems much more prone to happening when the suspended interval is long-ish (hours), than when short (minutes).

I tried to rmmod all loaded USB modules afterward:  got usb-storage, visor, usbserial, and uhci-hcd unloaded just fine.  Then tried ehci_hcd, and rmmod hung:


rmmod         D f5d06dc0     0  4547   4470
       f5dacdd4 00000086 f698af20 f5d06dc0 f5d06f0c c283c940 00000001 f6cf2dc0
       c012b440 00000000 f698af20 c0373550 f5dacddc c0113ca1 00000000 00000001
       7fffffff 7fffffff f5dace38 f5dace14 c029f64f f5dacdfc f6eeece4 00000082
Call Trace:
 [autoremove_wake_function+15/51] ? autoremove_wake_function+0xf/0x33
 [__wake_up_common+47/90] ? __wake_up_common+0x2f/0x5a
 [schedule_timeout+22/139] schedule_timeout+0x16/0x8b
 [_spin_lock_irq+43/51] ? _spin_lock_irq+0x2b/0x33
 [wait_for_common+202/297] wait_for_common+0xca/0x129
 [default_wake_function+0/13] ? default_wake_function+0x0/0xd
 [wait_for_completion+18/20] wait_for_completion+0x12/0x14
 [__cancel_work_timer+281/352] __cancel_work_timer+0x119/0x160
 [wq_barrier_func+0/13] ? wq_barrier_func+0x0/0xd
 [cancel_work_sync+10/12] cancel_work_sync+0xa/0xc
 [<f88a68b8>] usb_remove_hcd+0x66/0xe1 [usbcore]
 [<f88b0128>] usb_hcd_pci_remove+0x18/0x6a [usbcore]
 [pci_device_remove+25/57] pci_device_remove+0x19/0x39
 [__device_release_driver+96/125] __device_release_driver+0x60/0x7d
 [driver_detach+130/193] driver_detach+0x82/0xc1
 [bus_remove_driver+99/127] bus_remove_driver+0x63/0x7f
 [driver_unregister+42/46] driver_unregister+0x2a/0x2e
 [pci_unregister_driver+33/100] pci_unregister_driver+0x21/0x64
 [<f890307a>] ehci_hcd_cleanup+0xd/0xf [ehci_hcd]
 [sys_delete_module+386/446] sys_delete_module+0x182/0x1be
 [down_read_trylock+55/65] ? down_read_trylock+0x37/0x41
 [up_read+22/43] ? up_read+0x16/0x2b
 [do_page_fault+584/1218] ? do_page_fault+0x248/0x4c2
 [sysenter_past_esp+95/145] sysenter_past_esp+0x5f/0x91
 =======================

Will build now with CONFIG_USB_DEBUG (thanks for the suggestion, Alan).
Comment 10 Mark Lord 2008-03-30 10:37:24 UTC
> Is khubd stuck in D state?

Dunno -- it didn't show up in the alt-sysrq-T dump before I rebooted, probably because /proc/kmsg overflowed.  I'll check next time, and have doubled /proc/kmsg size too.
Comment 11 Mark Lord 2008-03-30 10:38:45 UTC
Pavel wrote:
>3GB of RAM? Try iommu=soft.

What would be the rationale behind that suggestion?

Thanks.
Comment 12 Anonymous Emailer 2008-03-30 10:46:32 UTC
Reply-To: lkml@rtr.ca

Pavel Machek wrote:
> On Sat 2008-03-29 13:15:40, Mark Lord wrote:
>..
>> Just happened again.  The machine resumed from RAM with no functioning USB.
>> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC
>> conflict fixed.
>>
>> WTF?
> 
>> 3GB of RAM? Try iommu=soft.
..

What does that option do (it is not in Documentation/kernel-parameters.txt),
and why do you suspect it may make a difference here?

Thanks
Comment 13 Anonymous Emailer 2008-03-30 11:22:40 UTC
Reply-To: pavel@ucw.cz

On Sun 2008-03-30 13:46:25, Mark Lord wrote:
> Pavel Machek wrote:
>> On Sat 2008-03-29 13:15:40, Mark Lord wrote:
>> ..
>>> Just happened again.  The machine resumed from RAM with no functioning USB.
>>> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC
>>> conflict fixed.
>>>
>>> WTF?
>>
>>> 3GB of RAM? Try iommu=soft.
> ..
>
> What does that option do (it is not in Documentation/kernel-parameters.txt),
> and why do you suspect it may make a difference here?

See x86_64/boot-options.txt . Yes, that should be fixed.

I had strange problems, and it looks like iommu does not have proper
suspend/resume support. > 3GB machines hit that. Do you have one?

									Pavel
Comment 14 Anonymous Emailer 2008-03-30 13:16:25 UTC
Reply-To: lkml@rtr.ca

Pavel Machek wrote:
> On Sun 2008-03-30 13:46:25, Mark Lord wrote:
>> Pavel Machek wrote:
>>> On Sat 2008-03-29 13:15:40, Mark Lord wrote:
>>> ..
>>>> Just happened again.  The machine resumed from RAM with no functioning
>>>> USB.
>>>> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC
>>>> conflict fixed.
>>>>
>>>> WTF?
>>>> 3GB of RAM? Try iommu=soft.
>> ..
>>
>> What does that option do (it is not in Documentation/kernel-parameters.txt),
>> and why do you suspect it may make a difference here?
> 
> See x86_64/boot-options.txt . Yes, that should be fixed.
> 
> I had strange problems, and it looks like iommu does not have proper
> suspend/resume support. > 3GB machines hit that. Do you have one?
..

This is a 32-bit system / kernel, so I'm guessing that this option is not applicable.

Right?  (it does have 3GB of RAM, though).

Cheers
Comment 15 Mark Lord 2008-03-30 13:18:00 UTC
Created attachment 15516 [details]
complete syslog from boot to suspend to resume, plus alt-sysrq-T at the end

Complete gzip'd syslog from boot to suspend to resume, plus alt-sysrq-T at the end.
Comment 16 Mark Lord 2008-03-30 13:18:45 UTC
Created attachment 15517 [details]
Kernel .config file from latest incident, gzip'd.
Comment 17 Anonymous Emailer 2008-03-30 13:20:41 UTC
Reply-To: pavel@ucw.cz

On Sun 2008-03-30 16:15:50, Mark Lord wrote:
> Pavel Machek wrote:
>> On Sun 2008-03-30 13:46:25, Mark Lord wrote:
>>> Pavel Machek wrote:
>>>> On Sat 2008-03-29 13:15:40, Mark Lord wrote:
>>>> ..
>>>>> Just happened again.  The machine resumed from RAM with no functioning
>>>>> USB.
>>>>> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC
>>>>> conflict fixed.
>>>>>
>>>>> WTF?
>>>>> 3GB of RAM? Try iommu=soft.
>>> ..
>>>
>>> What does that option do (it is not in
>>> Documentation/kernel-parameters.txt),
>>> and why do you suspect it may make a difference here?
>>
>> See x86_64/boot-options.txt . Yes, that should be fixed.
>>
>> I had strange problems, and it looks like iommu does not have proper
>> suspend/resume support. > 3GB machines hit that. Do you have one?
> ..
>
> This is a 32-bit system / kernel, so I'm guessing that this option is not
> applicable.
>
> Right?  (it does have 3GB of RAM, though).

Not 100% sure... I've seen strangeness on 32-bit systems, too. Can you
simply pass mem=2G and see what happens?
									Pavel
Comment 18 Mark Lord 2008-03-30 13:21:03 UTC
I've just now attached a syslog.txt.gz, with USB_DEBUG enabled.  This goes from boot to suspend to resume, with a final alt-sysrq-T at the end.  So you can see what things are doing, and what khubd is up to.

There's also a full kernel .config file attached showing the flags used.

Next, I'll unload as much USB stuff as I can, and post another alt-sysrq-T after it hangs on ehci_hcd again.
Comment 19 Anonymous Emailer 2008-03-30 13:23:09 UTC
Reply-To: lkml@rtr.ca

>>>>>> 3GB of RAM? Try iommu=soft.
..
>>
>> This is a 32-bit system / kernel, so I'm guessing that this option is not
>> applicable.
>>
>> Right?  (it does have 3GB of RAM, though).
> 
> Not 100% sure... I've seen strangeness on 32-bit systems, too. Can you
> simply pass mem=2G and see what happens?
..

Sure, after I finish collecting data on the problem, I'll give that a try.
It will be somewhat inconclusive, though, since the problem does not happen
on every resume.  

cheers
Comment 20 Mark Lord 2008-03-30 13:27:47 UTC
Created attachment 15518 [details]
Additional syslog entries + lsmod, after hung rmmod ehci_hcd

Here I've unloaded much of the USB stuff, until rmmod got stuck on ehci_hcd.  This attachment shows the additional syslog messages since the first syslog.txt.gz attachment.  There are a couple of interesting ones in there.

I also did another alt-sysrq-T, to help figure out where USB is getting hung on the rmmod.

And there's a final lsmod at the end.
Comment 21 Alan Stern 2008-03-30 14:28:53 UTC
On Sun, 30 Mar 2008 bugme-daemon@bugzilla.kernel.org wrote:

> Complete gzip'd syslog from boot to suspend to resume, plus alt-sysrq-T at
> the
> end.

This shows exactly where the problem is.  See the khubd task.  In
ehci-hcd, ehci_endpoint_disable() is stuck in a loop waiting for the QH
for ep0 to become idle.  This will happen when the driver receives an
appropriate interrupt or a watchdog timeout -- neither of which appears
to have occurred.

The watchdog timer code has been changed recently, so it's possible 
that the fault is there.  On the other hand, maybe there's something 
wrong with the central timer facility.

I'm not sure what is the best way to debug this.  Can you unplug your 
USB mouse and maybe also the hub it is attached to?  That would leave 
only 1 USB device on your system, making it easier to use printk 
statements for debugging (we wouldn't get swamped with messages arising 
from normal I/O).

If you can, then the places to add some printks are in
drivers/usb/host/ehci.h: iaa_watchdog_start() and iaa_watchdog_end()  
(they start and delete the watchdog timer), and also in
ehci-hcd.c:ehci_iaa_watchdog() (which is the timer's handler routine).  
If it turns out the timer is started but then never fires and is never
deleted, you'll have a smoking gun.

Alan Stern
Comment 22 Mark Lord 2008-03-30 19:38:31 UTC
Mmm.. on the most recent resume-from-RAM, the system was toast.  Not even alt-sysrq-* would work.  Totally hung at black screen of death.

-ml
Comment 23 Mark Lord 2008-03-30 20:10:35 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> 
> If you can, then the places to add some printks are in
> drivers/usb/host/ehci.h: iaa_watchdog_start() and iaa_watchdog_end()  
> (they start and delete the watchdog timer), and also in
> ehci-hcd.c:ehci_iaa_watchdog() (which is the timer's handler routine).  
> If it turns out the timer is started but then never fires and is never
> deleted, you'll have a smoking gun.
..

Well, I kinda like to keep the mouse, so we'll see if it floods the logs.
But the hub is unplugged now, and there's nothing else on USB except the mouse.

Just did a bunch of suspend-resumes, and one in five failed with black screen
of death (no alt-sysrq, no nothing.. 5 seconds on power button to reboot).
Comment 24 Mark Lord 2008-03-30 20:15:23 UTC
Mmm.. in addition to the second instance of the "black screen of death", I've had a couple of resume cyles where the (Kubuntu) desktop background is corrupted.  The most recent has it with the basic background image, but missing may alternate rows of detail.

Something's screwy.
Comment 25 Mark Lord 2008-03-30 21:02:13 UTC
..

Mmm.. yet another B.S.O.D on resume.
Generally, it mostly behaves on short suspend/resume cycles.
But on longer ones ( > 5 minutes?) odds of a failure are quite high now.

I'm going to hack my suspend/resume script to unload the USB modules before suspend,
and reload them on resume.  If it then behaves itself, we know it's the USB subsystem
that's messing up.  Otherwise, we know it's something else.
Comment 26 Mark Lord 2008-03-31 10:19:11 UTC
Cut'n'paste of latest text from the related kernel thread, just so this information stays with the "BUG" here in bugzilla:

Mark Lord wrote:
> Oliver Neukum wrote:
>> Am Montag, 31. März 2008 17:04:46 schrieb Mark Lord:
>>> Oliver Neukum wrote:
>>>> Am Montag, 31. März 2008 16:39:33 schrieb Mark Lord:
..
>>>>> Still want USB_SUSPEND=n ?  Please explain.
>>>> It looks like you are hanging in the kthread for autosuspending.
>>>> Compiling that out should confirm it.
>>> ..
>>>
>>> Okay, and once we see that it works fine:  then what?
>>
>> We'll combine that information with the result of only removing usbhid
>> and arrive at a pretty good idea where in the kernel the hang occurs.
>> There are only two functions that touch autosuspend in the usbhid code.
>> So if it works with usbhid unloaded, either of them should be to blame.
> ..
> It does still hang with *usbhid* unloaded,
> but not if all USB stuff is unloaded.
..

And it does *not* hang with # CONFIG_USB_SUSPEND is not set.

> I have figured out a way to make this much more reproducible now:
> When suspended, the notebook does not supply +5V over USB.
> But with a voltmeter, I discovered that there is sufficient capacitance
> on the USB +5V, that it takes many minutes for the voltage to decay
> from 5.1V down to near 0V.
>
> Resuming while the voltage is still relatively high, generally works.
> Resuming after the voltage drops to near zero, always fails (with USB modules
> loaded).
>
> So I've put a 2Kohm resistor across the USB +5V lines,
> forcing it to decay to zero within about 5 seconds.
> This helps a lot for debugging here.
>
> It probably also provides a vital clue as to what is wrong.
> Resume seems to generally work when the USB devices maintain
> some amount of standby power, and always fails when they don't.
Comment 27 Alan Stern 2008-03-31 12:45:06 UTC
> Resuming while the voltage is still relatively high, generally works.
> Resuming after the voltage drops to near zero, always fails (with USB
> modules$
> 
> So I've put a 2Kohm resistor across the USB +5V lines,
> forcing it to decay to zero within about 5 seconds.   
> This helps a lot for debugging here.
  
You could get the same effect much more simply by unplugging and replugging your USB hub while the computer is suspended.  In fact, you might even be able to provoke the problem just by plugging in a new USB device after the computer has resumed.

> It probably also provides a vital clue as to what is wrong.
> Resume seems to generally work when the USB devices maintain   
> some amount of standby power, and always fails when they don't.

This merely confirms what I wrote before.  The problem occurs when ehci-hcd hangs while trying to remove a QH.
Comment 28 Mark Lord 2008-04-02 07:57:00 UTC
Still busted in 2.6.25-rc8.

New .config and syslog to be attached below.
Comment 29 Mark Lord 2008-04-02 08:01:09 UTC
Created attachment 15568 [details]
Complete syslog of incident.

Full -rc8 syslog from boot, with USB_DEBUG=y, USB_SUSPEND=y.

USB is dead (nothing works) on resume from suspend.
Alt-sysrq-T performed then.

USB modules unloaded one by one.
Hangs on rmmod ehci_hcd.
Another Alt-sysrq-T performed here.

Note that suspend/resume work fine if all USB modules (including ehci_hcd) are rmmod'd before suspend, and then modprobe'd again on resume.

It also works fine with just CONFIG_USB_SUSPEND=n, and sometimes when resuming very soon after the suspend -- I've now got a 2.2Kohm resistor across the USB power pins to more quickly drain capacitance on suspend, making this problem more easily reproduceable.
Comment 30 Mark Lord 2008-04-02 08:02:18 UTC
Created attachment 15569 [details]
.config file used with 2.6.25-rc8

(also nuking old attachments now..)
Comment 31 Rafael J. Wysocki 2008-04-02 10:45:43 UTC
Could you try with

commit e82cc1288fa57857c6af8c57f3d07096d4bcd9d9
Author: David Brownell <david-b@pacbell.net>
Date:   Fri Mar 7 13:49:42 2008 -0800

    USB: fix ehci unlink regressions

reverted?
Comment 32 Rafael J. Wysocki 2008-04-02 15:15:52 UTC
Handled-By : Alan Stern <stern@rowland.harvard.edu>
Handled-By : David Brownell <david-b@pacbell.net>
Patch : http://lkml.org/lkml/2008/4/2/297
Comment 33 Rafael J. Wysocki 2008-04-02 15:28:17 UTC
*** Bug 10291 has been marked as a duplicate of this bug. ***