Subject : 2.6.25-rc7: Ugh. Submitter : Mark Lord <lkml@rtr.ca> Date : 2008-03-27 11:29 References : http://lkml.org/lkml/2008/3/27/228 Handled-By : Greg KH <gregkh@suse.de> This entry is being used for tracking a regression from 2.6.24. Please don't close it until the problem is fixed in the mainline.
Patch: http://lkml.org/lkml/2008/3/28/58
Handled-By : Mark Lord <lkml@rtr.ca> References : http://lkml.org/lkml/2008/3/28/346 Mark said: "Also, for now it seems to be "resolved" via a seemingly-unrelated change in the .config file on that machine. If it dies again, I'll re-post and investigate further."
Reply-To: lkml@rtr.ca bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=10345 > > > rjw@sisk.pl changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Status|NEW |REJECTED > Resolution| |UNREPRODUCIBLE > > > > > ------- Comment #2 from rjw@sisk.pl 2008-03-28 15:35 ------- > Handled-By : Mark Lord <lkml@rtr.ca> > References : http://lkml.org/lkml/2008/3/28/346 > > Mark said: > > "Also, for now it seems to be "resolved" via a seemingly-unrelated > change in the .config file on that machine. If it dies again, > I'll re-post and investigate further." .. Just happened again. The machine resumed from RAM with no functioning USB. WTF?
Reply-To: lkml@rtr.ca bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=10345 > > > rjw@sisk.pl changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Status|NEW |REJECTED > Resolution| |UNREPRODUCIBLE > > > > > ------- Comment #2 from rjw@sisk.pl 2008-03-28 15:35 ------- > Handled-By : Mark Lord <lkml@rtr.ca> > References : http://lkml.org/lkml/2008/3/28/346 > > Mark said: > > "Also, for now it seems to be "resolved" via a seemingly-unrelated > change in the .config file on that machine. If it dies again, > I'll re-post and investigate further." .. Just happened again. The machine resumed from RAM with no functioning USB. Still running the exact same 2.6.25-rc7 kernel as before, with the RTC conflict fixed. WTF?
One would *never* expect RTC activity to affect USB ...
On Sat, 29 Mar 2008, Mark Lord wrote: > Just happened again. The machine resumed from RAM with no functioning USB. > Still running the exact same 2.6.25-rc7 kernel as before, with the RTC > conflict fixed. > > WTF? The first step in debugging this is to build a kernel with CONFIG_USB_DEBUG and send the dmesg log following an unsuccessful resume. Include events from before the suspend; don't strip them out. If the symptoms are the same as before then you should also get a stack dump for the ksuspend_usbd task. Alan Stern
Reply-To: oliver@neukum.org Am Samstag, 29. März 2008 18:15:40 schrieb Mark Lord: > bugme-daemon@bugzilla.kernel.org wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=10345 > > > > > > rjw@sisk.pl changed: > > > > What |Removed |Added > > > ---------------------------------------------------------------------------- > > Status|NEW |REJECTED > > Resolution| |UNREPRODUCIBLE > > > > > > > > > > ------- Comment #2 from rjw@sisk.pl 2008-03-28 15:35 ------- > > Handled-By : Mark Lord <lkml@rtr.ca> > > References : http://lkml.org/lkml/2008/3/28/346 > > > > Mark said: > > > > "Also, for now it seems to be "resolved" via a seemingly-unrelated > > change in the .config file on that machine. If it dies again, > > I'll re-post and investigate further." > .. > > Just happened again. The machine resumed from RAM with no functioning USB. > Still running the exact same 2.6.25-rc7 kernel as before, with the RTC > conflict fixed. > > WTF? Is khubd stuck in D state? Regards Oliver
Reply-To: pavel@ucw.cz On Sat 2008-03-29 13:15:40, Mark Lord wrote: > bugme-daemon@bugzilla.kernel.org wrote: >> http://bugzilla.kernel.org/show_bug.cgi?id=10345 >> >> >> rjw@sisk.pl changed: >> >> What |Removed |Added >> ---------------------------------------------------------------------------- >> Status|NEW |REJECTED >> Resolution| |UNREPRODUCIBLE >> >> >> >> >> ------- Comment #2 from rjw@sisk.pl 2008-03-28 15:35 ------- >> Handled-By : Mark Lord <lkml@rtr.ca> >> References : http://lkml.org/lkml/2008/3/28/346 >> >> Mark said: >> >> "Also, for now it seems to be "resolved" via a seemingly-unrelated >> change in the .config file on that machine. If it dies again, >> I'll re-post and investigate further." > .. > > Just happened again. The machine resumed from RAM with no functioning USB. > Still running the exact same 2.6.25-rc7 kernel as before, with the RTC > conflict fixed. > > WTF? >3GB of RAM? Try iommu=soft. Pavel
Just got it again, after an overnight suspend. It seems much more prone to happening when the suspended interval is long-ish (hours), than when short (minutes). I tried to rmmod all loaded USB modules afterward: got usb-storage, visor, usbserial, and uhci-hcd unloaded just fine. Then tried ehci_hcd, and rmmod hung: rmmod D f5d06dc0 0 4547 4470 f5dacdd4 00000086 f698af20 f5d06dc0 f5d06f0c c283c940 00000001 f6cf2dc0 c012b440 00000000 f698af20 c0373550 f5dacddc c0113ca1 00000000 00000001 7fffffff 7fffffff f5dace38 f5dace14 c029f64f f5dacdfc f6eeece4 00000082 Call Trace: [autoremove_wake_function+15/51] ? autoremove_wake_function+0xf/0x33 [__wake_up_common+47/90] ? __wake_up_common+0x2f/0x5a [schedule_timeout+22/139] schedule_timeout+0x16/0x8b [_spin_lock_irq+43/51] ? _spin_lock_irq+0x2b/0x33 [wait_for_common+202/297] wait_for_common+0xca/0x129 [default_wake_function+0/13] ? default_wake_function+0x0/0xd [wait_for_completion+18/20] wait_for_completion+0x12/0x14 [__cancel_work_timer+281/352] __cancel_work_timer+0x119/0x160 [wq_barrier_func+0/13] ? wq_barrier_func+0x0/0xd [cancel_work_sync+10/12] cancel_work_sync+0xa/0xc [<f88a68b8>] usb_remove_hcd+0x66/0xe1 [usbcore] [<f88b0128>] usb_hcd_pci_remove+0x18/0x6a [usbcore] [pci_device_remove+25/57] pci_device_remove+0x19/0x39 [__device_release_driver+96/125] __device_release_driver+0x60/0x7d [driver_detach+130/193] driver_detach+0x82/0xc1 [bus_remove_driver+99/127] bus_remove_driver+0x63/0x7f [driver_unregister+42/46] driver_unregister+0x2a/0x2e [pci_unregister_driver+33/100] pci_unregister_driver+0x21/0x64 [<f890307a>] ehci_hcd_cleanup+0xd/0xf [ehci_hcd] [sys_delete_module+386/446] sys_delete_module+0x182/0x1be [down_read_trylock+55/65] ? down_read_trylock+0x37/0x41 [up_read+22/43] ? up_read+0x16/0x2b [do_page_fault+584/1218] ? do_page_fault+0x248/0x4c2 [sysenter_past_esp+95/145] sysenter_past_esp+0x5f/0x91 ======================= Will build now with CONFIG_USB_DEBUG (thanks for the suggestion, Alan).
> Is khubd stuck in D state? Dunno -- it didn't show up in the alt-sysrq-T dump before I rebooted, probably because /proc/kmsg overflowed. I'll check next time, and have doubled /proc/kmsg size too.
Pavel wrote: >3GB of RAM? Try iommu=soft. What would be the rationale behind that suggestion? Thanks.
Reply-To: lkml@rtr.ca Pavel Machek wrote: > On Sat 2008-03-29 13:15:40, Mark Lord wrote: >.. >> Just happened again. The machine resumed from RAM with no functioning USB. >> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC >> conflict fixed. >> >> WTF? > >> 3GB of RAM? Try iommu=soft. .. What does that option do (it is not in Documentation/kernel-parameters.txt), and why do you suspect it may make a difference here? Thanks
Reply-To: pavel@ucw.cz On Sun 2008-03-30 13:46:25, Mark Lord wrote: > Pavel Machek wrote: >> On Sat 2008-03-29 13:15:40, Mark Lord wrote: >> .. >>> Just happened again. The machine resumed from RAM with no functioning USB. >>> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC >>> conflict fixed. >>> >>> WTF? >> >>> 3GB of RAM? Try iommu=soft. > .. > > What does that option do (it is not in Documentation/kernel-parameters.txt), > and why do you suspect it may make a difference here? See x86_64/boot-options.txt . Yes, that should be fixed. I had strange problems, and it looks like iommu does not have proper suspend/resume support. > 3GB machines hit that. Do you have one? Pavel
Reply-To: lkml@rtr.ca Pavel Machek wrote: > On Sun 2008-03-30 13:46:25, Mark Lord wrote: >> Pavel Machek wrote: >>> On Sat 2008-03-29 13:15:40, Mark Lord wrote: >>> .. >>>> Just happened again. The machine resumed from RAM with no functioning >>>> USB. >>>> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC >>>> conflict fixed. >>>> >>>> WTF? >>>> 3GB of RAM? Try iommu=soft. >> .. >> >> What does that option do (it is not in Documentation/kernel-parameters.txt), >> and why do you suspect it may make a difference here? > > See x86_64/boot-options.txt . Yes, that should be fixed. > > I had strange problems, and it looks like iommu does not have proper > suspend/resume support. > 3GB machines hit that. Do you have one? .. This is a 32-bit system / kernel, so I'm guessing that this option is not applicable. Right? (it does have 3GB of RAM, though). Cheers
Created attachment 15516 [details] complete syslog from boot to suspend to resume, plus alt-sysrq-T at the end Complete gzip'd syslog from boot to suspend to resume, plus alt-sysrq-T at the end.
Created attachment 15517 [details] Kernel .config file from latest incident, gzip'd.
Reply-To: pavel@ucw.cz On Sun 2008-03-30 16:15:50, Mark Lord wrote: > Pavel Machek wrote: >> On Sun 2008-03-30 13:46:25, Mark Lord wrote: >>> Pavel Machek wrote: >>>> On Sat 2008-03-29 13:15:40, Mark Lord wrote: >>>> .. >>>>> Just happened again. The machine resumed from RAM with no functioning >>>>> USB. >>>>> Still running the exact same 2.6.25-rc7 kernel as before, with the RTC >>>>> conflict fixed. >>>>> >>>>> WTF? >>>>> 3GB of RAM? Try iommu=soft. >>> .. >>> >>> What does that option do (it is not in >>> Documentation/kernel-parameters.txt), >>> and why do you suspect it may make a difference here? >> >> See x86_64/boot-options.txt . Yes, that should be fixed. >> >> I had strange problems, and it looks like iommu does not have proper >> suspend/resume support. > 3GB machines hit that. Do you have one? > .. > > This is a 32-bit system / kernel, so I'm guessing that this option is not > applicable. > > Right? (it does have 3GB of RAM, though). Not 100% sure... I've seen strangeness on 32-bit systems, too. Can you simply pass mem=2G and see what happens? Pavel
I've just now attached a syslog.txt.gz, with USB_DEBUG enabled. This goes from boot to suspend to resume, with a final alt-sysrq-T at the end. So you can see what things are doing, and what khubd is up to. There's also a full kernel .config file attached showing the flags used. Next, I'll unload as much USB stuff as I can, and post another alt-sysrq-T after it hangs on ehci_hcd again.
Reply-To: lkml@rtr.ca >>>>>> 3GB of RAM? Try iommu=soft. .. >> >> This is a 32-bit system / kernel, so I'm guessing that this option is not >> applicable. >> >> Right? (it does have 3GB of RAM, though). > > Not 100% sure... I've seen strangeness on 32-bit systems, too. Can you > simply pass mem=2G and see what happens? .. Sure, after I finish collecting data on the problem, I'll give that a try. It will be somewhat inconclusive, though, since the problem does not happen on every resume. cheers
Created attachment 15518 [details] Additional syslog entries + lsmod, after hung rmmod ehci_hcd Here I've unloaded much of the USB stuff, until rmmod got stuck on ehci_hcd. This attachment shows the additional syslog messages since the first syslog.txt.gz attachment. There are a couple of interesting ones in there. I also did another alt-sysrq-T, to help figure out where USB is getting hung on the rmmod. And there's a final lsmod at the end.
On Sun, 30 Mar 2008 bugme-daemon@bugzilla.kernel.org wrote: > Complete gzip'd syslog from boot to suspend to resume, plus alt-sysrq-T at > the > end. This shows exactly where the problem is. See the khubd task. In ehci-hcd, ehci_endpoint_disable() is stuck in a loop waiting for the QH for ep0 to become idle. This will happen when the driver receives an appropriate interrupt or a watchdog timeout -- neither of which appears to have occurred. The watchdog timer code has been changed recently, so it's possible that the fault is there. On the other hand, maybe there's something wrong with the central timer facility. I'm not sure what is the best way to debug this. Can you unplug your USB mouse and maybe also the hub it is attached to? That would leave only 1 USB device on your system, making it easier to use printk statements for debugging (we wouldn't get swamped with messages arising from normal I/O). If you can, then the places to add some printks are in drivers/usb/host/ehci.h: iaa_watchdog_start() and iaa_watchdog_end() (they start and delete the watchdog timer), and also in ehci-hcd.c:ehci_iaa_watchdog() (which is the timer's handler routine). If it turns out the timer is started but then never fires and is never deleted, you'll have a smoking gun. Alan Stern
Mmm.. on the most recent resume-from-RAM, the system was toast. Not even alt-sysrq-* would work. Totally hung at black screen of death. -ml
bugme-daemon@bugzilla.kernel.org wrote: > > If you can, then the places to add some printks are in > drivers/usb/host/ehci.h: iaa_watchdog_start() and iaa_watchdog_end() > (they start and delete the watchdog timer), and also in > ehci-hcd.c:ehci_iaa_watchdog() (which is the timer's handler routine). > If it turns out the timer is started but then never fires and is never > deleted, you'll have a smoking gun. .. Well, I kinda like to keep the mouse, so we'll see if it floods the logs. But the hub is unplugged now, and there's nothing else on USB except the mouse. Just did a bunch of suspend-resumes, and one in five failed with black screen of death (no alt-sysrq, no nothing.. 5 seconds on power button to reboot).
Mmm.. in addition to the second instance of the "black screen of death", I've had a couple of resume cyles where the (Kubuntu) desktop background is corrupted. The most recent has it with the basic background image, but missing may alternate rows of detail. Something's screwy.
.. Mmm.. yet another B.S.O.D on resume. Generally, it mostly behaves on short suspend/resume cycles. But on longer ones ( > 5 minutes?) odds of a failure are quite high now. I'm going to hack my suspend/resume script to unload the USB modules before suspend, and reload them on resume. If it then behaves itself, we know it's the USB subsystem that's messing up. Otherwise, we know it's something else.
Cut'n'paste of latest text from the related kernel thread, just so this information stays with the "BUG" here in bugzilla: Mark Lord wrote: > Oliver Neukum wrote: >> Am Montag, 31. März 2008 17:04:46 schrieb Mark Lord: >>> Oliver Neukum wrote: >>>> Am Montag, 31. März 2008 16:39:33 schrieb Mark Lord: .. >>>>> Still want USB_SUSPEND=n ? Please explain. >>>> It looks like you are hanging in the kthread for autosuspending. >>>> Compiling that out should confirm it. >>> .. >>> >>> Okay, and once we see that it works fine: then what? >> >> We'll combine that information with the result of only removing usbhid >> and arrive at a pretty good idea where in the kernel the hang occurs. >> There are only two functions that touch autosuspend in the usbhid code. >> So if it works with usbhid unloaded, either of them should be to blame. > .. > It does still hang with *usbhid* unloaded, > but not if all USB stuff is unloaded. .. And it does *not* hang with # CONFIG_USB_SUSPEND is not set. > I have figured out a way to make this much more reproducible now: > When suspended, the notebook does not supply +5V over USB. > But with a voltmeter, I discovered that there is sufficient capacitance > on the USB +5V, that it takes many minutes for the voltage to decay > from 5.1V down to near 0V. > > Resuming while the voltage is still relatively high, generally works. > Resuming after the voltage drops to near zero, always fails (with USB modules > loaded). > > So I've put a 2Kohm resistor across the USB +5V lines, > forcing it to decay to zero within about 5 seconds. > This helps a lot for debugging here. > > It probably also provides a vital clue as to what is wrong. > Resume seems to generally work when the USB devices maintain > some amount of standby power, and always fails when they don't.
> Resuming while the voltage is still relatively high, generally works. > Resuming after the voltage drops to near zero, always fails (with USB > modules$ > > So I've put a 2Kohm resistor across the USB +5V lines, > forcing it to decay to zero within about 5 seconds. > This helps a lot for debugging here. You could get the same effect much more simply by unplugging and replugging your USB hub while the computer is suspended. In fact, you might even be able to provoke the problem just by plugging in a new USB device after the computer has resumed. > It probably also provides a vital clue as to what is wrong. > Resume seems to generally work when the USB devices maintain > some amount of standby power, and always fails when they don't. This merely confirms what I wrote before. The problem occurs when ehci-hcd hangs while trying to remove a QH.
Still busted in 2.6.25-rc8. New .config and syslog to be attached below.
Created attachment 15568 [details] Complete syslog of incident. Full -rc8 syslog from boot, with USB_DEBUG=y, USB_SUSPEND=y. USB is dead (nothing works) on resume from suspend. Alt-sysrq-T performed then. USB modules unloaded one by one. Hangs on rmmod ehci_hcd. Another Alt-sysrq-T performed here. Note that suspend/resume work fine if all USB modules (including ehci_hcd) are rmmod'd before suspend, and then modprobe'd again on resume. It also works fine with just CONFIG_USB_SUSPEND=n, and sometimes when resuming very soon after the suspend -- I've now got a 2.2Kohm resistor across the USB power pins to more quickly drain capacitance on suspend, making this problem more easily reproduceable.
Created attachment 15569 [details] .config file used with 2.6.25-rc8 (also nuking old attachments now..)
Could you try with commit e82cc1288fa57857c6af8c57f3d07096d4bcd9d9 Author: David Brownell <david-b@pacbell.net> Date: Fri Mar 7 13:49:42 2008 -0800 USB: fix ehci unlink regressions reverted?
Handled-By : Alan Stern <stern@rowland.harvard.edu> Handled-By : David Brownell <david-b@pacbell.net> Patch : http://lkml.org/lkml/2008/4/2/297
*** Bug 10291 has been marked as a duplicate of this bug. ***
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cdc647a9b75741659bfc6acc44a6b3a646ad53bf