Most recent kernel where this bug did not occur: 2.6.23 Distribution: Debian Hardware Environment: Thinkpad R60, Intel Core 2 Duo Software Environment: x86_64 kernel, XFS, X.org, KDE. Problem Description: When using the system for some time, usually at most a few hours, it suddenly hangs completely, the screen goes black, and it can only be reset with the power switch. The fan is still spinning however and the system seems to generate heat as if it were doing something CPU-intensive. This happens consistently but at seemingly random times. It's a desktop system, used for some browsing and e-mail mostly.
Created attachment 13475 [details] Kernel configuration
The CPU hung somewhere. Please try enabling the NMI watchdog, see if that triggers anything. (Add nmi_watchdog=1 to the kernel boot command line)
> ------- Comment #2 from akpm@osdl.org 2007-11-09 00:43 ------- > Please try enabling the NMI watchdog, see if that triggers anything. If I'm in X, will the oops message get lost? Do I need netconsole? Marcus
Well, please read Documentation/nmi_watchdog.txt in the kernel sources. Still, on your system the NMI watchdog should be enabled by default. Do you see any suspicious messages in dmesg appearing before the box hangs?
> ------- Comment #4 from rjwysocki@sisk.pl 2007-11-11 09:57 ------- > Well, please read Documentation/nmi_watchdog.txt in the kernel sources. I have, but I don't see how that answers my question. Anyway the system refuses to boot with nmi_watchdog=1. It freezes at the beginning of the boot process, somewhere after starting udev and loading some modules. The last message is prints is from loading the Bluetooth HCI stuff. > Still, on your system the NMI watchdog should be enabled by default. Do you > see any suspicious messages in dmesg appearing before the box hangs? No, I haven't noticed any. Will try some more. Marcus
You can try nmi_watchdog=2, this will set up APIC based watchdog.
You can also check for platform errors with mcelog. If its not installed, apt-get install mcelog. It will read fatal errors logged by the kernel. MCE should be configured, and it looks like you have it enabled in your kernel: CONFIG_X86_MCE=y
(In reply to comment #6) > You can try nmi_watchdog=2, this will set up APIC based watchdog. That was better, I got the oops: netconsole: network logging started NMI Watchdog detected LOCKUP on CPU 0 CPU 0 Modules linked in: netconsole configfs i915 drm rfcomm l2cap xfrm_user xfrm4_tunnel af_key xfrm4_mode_tunnel nfsd exportfs autofs4 cpufreq_conservative cpufreq_userspace cpufreq_stats cpufreq_powersave rpcsec_gss_krb5 auth_rpcgss tunnel4 ipcomp nfs esp4 lockd ah4 nfs_acl sunrpc deflate zlib_deflate twofish_x86_64 twofish_common camellia serpent blowfish des_generic xcbc sha1_generic crypto_null hmac crypto_hash ppp_async crc_ccitt fuse ipv6 ppp_generic nls_utf8 slhc ntfs xfs pl2303 option usbserial kqemu coretemp cpufreq_ondemand acpi_cpufreq freq_table snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_seq_device arc4 ecb ohci_hcd snd_hda_intel snd_pcm_oss snd_mixer_oss iwl3945 snd_pcm pcmcia snd_timer firmware_class snd thinkpad_acpi mousedev mac80211 hci_usb soundcore hwmon serio_raw snd_page_alloc bluetooth video backlight output yenta_socket rsrc_nonstatic pcmcia_core button i2c_i801 cfg80211 pcspkr iTCO_wdt nvram rtc psmouse evdev ext3 jbd mbcache sha256_generic aes_x86_64 aes_generic cbc blkcipher dm_crypt dm_mirror dm_snapshot dm_mod firewire_ohci firewire_core crc_itu_t uhci_hcd ehci_hcd usbcore tg3 sr_mod cdrom sd_mod thermal processor fan Pid: 0, comm: swapper Not tainted 2.6.24-rc2-melech #3 RIP: 0010:[<ffffffff804876c9>] [<ffffffff804876c9>] _spin_lock+0x59/0x70 RSP: 0018:ffffffff805ecc18 EFLAGS: 00000002 RAX: 0000000000000001 RBX: ffffffff8807afd0 RCX: ffff81000113c8b8 RDX: 0000000000020004 RSI: ffff810004af0028 RDI: 0000000000000001 RBP: ffff81000427cd48 R08: ffff810005a713e0 R09: 00000000ffffff8d R10: 0000000000000000 R11: ffff810003add80c R12: ffff8100042e2780 R13: ffff810004af0028 R14: ffff81000427cc00 R15: ffff8100042e27b0 FS: 0000000000000000(0000) GS:ffffffff80591000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00002ae0240a2a08 CR3: 000000000b81a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff805a2000, task ffffffff8055b340) Stack: ffff810004af0028 ffffffff88058d60 ffff81000427cdd0 ffffffff8808cbea 00000000042a9940 00000000030007ff ffff8100042e2780 ffff8100042a9ac0 ffff810004af0028 ffff810005a713e8 ffff81000427cd48 ffffffff8808d15a Call Trace: <IRQ> [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40 [<ffffffff8808cbea>] :uhci_hcd:uhci_giveback_urb+0x9a/0x220 [<ffffffff8808d15a>] :uhci_hcd:uhci_scan_schedule+0x29a/0x990 [<ffffffff8808f77e>] :uhci_hcd:uhci_irq+0xbe/0x1a0 [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60 [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70 [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110 [<ffffffff8020f9db>] do_IRQ+0x7b/0x100 [<ffffffff8020c921>] ret_from_intr+0x0/0xa [<ffffffff8048768e>] _spin_lock+0x1e/0x70 [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40 [<ffffffff882d363b>] :ohci_hcd:finish_urb+0x5b/0xe0 [<ffffffff882d37ae>] :ohci_hcd:takeback_td+0xee/0x110 [<ffffffff882d38ac>] :ohci_hcd:dl_done_list+0xdc/0x170 [<ffffffff882d6452>] :ohci_hcd:ohci_irq+0x1e2/0x370 [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60 [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70 [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110 [<ffffffff8020f9db>] do_IRQ+0x7b/0x100 [<ffffffff8020c921>] ret_from_intr+0x0/0xa <EOI> [<ffffffff880052c0>] :processor:acpi_processor_idle+0x2e0/0x4c0 [<ffffffff880052bc>] :processor:acpi_processor_idle+0x2dc/0x4c0 [<ffffffff88004fe0>] :processor:acpi_processor_idle+0x0/0x4c0 [<ffffffff8020af50>] default_idle+0x0/0x40 [<ffffffff8020aff7>] cpu_idle+0x67/0xd0 [<ffffffff805aabba>] start_kernel+0x2aa/0x330 [<ffffffff805aa117>] _sinittext+0x117/0x120 Code: 8b 03 85 c0 7e f1 eb a3 e8 ba e4 ff ff eb d4 0f 1f 84 00 00 Kernel panic - not syncing: Aiee, killing interrupt handler! Update: It crashes after at most half an hour's normal use or so. The screen doesn't go black anymore, but everything freezes (caps lock LED blinking).
Reply-To: akpm@linux-foundation.org On Wed, 21 Nov 2007 13:02:58 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=9335 > This USB crash is a post-2.6.23 regression. > > marcus@better.se changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Summary|system hangs with blank |system hangs after a few > |screen after some time |minutes > > > > > ------- Comment #8 from marcus@better.se 2007-11-21 13:02 ------- > (In reply to comment #6) > > You can try nmi_watchdog=2, this will set up APIC based watchdog. > > That was better, I got the oops: > > netconsole: network logging started > NMI Watchdog detected LOCKUP on CPU 0 > CPU 0 > Modules linked in: netconsole configfs i915 drm rfcomm l2cap xfrm_user > xfrm4_tunnel af_key xfrm4_mode_tunnel nfsd exportfs autofs4 > cpufreq_conservative cpufreq_userspace cpufreq_stats cpufreq_powersave > rpcsec_gss_krb5 auth_rpcgss tunnel4 ipcomp nfs esp4 lockd ah4 nfs_acl sunrpc > deflate zlib_deflate twofish_x86_64 twofish_common camellia serpent blowfish > des_generic xcbc sha1_generic crypto_null hmac crypto_hash ppp_async > crc_ccitt > fuse ipv6 ppp_generic nls_utf8 slhc ntfs xfs pl2303 option usbserial kqemu > coretemp cpufreq_ondemand acpi_cpufreq freq_table snd_seq_dummy snd_seq_oss > snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_seq_device arc4 ecb > ohci_hcd snd_hda_intel snd_pcm_oss snd_mixer_oss iwl3945 snd_pcm pcmcia > snd_timer firmware_class snd thinkpad_acpi mousedev mac80211 hci_usb > soundcore > hwmon serio_raw snd_page_alloc bluetooth video backlight output yenta_socket > rsrc_nonstatic pcmcia_core button i2c_i801 cfg80211 pcspkr iTCO_wdt nvram rtc > psmouse evdev ext3 jbd mbcache sha256_generic aes_x86_64 aes_generic cbc > blkcipher dm_crypt dm_mirror dm_snapshot dm_mod firewire_ohci firewire_core > crc_itu_t uhci_hcd ehci_hcd usbcore tg3 sr_mod cdrom sd_mod thermal processor > fan > Pid: 0, comm: swapper Not tainted 2.6.24-rc2-melech #3 > RIP: 0010:[<ffffffff804876c9>] [<ffffffff804876c9>] _spin_lock+0x59/0x70 > RSP: 0018:ffffffff805ecc18 EFLAGS: 00000002 > RAX: 0000000000000001 RBX: ffffffff8807afd0 RCX: ffff81000113c8b8 > RDX: 0000000000020004 RSI: ffff810004af0028 RDI: 0000000000000001 > RBP: ffff81000427cd48 R08: ffff810005a713e0 R09: 00000000ffffff8d > R10: 0000000000000000 R11: ffff810003add80c R12: ffff8100042e2780 > R13: ffff810004af0028 R14: ffff81000427cc00 R15: ffff8100042e27b0 > FS: 0000000000000000(0000) GS:ffffffff80591000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 00002ae0240a2a08 CR3: 000000000b81a000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process swapper (pid: 0, threadinfo ffffffff805a2000, task ffffffff8055b340) > Stack: ffff810004af0028 ffffffff88058d60 ffff81000427cdd0 ffffffff8808cbea > 00000000042a9940 00000000030007ff ffff8100042e2780 ffff8100042a9ac0 > ffff810004af0028 ffff810005a713e8 ffff81000427cd48 ffffffff8808d15a > Call Trace: > <IRQ> [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40 > [<ffffffff8808cbea>] :uhci_hcd:uhci_giveback_urb+0x9a/0x220 > [<ffffffff8808d15a>] :uhci_hcd:uhci_scan_schedule+0x29a/0x990 > [<ffffffff8808f77e>] :uhci_hcd:uhci_irq+0xbe/0x1a0 > [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60 > [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70 > [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110 > [<ffffffff8020f9db>] do_IRQ+0x7b/0x100 > [<ffffffff8020c921>] ret_from_intr+0x0/0xa > [<ffffffff8048768e>] _spin_lock+0x1e/0x70 > [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40 > [<ffffffff882d363b>] :ohci_hcd:finish_urb+0x5b/0xe0 > [<ffffffff882d37ae>] :ohci_hcd:takeback_td+0xee/0x110 > [<ffffffff882d38ac>] :ohci_hcd:dl_done_list+0xdc/0x170 > [<ffffffff882d6452>] :ohci_hcd:ohci_irq+0x1e2/0x370 > [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60 > [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70 > [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110 > [<ffffffff8020f9db>] do_IRQ+0x7b/0x100 > [<ffffffff8020c921>] ret_from_intr+0x0/0xa > <EOI> [<ffffffff880052c0>] :processor:acpi_processor_idle+0x2e0/0x4c0 > [<ffffffff880052bc>] :processor:acpi_processor_idle+0x2dc/0x4c0 > [<ffffffff88004fe0>] :processor:acpi_processor_idle+0x0/0x4c0 > [<ffffffff8020af50>] default_idle+0x0/0x40 > [<ffffffff8020aff7>] cpu_idle+0x67/0xd0 > [<ffffffff805aabba>] start_kernel+0x2aa/0x330 > [<ffffffff805aa117>] _sinittext+0x117/0x120 > > > Code: 8b 03 85 c0 7e f1 eb a3 e8 ba e4 ff ff eb d4 0f 1f 84 00 00 > Kernel panic - not syncing: Aiee, killing interrupt handler! > > > Update: It crashes after at most half an hour's normal use or so. The screen > doesn't go black anymore, but everything freezes (caps lock LED blinking). >
I don't have time to make a patch right now. However I think you'll find that converting all the spin_lock() and spin_unlock() calls in drivers/usb/host/ohci-hcd.c:ohci_irq() to spin_lock_irqsave() and spin_unlock_irqrestore() will solve the problem. A similar change is probably needed in ehci-hcd.c:ehci_irq().
Created attachment 13762 [details] Use spin_lock_irqsave in ohci_irq Here's the patch I promised. See if it helps.
Also, booting with CONFIG_DEBUG_KERNEL=y and CONFIG_PROVE_LOCKING=y would probably have reported this bug before the lockup, making analysis of this bug quite a bit easier. Perhaps we should turn lockdep on by default in -rc's?
(In reply to comment #12) > Also, booting with CONFIG_DEBUG_KERNEL=y and CONFIG_PROVE_LOCKING=y would > probably have reported this bug before the lockup, Those options turned up the following: BUG: MAX_LOCK_DEPTH too low! turning off the locking correctness validator. This is with 2.6.24-rc4 (with the patch).
Created attachment 13856 [details] dmesg output with 2.6.24-rc4 Full dmesg output with the lockdep BUG message.
> ------- Comment #11 from stern@rowland.harvard.edu 2007-11-26 08:37 > Here's the patch I promised. See if it helps. I'm testing it with 2.6.23-rc4. So far so good after two and a half hours.
Still working after 8 hours. Looks good.
Created attachment 13871 [details] Add the missing IRQF_DISABLED flag Try using this patch instead of the earlier one. It's a simpler and more general fix.
> ------- Comment #17 from stern@rowland.harvard.edu 2007-12-05 11:50 ------- > Try using this patch instead of the earlier one. It's a simpler and more > general fix. Seems to work, it's running for four hours now. Marcus
(In reply to comment #17) > Add the missing IRQF_DISABLED flag > > Try using this patch instead of the earlier one. It's a simpler and more > general fix. It seems to work, no problems so far.
The patch has been submitted for 2.6.24. You can go ahead and close out this bug report.
Fixed by: commit 442258e2ff69276ff767f3703b30ce6a31fdd181 Author: Alan Stern <stern@rowland.harvard.edu> Date: Thu Dec 6 14:47:08 2007 -0500 USB: use IRQF_DISABLED for HCD interrupt handlers http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=442258e2ff69276ff767f3703b30ce6a31fdd181