Bug 9335
Summary: | system hangs after a few minutes | ||
---|---|---|---|
Product: | Other | Reporter: | Marcus Better (marcus) |
Component: | Other | Assignee: | Greg Kroah-Hartman (greg) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | akpm, mingo, protasnb, rjwysocki, stern |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.24-rc2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 9243 | ||
Attachments: |
Kernel configuration
Use spin_lock_irqsave in ohci_irq dmesg output with 2.6.24-rc4 Add the missing IRQF_DISABLED flag |
Description
Marcus Better
2007-11-08 23:15:31 UTC
Created attachment 13475 [details]
Kernel configuration
The CPU hung somewhere. Please try enabling the NMI watchdog, see if that triggers anything. (Add nmi_watchdog=1 to the kernel boot command line) > ------- Comment #2 from akpm@osdl.org 2007-11-09 00:43 -------
> Please try enabling the NMI watchdog, see if that triggers anything.
If I'm in X, will the oops message get lost? Do I need netconsole?
Marcus
Well, please read Documentation/nmi_watchdog.txt in the kernel sources. Still, on your system the NMI watchdog should be enabled by default. Do you see any suspicious messages in dmesg appearing before the box hangs? > ------- Comment #4 from rjwysocki@sisk.pl 2007-11-11 09:57 ------- > Well, please read Documentation/nmi_watchdog.txt in the kernel sources. I have, but I don't see how that answers my question. Anyway the system refuses to boot with nmi_watchdog=1. It freezes at the beginning of the boot process, somewhere after starting udev and loading some modules. The last message is prints is from loading the Bluetooth HCI stuff. > Still, on your system the NMI watchdog should be enabled by default. Do you > see any suspicious messages in dmesg appearing before the box hangs? No, I haven't noticed any. Will try some more. Marcus You can try nmi_watchdog=2, this will set up APIC based watchdog. You can also check for platform errors with mcelog. If its not installed, apt-get install mcelog. It will read fatal errors logged by the kernel. MCE should be configured, and it looks like you have it enabled in your kernel: CONFIG_X86_MCE=y (In reply to comment #6) > You can try nmi_watchdog=2, this will set up APIC based watchdog. That was better, I got the oops: netconsole: network logging started NMI Watchdog detected LOCKUP on CPU 0 CPU 0 Modules linked in: netconsole configfs i915 drm rfcomm l2cap xfrm_user xfrm4_tunnel af_key xfrm4_mode_tunnel nfsd exportfs autofs4 cpufreq_conservative cpufreq_userspace cpufreq_stats cpufreq_powersave rpcsec_gss_krb5 auth_rpcgss tunnel4 ipcomp nfs esp4 lockd ah4 nfs_acl sunrpc deflate zlib_deflate twofish_x86_64 twofish_common camellia serpent blowfish des_generic xcbc sha1_generic crypto_null hmac crypto_hash ppp_async crc_ccitt fuse ipv6 ppp_generic nls_utf8 slhc ntfs xfs pl2303 option usbserial kqemu coretemp cpufreq_ondemand acpi_cpufreq freq_table snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_seq_device arc4 ecb ohci_hcd snd_hda_intel snd_pcm_oss snd_mixer_oss iwl3945 snd_pcm pcmcia snd_timer firmware_class snd thinkpad_acpi mousedev mac80211 hci_usb soundcore hwmon serio_raw snd_page_alloc bluetooth video backlight output yenta_socket rsrc_nonstatic pcmcia_core button i2c_i801 cfg80211 pcspkr iTCO_wdt nvram rtc psmouse evdev ext3 jbd mbcache sha256_generic aes_x86_64 aes_generic cbc blkcipher dm_crypt dm_mirror dm_snapshot dm_mod firewire_ohci firewire_core crc_itu_t uhci_hcd ehci_hcd usbcore tg3 sr_mod cdrom sd_mod thermal processor fan Pid: 0, comm: swapper Not tainted 2.6.24-rc2-melech #3 RIP: 0010:[<ffffffff804876c9>] [<ffffffff804876c9>] _spin_lock+0x59/0x70 RSP: 0018:ffffffff805ecc18 EFLAGS: 00000002 RAX: 0000000000000001 RBX: ffffffff8807afd0 RCX: ffff81000113c8b8 RDX: 0000000000020004 RSI: ffff810004af0028 RDI: 0000000000000001 RBP: ffff81000427cd48 R08: ffff810005a713e0 R09: 00000000ffffff8d R10: 0000000000000000 R11: ffff810003add80c R12: ffff8100042e2780 R13: ffff810004af0028 R14: ffff81000427cc00 R15: ffff8100042e27b0 FS: 0000000000000000(0000) GS:ffffffff80591000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00002ae0240a2a08 CR3: 000000000b81a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff805a2000, task ffffffff8055b340) Stack: ffff810004af0028 ffffffff88058d60 ffff81000427cdd0 ffffffff8808cbea 00000000042a9940 00000000030007ff ffff8100042e2780 ffff8100042a9ac0 ffff810004af0028 ffff810005a713e8 ffff81000427cd48 ffffffff8808d15a Call Trace: <IRQ> [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40 [<ffffffff8808cbea>] :uhci_hcd:uhci_giveback_urb+0x9a/0x220 [<ffffffff8808d15a>] :uhci_hcd:uhci_scan_schedule+0x29a/0x990 [<ffffffff8808f77e>] :uhci_hcd:uhci_irq+0xbe/0x1a0 [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60 [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70 [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110 [<ffffffff8020f9db>] do_IRQ+0x7b/0x100 [<ffffffff8020c921>] ret_from_intr+0x0/0xa [<ffffffff8048768e>] _spin_lock+0x1e/0x70 [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40 [<ffffffff882d363b>] :ohci_hcd:finish_urb+0x5b/0xe0 [<ffffffff882d37ae>] :ohci_hcd:takeback_td+0xee/0x110 [<ffffffff882d38ac>] :ohci_hcd:dl_done_list+0xdc/0x170 [<ffffffff882d6452>] :ohci_hcd:ohci_irq+0x1e2/0x370 [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60 [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70 [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110 [<ffffffff8020f9db>] do_IRQ+0x7b/0x100 [<ffffffff8020c921>] ret_from_intr+0x0/0xa <EOI> [<ffffffff880052c0>] :processor:acpi_processor_idle+0x2e0/0x4c0 [<ffffffff880052bc>] :processor:acpi_processor_idle+0x2dc/0x4c0 [<ffffffff88004fe0>] :processor:acpi_processor_idle+0x0/0x4c0 [<ffffffff8020af50>] default_idle+0x0/0x40 [<ffffffff8020aff7>] cpu_idle+0x67/0xd0 [<ffffffff805aabba>] start_kernel+0x2aa/0x330 [<ffffffff805aa117>] _sinittext+0x117/0x120 Code: 8b 03 85 c0 7e f1 eb a3 e8 ba e4 ff ff eb d4 0f 1f 84 00 00 Kernel panic - not syncing: Aiee, killing interrupt handler! Update: It crashes after at most half an hour's normal use or so. The screen doesn't go black anymore, but everything freezes (caps lock LED blinking). Reply-To: akpm@linux-foundation.org On Wed, 21 Nov 2007 13:02:58 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=9335 > This USB crash is a post-2.6.23 regression. > > marcus@better.se changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Summary|system hangs with blank |system hangs after a few > |screen after some time |minutes > > > > > ------- Comment #8 from marcus@better.se 2007-11-21 13:02 ------- > (In reply to comment #6) > > You can try nmi_watchdog=2, this will set up APIC based watchdog. > > That was better, I got the oops: > > netconsole: network logging started > NMI Watchdog detected LOCKUP on CPU 0 > CPU 0 > Modules linked in: netconsole configfs i915 drm rfcomm l2cap xfrm_user > xfrm4_tunnel af_key xfrm4_mode_tunnel nfsd exportfs autofs4 > cpufreq_conservative cpufreq_userspace cpufreq_stats cpufreq_powersave > rpcsec_gss_krb5 auth_rpcgss tunnel4 ipcomp nfs esp4 lockd ah4 nfs_acl sunrpc > deflate zlib_deflate twofish_x86_64 twofish_common camellia serpent blowfish > des_generic xcbc sha1_generic crypto_null hmac crypto_hash ppp_async > crc_ccitt > fuse ipv6 ppp_generic nls_utf8 slhc ntfs xfs pl2303 option usbserial kqemu > coretemp cpufreq_ondemand acpi_cpufreq freq_table snd_seq_dummy snd_seq_oss > snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_seq_device arc4 ecb > ohci_hcd snd_hda_intel snd_pcm_oss snd_mixer_oss iwl3945 snd_pcm pcmcia > snd_timer firmware_class snd thinkpad_acpi mousedev mac80211 hci_usb > soundcore > hwmon serio_raw snd_page_alloc bluetooth video backlight output yenta_socket > rsrc_nonstatic pcmcia_core button i2c_i801 cfg80211 pcspkr iTCO_wdt nvram rtc > psmouse evdev ext3 jbd mbcache sha256_generic aes_x86_64 aes_generic cbc > blkcipher dm_crypt dm_mirror dm_snapshot dm_mod firewire_ohci firewire_core > crc_itu_t uhci_hcd ehci_hcd usbcore tg3 sr_mod cdrom sd_mod thermal processor > fan > Pid: 0, comm: swapper Not tainted 2.6.24-rc2-melech #3 > RIP: 0010:[<ffffffff804876c9>] [<ffffffff804876c9>] _spin_lock+0x59/0x70 > RSP: 0018:ffffffff805ecc18 EFLAGS: 00000002 > RAX: 0000000000000001 RBX: ffffffff8807afd0 RCX: ffff81000113c8b8 > RDX: 0000000000020004 RSI: ffff810004af0028 RDI: 0000000000000001 > RBP: ffff81000427cd48 R08: ffff810005a713e0 R09: 00000000ffffff8d > R10: 0000000000000000 R11: ffff810003add80c R12: ffff8100042e2780 > R13: ffff810004af0028 R14: ffff81000427cc00 R15: ffff8100042e27b0 > FS: 0000000000000000(0000) GS:ffffffff80591000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > CR2: 00002ae0240a2a08 CR3: 000000000b81a000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process swapper (pid: 0, threadinfo ffffffff805a2000, task ffffffff8055b340) > Stack: ffff810004af0028 ffffffff88058d60 ffff81000427cdd0 ffffffff8808cbea > 00000000042a9940 00000000030007ff ffff8100042e2780 ffff8100042a9ac0 > ffff810004af0028 ffff810005a713e8 ffff81000427cd48 ffffffff8808d15a > Call Trace: > <IRQ> [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40 > [<ffffffff8808cbea>] :uhci_hcd:uhci_giveback_urb+0x9a/0x220 > [<ffffffff8808d15a>] :uhci_hcd:uhci_scan_schedule+0x29a/0x990 > [<ffffffff8808f77e>] :uhci_hcd:uhci_irq+0xbe/0x1a0 > [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60 > [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70 > [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110 > [<ffffffff8020f9db>] do_IRQ+0x7b/0x100 > [<ffffffff8020c921>] ret_from_intr+0x0/0xa > [<ffffffff8048768e>] _spin_lock+0x1e/0x70 > [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40 > [<ffffffff882d363b>] :ohci_hcd:finish_urb+0x5b/0xe0 > [<ffffffff882d37ae>] :ohci_hcd:takeback_td+0xee/0x110 > [<ffffffff882d38ac>] :ohci_hcd:dl_done_list+0xdc/0x170 > [<ffffffff882d6452>] :ohci_hcd:ohci_irq+0x1e2/0x370 > [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60 > [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70 > [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110 > [<ffffffff8020f9db>] do_IRQ+0x7b/0x100 > [<ffffffff8020c921>] ret_from_intr+0x0/0xa > <EOI> [<ffffffff880052c0>] :processor:acpi_processor_idle+0x2e0/0x4c0 > [<ffffffff880052bc>] :processor:acpi_processor_idle+0x2dc/0x4c0 > [<ffffffff88004fe0>] :processor:acpi_processor_idle+0x0/0x4c0 > [<ffffffff8020af50>] default_idle+0x0/0x40 > [<ffffffff8020aff7>] cpu_idle+0x67/0xd0 > [<ffffffff805aabba>] start_kernel+0x2aa/0x330 > [<ffffffff805aa117>] _sinittext+0x117/0x120 > > > Code: 8b 03 85 c0 7e f1 eb a3 e8 ba e4 ff ff eb d4 0f 1f 84 00 00 > Kernel panic - not syncing: Aiee, killing interrupt handler! > > > Update: It crashes after at most half an hour's normal use or so. The screen > doesn't go black anymore, but everything freezes (caps lock LED blinking). > I don't have time to make a patch right now. However I think you'll find that converting all the spin_lock() and spin_unlock() calls in drivers/usb/host/ohci-hcd.c:ohci_irq() to spin_lock_irqsave() and spin_unlock_irqrestore() will solve the problem. A similar change is probably needed in ehci-hcd.c:ehci_irq(). Created attachment 13762 [details]
Use spin_lock_irqsave in ohci_irq
Here's the patch I promised. See if it helps.
Also, booting with CONFIG_DEBUG_KERNEL=y and CONFIG_PROVE_LOCKING=y would probably have reported this bug before the lockup, making analysis of this bug quite a bit easier. Perhaps we should turn lockdep on by default in -rc's? (In reply to comment #12) > Also, booting with CONFIG_DEBUG_KERNEL=y and CONFIG_PROVE_LOCKING=y would > probably have reported this bug before the lockup, Those options turned up the following: BUG: MAX_LOCK_DEPTH too low! turning off the locking correctness validator. This is with 2.6.24-rc4 (with the patch). Created attachment 13856 [details]
dmesg output with 2.6.24-rc4
Full dmesg output with the lockdep BUG message.
> ------- Comment #11 from stern@rowland.harvard.edu 2007-11-26 08:37 > Here's the patch I promised. See if it helps. I'm testing it with 2.6.23-rc4. So far so good after two and a half hours. Still working after 8 hours. Looks good. Created attachment 13871 [details]
Add the missing IRQF_DISABLED flag
Try using this patch instead of the earlier one. It's a simpler and more general fix.
> ------- Comment #17 from stern@rowland.harvard.edu 2007-12-05 11:50 -------
> Try using this patch instead of the earlier one. It's a simpler and more
> general fix.
Seems to work, it's running for four hours now.
Marcus
(In reply to comment #17) > Add the missing IRQF_DISABLED flag > > Try using this patch instead of the earlier one. It's a simpler and more > general fix. It seems to work, no problems so far. The patch has been submitted for 2.6.24. You can go ahead and close out this bug report. Fixed by: commit 442258e2ff69276ff767f3703b30ce6a31fdd181 Author: Alan Stern <stern@rowland.harvard.edu> Date: Thu Dec 6 14:47:08 2007 -0500 USB: use IRQF_DISABLED for HCD interrupt handlers http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=442258e2ff69276ff767f3703b30ce6a31fdd181 |