Bug 9335 - system hangs after a few minutes
Summary: system hangs after a few minutes
Status: CLOSED CODE_FIX
Alias: None
Product: Other
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Greg Kroah-Hartman
URL:
Keywords:
Depends on:
Blocks: 9243
  Show dependency tree
 
Reported: 2007-11-08 23:15 UTC by Marcus Better
Modified: 2007-12-17 14:29 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.6.24-rc2
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Kernel configuration (51.56 KB, text/plain)
2007-11-08 23:18 UTC, Marcus Better
Details
Use spin_lock_irqsave in ohci_irq (2.40 KB, patch)
2007-11-26 08:37 UTC, Alan Stern
Details | Diff
dmesg output with 2.6.24-rc4 (28.63 KB, text/plain)
2007-12-05 00:32 UTC, Marcus Better
Details
Add the missing IRQF_DISABLED flag (376 bytes, patch)
2007-12-05 11:50 UTC, Alan Stern
Details | Diff

Description Marcus Better 2007-11-08 23:15:31 UTC
Most recent kernel where this bug did not occur: 2.6.23
Distribution: Debian
Hardware Environment: Thinkpad R60, Intel Core 2 Duo
Software Environment: x86_64 kernel, XFS, X.org, KDE.
Problem Description:

When using the system for some time, usually at most a few hours, it suddenly hangs completely, the screen goes black, and it can only be reset with the power switch. The fan is still spinning however and the system seems to generate heat as if it were doing something CPU-intensive.

This happens consistently but at seemingly random times. It's a desktop system, used for some browsing and e-mail mostly.
Comment 1 Marcus Better 2007-11-08 23:18:30 UTC
Created attachment 13475 [details]
Kernel configuration
Comment 2 Andrew Morton 2007-11-09 00:43:09 UTC
The CPU hung somewhere.

Please try enabling the NMI watchdog, see if that triggers anything.

(Add nmi_watchdog=1 to the kernel boot command line)
Comment 3 Marcus Better 2007-11-11 07:28:56 UTC
> ------- Comment #2 from akpm@osdl.org  2007-11-09 00:43 -------
> Please try enabling the NMI watchdog, see if that triggers anything.

If I'm in X, will the oops message get lost? Do I need netconsole?

Marcus
Comment 4 Rafael J. Wysocki 2007-11-11 09:57:49 UTC
Well, please read Documentation/nmi_watchdog.txt in the kernel sources.

Still, on your system the NMI watchdog should be enabled by default.  Do you see any suspicious messages in dmesg appearing before the box hangs?
Comment 5 Marcus Better 2007-11-11 17:17:00 UTC
> ------- Comment #4 from rjwysocki@sisk.pl  2007-11-11 09:57 -------
> Well, please read Documentation/nmi_watchdog.txt in the kernel sources.

I have, but I don't see how that answers my question.

Anyway the system refuses to boot with nmi_watchdog=1. It freezes at the 
beginning of the boot process, somewhere after starting udev and loading 
some modules. The last message is prints is from loading the Bluetooth 
HCI stuff.

> Still, on your system the NMI watchdog should be enabled by default.  Do you
> see any suspicious messages in dmesg appearing before the box hangs?

No, I haven't noticed any. Will try some more.

Marcus
Comment 6 Natalie Protasevich 2007-11-12 12:54:07 UTC
You can try nmi_watchdog=2, this will set up APIC based watchdog.
Comment 7 Natalie Protasevich 2007-11-17 09:30:12 UTC
You can also check for platform errors with mcelog.
If its not installed, apt-get install mcelog. It will read fatal errors logged by the kernel. MCE should be configured, and it looks like you have it enabled in your kernel:
CONFIG_X86_MCE=y
Comment 8 Marcus Better 2007-11-21 13:02:56 UTC
(In reply to comment #6)
> You can try nmi_watchdog=2, this will set up APIC based watchdog.

That was better, I got the oops:

netconsole: network logging started
NMI Watchdog detected LOCKUP on CPU 0
CPU 0
Modules linked in: netconsole configfs i915 drm rfcomm l2cap xfrm_user xfrm4_tunnel af_key xfrm4_mode_tunnel nfsd exportfs autofs4 cpufreq_conservative cpufreq_userspace cpufreq_stats cpufreq_powersave rpcsec_gss_krb5 auth_rpcgss tunnel4 ipcomp nfs esp4 lockd ah4 nfs_acl sunrpc deflate zlib_deflate twofish_x86_64 twofish_common camellia serpent blowfish des_generic xcbc sha1_generic crypto_null hmac crypto_hash ppp_async crc_ccitt fuse ipv6 ppp_generic nls_utf8 slhc ntfs xfs pl2303 option usbserial kqemu coretemp cpufreq_ondemand acpi_cpufreq freq_table snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_seq_device arc4 ecb ohci_hcd snd_hda_intel snd_pcm_oss snd_mixer_oss iwl3945 snd_pcm pcmcia snd_timer firmware_class snd thinkpad_acpi mousedev mac80211 hci_usb soundcore hwmon serio_raw snd_page_alloc bluetooth video backlight output yenta_socket rsrc_nonstatic pcmcia_core button i2c_i801 cfg80211 pcspkr iTCO_wdt nvram rtc psmouse evdev ext3 jbd mbcache sha256_generic aes_x86_64 aes_generic cbc blkcipher dm_crypt dm_mirror dm_snapshot dm_mod firewire_ohci firewire_core crc_itu_t uhci_hcd ehci_hcd usbcore tg3 sr_mod cdrom sd_mod thermal processor fan
Pid: 0, comm: swapper Not tainted 2.6.24-rc2-melech #3
RIP: 0010:[<ffffffff804876c9>]  [<ffffffff804876c9>] _spin_lock+0x59/0x70
RSP: 0018:ffffffff805ecc18  EFLAGS: 00000002
RAX: 0000000000000001 RBX: ffffffff8807afd0 RCX: ffff81000113c8b8
RDX: 0000000000020004 RSI: ffff810004af0028 RDI: 0000000000000001
RBP: ffff81000427cd48 R08: ffff810005a713e0 R09: 00000000ffffff8d
R10: 0000000000000000 R11: ffff810003add80c R12: ffff8100042e2780
R13: ffff810004af0028 R14: ffff81000427cc00 R15: ffff8100042e27b0
FS:  0000000000000000(0000) GS:ffffffff80591000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002ae0240a2a08 CR3: 000000000b81a000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff805a2000, task ffffffff8055b340)
Stack:  ffff810004af0028 ffffffff88058d60 ffff81000427cdd0 ffffffff8808cbea
 00000000042a9940 00000000030007ff ffff8100042e2780 ffff8100042a9ac0
 ffff810004af0028 ffff810005a713e8 ffff81000427cd48 ffffffff8808d15a
Call Trace:
 <IRQ>  [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40
 [<ffffffff8808cbea>] :uhci_hcd:uhci_giveback_urb+0x9a/0x220
 [<ffffffff8808d15a>] :uhci_hcd:uhci_scan_schedule+0x29a/0x990
 [<ffffffff8808f77e>] :uhci_hcd:uhci_irq+0xbe/0x1a0
 [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60
 [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70
 [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110
 [<ffffffff8020f9db>] do_IRQ+0x7b/0x100
 [<ffffffff8020c921>] ret_from_intr+0x0/0xa
 [<ffffffff8048768e>] _spin_lock+0x1e/0x70
 [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40
 [<ffffffff882d363b>] :ohci_hcd:finish_urb+0x5b/0xe0
 [<ffffffff882d37ae>] :ohci_hcd:takeback_td+0xee/0x110
 [<ffffffff882d38ac>] :ohci_hcd:dl_done_list+0xdc/0x170
 [<ffffffff882d6452>] :ohci_hcd:ohci_irq+0x1e2/0x370
 [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60
 [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70
 [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110
 [<ffffffff8020f9db>] do_IRQ+0x7b/0x100
 [<ffffffff8020c921>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff880052c0>] :processor:acpi_processor_idle+0x2e0/0x4c0
 [<ffffffff880052bc>] :processor:acpi_processor_idle+0x2dc/0x4c0
 [<ffffffff88004fe0>] :processor:acpi_processor_idle+0x0/0x4c0
 [<ffffffff8020af50>] default_idle+0x0/0x40
 [<ffffffff8020aff7>] cpu_idle+0x67/0xd0
 [<ffffffff805aabba>] start_kernel+0x2aa/0x330
 [<ffffffff805aa117>] _sinittext+0x117/0x120


Code: 8b 03 85 c0 7e f1 eb a3 e8 ba e4 ff ff eb d4 0f 1f 84 00 00
Kernel panic - not syncing: Aiee, killing interrupt handler!


Update: It crashes after at most half an hour's normal use or so. The screen doesn't go black anymore, but everything freezes (caps lock LED blinking).
Comment 9 Anonymous Emailer 2007-11-21 14:38:44 UTC
Reply-To: akpm@linux-foundation.org

On Wed, 21 Nov 2007 13:02:58 -0800 (PST)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9335
> 

This USB crash is a post-2.6.23 regression.

> 
> marcus@better.se changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>             Summary|system hangs with blank     |system hangs after a few
>                    |screen after some time      |minutes
> 
> 
> 
> 
> ------- Comment #8 from marcus@better.se  2007-11-21 13:02 -------
> (In reply to comment #6)
> > You can try nmi_watchdog=2, this will set up APIC based watchdog.
> 
> That was better, I got the oops:
> 
> netconsole: network logging started
> NMI Watchdog detected LOCKUP on CPU 0
> CPU 0
> Modules linked in: netconsole configfs i915 drm rfcomm l2cap xfrm_user
> xfrm4_tunnel af_key xfrm4_mode_tunnel nfsd exportfs autofs4
> cpufreq_conservative cpufreq_userspace cpufreq_stats cpufreq_powersave
> rpcsec_gss_krb5 auth_rpcgss tunnel4 ipcomp nfs esp4 lockd ah4 nfs_acl sunrpc
> deflate zlib_deflate twofish_x86_64 twofish_common camellia serpent blowfish
> des_generic xcbc sha1_generic crypto_null hmac crypto_hash ppp_async
> crc_ccitt
> fuse ipv6 ppp_generic nls_utf8 slhc ntfs xfs pl2303 option usbserial kqemu
> coretemp cpufreq_ondemand acpi_cpufreq freq_table snd_seq_dummy snd_seq_oss
> snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_seq_device arc4 ecb
> ohci_hcd snd_hda_intel snd_pcm_oss snd_mixer_oss iwl3945 snd_pcm pcmcia
> snd_timer firmware_class snd thinkpad_acpi mousedev mac80211 hci_usb
> soundcore
> hwmon serio_raw snd_page_alloc bluetooth video backlight output yenta_socket
> rsrc_nonstatic pcmcia_core button i2c_i801 cfg80211 pcspkr iTCO_wdt nvram rtc
> psmouse evdev ext3 jbd mbcache sha256_generic aes_x86_64 aes_generic cbc
> blkcipher dm_crypt dm_mirror dm_snapshot dm_mod firewire_ohci firewire_core
> crc_itu_t uhci_hcd ehci_hcd usbcore tg3 sr_mod cdrom sd_mod thermal processor
> fan
> Pid: 0, comm: swapper Not tainted 2.6.24-rc2-melech #3
> RIP: 0010:[<ffffffff804876c9>]  [<ffffffff804876c9>] _spin_lock+0x59/0x70
> RSP: 0018:ffffffff805ecc18  EFLAGS: 00000002
> RAX: 0000000000000001 RBX: ffffffff8807afd0 RCX: ffff81000113c8b8
> RDX: 0000000000020004 RSI: ffff810004af0028 RDI: 0000000000000001
> RBP: ffff81000427cd48 R08: ffff810005a713e0 R09: 00000000ffffff8d
> R10: 0000000000000000 R11: ffff810003add80c R12: ffff8100042e2780
> R13: ffff810004af0028 R14: ffff81000427cc00 R15: ffff8100042e27b0
> FS:  0000000000000000(0000) GS:ffffffff80591000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00002ae0240a2a08 CR3: 000000000b81a000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffffffff805a2000, task ffffffff8055b340)
> Stack:  ffff810004af0028 ffffffff88058d60 ffff81000427cdd0 ffffffff8808cbea
>  00000000042a9940 00000000030007ff ffff8100042e2780 ffff8100042a9ac0
>  ffff810004af0028 ffff810005a713e8 ffff81000427cd48 ffffffff8808d15a
> Call Trace:
>  <IRQ>  [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40
>  [<ffffffff8808cbea>] :uhci_hcd:uhci_giveback_urb+0x9a/0x220
>  [<ffffffff8808d15a>] :uhci_hcd:uhci_scan_schedule+0x29a/0x990
>  [<ffffffff8808f77e>] :uhci_hcd:uhci_irq+0xbe/0x1a0
>  [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60
>  [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70
>  [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110
>  [<ffffffff8020f9db>] do_IRQ+0x7b/0x100
>  [<ffffffff8020c921>] ret_from_intr+0x0/0xa
>  [<ffffffff8048768e>] _spin_lock+0x1e/0x70
>  [<ffffffff88058d60>] :usbcore:usb_hcd_unlink_urb_from_ep+0x10/0x40
>  [<ffffffff882d363b>] :ohci_hcd:finish_urb+0x5b/0xe0
>  [<ffffffff882d37ae>] :ohci_hcd:takeback_td+0xee/0x110
>  [<ffffffff882d38ac>] :ohci_hcd:dl_done_list+0xdc/0x170
>  [<ffffffff882d6452>] :ohci_hcd:ohci_irq+0x1e2/0x370
>  [<ffffffff880591ed>] :usbcore:usb_hcd_irq+0x2d/0x60
>  [<ffffffff802752b4>] handle_IRQ_event+0x34/0x70
>  [<ffffffff80276add>] handle_fasteoi_irq+0x8d/0x110
>  [<ffffffff8020f9db>] do_IRQ+0x7b/0x100
>  [<ffffffff8020c921>] ret_from_intr+0x0/0xa
>  <EOI>  [<ffffffff880052c0>] :processor:acpi_processor_idle+0x2e0/0x4c0
>  [<ffffffff880052bc>] :processor:acpi_processor_idle+0x2dc/0x4c0
>  [<ffffffff88004fe0>] :processor:acpi_processor_idle+0x0/0x4c0
>  [<ffffffff8020af50>] default_idle+0x0/0x40
>  [<ffffffff8020aff7>] cpu_idle+0x67/0xd0
>  [<ffffffff805aabba>] start_kernel+0x2aa/0x330
>  [<ffffffff805aa117>] _sinittext+0x117/0x120
> 
> 
> Code: 8b 03 85 c0 7e f1 eb a3 e8 ba e4 ff ff eb d4 0f 1f 84 00 00
> Kernel panic - not syncing: Aiee, killing interrupt handler!
> 
> 
> Update: It crashes after at most half an hour's normal use or so. The screen
> doesn't go black anymore, but everything freezes (caps lock LED blinking).
> 
Comment 10 Alan Stern 2007-11-21 14:50:59 UTC
I don't have time to make a patch right now.  However I think you'll find that converting all the spin_lock() and spin_unlock() calls in drivers/usb/host/ohci-hcd.c:ohci_irq() to spin_lock_irqsave() and spin_unlock_irqrestore() will solve the problem.

A similar change is probably needed in ehci-hcd.c:ehci_irq().
Comment 11 Alan Stern 2007-11-26 08:37:37 UTC
Created attachment 13762 [details]
Use spin_lock_irqsave in ohci_irq

Here's the patch I promised.  See if it helps.
Comment 12 Ingo Molnar 2007-11-30 06:36:05 UTC
Also, booting with CONFIG_DEBUG_KERNEL=y and CONFIG_PROVE_LOCKING=y would probably have reported this bug before the lockup, making analysis of this bug quite a bit easier.

Perhaps we should turn lockdep on by default in -rc's?
Comment 13 Marcus Better 2007-12-05 00:31:45 UTC
(In reply to comment #12)
> Also, booting with CONFIG_DEBUG_KERNEL=y and CONFIG_PROVE_LOCKING=y would
> probably have reported this bug before the lockup,

Those options turned up the following:

BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.

This is with 2.6.24-rc4 (with the patch).
Comment 14 Marcus Better 2007-12-05 00:32:56 UTC
Created attachment 13856 [details]
dmesg output with 2.6.24-rc4

Full dmesg output with the lockdep BUG message.
Comment 15 Marcus Better 2007-12-05 01:58:11 UTC
> ------- Comment #11 from stern@rowland.harvard.edu  2007-11-26 08:37

> Here's the patch I promised.  See if it helps.

I'm testing it with 2.6.23-rc4. So far so good after two and a half hours.
Comment 16 Marcus Better 2007-12-05 07:28:34 UTC
Still working after 8 hours. Looks good.
Comment 17 Alan Stern 2007-12-05 11:50:34 UTC
Created attachment 13871 [details]
Add the missing IRQF_DISABLED flag

Try using this patch instead of the earlier one.  It's a simpler and more general fix.
Comment 18 Marcus Better 2007-12-06 01:06:00 UTC
> ------- Comment #17 from stern@rowland.harvard.edu  2007-12-05 11:50 -------
> Try using this patch instead of the earlier one.  It's a simpler and more
> general fix.


Seems to work, it's running for four hours now.

Marcus
Comment 19 Marcus Better 2007-12-13 23:27:16 UTC
(In reply to comment #17)
> Add the missing IRQF_DISABLED flag
> 
> Try using this patch instead of the earlier one.  It's a simpler and more
> general fix.

It seems to work, no problems so far.
Comment 20 Alan Stern 2007-12-14 08:07:07 UTC
The patch has been submitted for 2.6.24.  You can go ahead and close out this bug report.
Comment 21 Rafael J. Wysocki 2007-12-17 14:29:42 UTC
Fixed by:

commit 442258e2ff69276ff767f3703b30ce6a31fdd181
Author: Alan Stern <stern@rowland.harvard.edu>
Date:   Thu Dec 6 14:47:08 2007 -0500

    USB: use IRQF_DISABLED for HCD interrupt handlers

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=442258e2ff69276ff767f3703b30ce6a31fdd181

Note You need to log in before you can comment on or make changes to this bug.