Bug 5154

Summary: 2.6.13 crashes on heavily used server
Product: IO/Storage
Reporter: Danny ter Haar (osdl)
Component: SCSI
Assignee: Diego Calleja (diegocg)
Status: CLOSED PATCH_ALREADY_AVAILABLE
Severity: high
CC: andi-bz, jejb
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.13-rc7-git1
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: serial console: touch NMI watchdog

Description Danny ter Haar 2005-08-30 01:41:25 UTC
Most recent kernel where this bug did not occur: 2.6.12-mm1
Distribution: debian-pure64
Hardware Environment: tyan b2882 amd64
Software Environment:
Problem Description:
DevQ(0:0:0): 0 waiting
DevQ(0:1:0): NMI Watchdog detected LOCKUP on CPU0CPU 0
Modules linked in: rawfs rtc evdev hw_random i2c_amd8111 tg3 e100 mii w83627hf
eeprom lm85 i2c_sensor i2c_isa i2c_amd756 i2c_core psmouse
Pid: 168, comm: scsi_eh_0 Not tainted 2.6.13-rc7-git1
RIP: 0010:[<ffffffff802644f9>] <ffffffff802644f9>{serial_in+105}
RSP: 0018:ffff81007fc17b80  EFLAGS: 00000002
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff80473a40
RBP: 0000000000002705 R08: 0000000000000020 R09: 0000000000007930
R10: 0000000000000034 R11: 000000000000000a R12: ffffffff80473a40
R13: ffffffff8045f6fe R14: 000000000000000d R15: 000000000000000d
FS:  00002aaaab3cbe90(0000) GS:ffffffff80485800(0000) knlGS:00000000556ada40
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000515970 CR3: 000000007dc27000 CR4: 00000000000006e0
Process scsi_eh_0 (pid: 168, threadinfo ffff81007fc16000, task ffff8100033607c0)
Stack: ffffffff8026682d 0000000500000002 ffffffff803ebc60 0000000000007931
       000000000000000d 0000000000000096 0000000000000010 0000000000000046
       ffffffff8012ed9c 000000000000793e
Call Trace:<ffffffff8026682d>{serial8250_console_write+413}
<ffffffff8012ed9c>{__call_console_drivers+76}
       <ffffffff8012f053>{release_console_sem+339} <ffffffff8012fbc9>{vprintk+601}
       <ffffffff8012fbc9>{vprintk+601} <ffffffff8012fc3e>{printk+78}
       <ffffffff80325a40>{thread_return+0} <ffffffff8012fc3e>{printk+78}
       <ffffffff8028c235>{ahd_print_register+261}
<ffffffff802abc34>{ahd_platform_dump_card_state+100}
       <ffffffff80296b0d>{ahd_dump_card_state+8973}
<ffffffff802ad320>{ahd_linux_abort+624}
       <ffffffff802aa590>{ahd_linux_sem_timeout+0}
<ffffffff80284f5c>{scsi_error_handler+1324}
       <ffffffff8010e396>{child_rip+8} <ffffffff80284a30>{scsi_error_handler+0}
       <ffffffff8010e38e>{child_rip+0}

Code: 0f b6 c0 c3 66 66 90 41 57 49 89 f7 41 56 41 55 41 bd 00 01
console shuts up ...
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

Steps to reproduce:

Comment 1 Andrew Morton 2005-08-30 02:24:38 UTC
I'd be assuming that the adaptec driver decided to do a great console
spew for some reason, and that took so long over the serial console that
the NMI watchdog triggered.

Try this patch, then let's see if you get a decent dump from the adaptec
driver.
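
To illustrate the reasoning above, here is a small user-space sketch of the failure mode. It is not the attached patch and not kernel code. All names (sim_watchdog, sim_tick, sim_touch, sim_console_write) and the tick counts are hypothetical, chosen only for illustration. A long busy-wait console write lets a watchdog counter run past its limit unless the writer periodically signals progress, which is the job touch_nmi_watchdog() does in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lockup watchdog; not a kernel structure. */
struct sim_watchdog {
    unsigned ticks_since_touch; /* ticks elapsed since progress was last signalled */
    unsigned limit;             /* ticks of silence after which a lockup is declared */
    bool tripped;               /* set once the limit has been reached */
};

/* Signal progress: reset the silence counter. */
static void sim_touch(struct sim_watchdog *wd)
{
    wd->ticks_since_touch = 0;
}

/* Advance time by one tick and check for a lockup. */
static void sim_tick(struct sim_watchdog *wd)
{
    if (++wd->ticks_since_touch >= wd->limit)
        wd->tripped = true;
}

/*
 * Simulate a polled console write of nchars characters, where each
 * character costs ticks_per_char ticks of busy-waiting. If touch is true,
 * the writer signals progress before every character.
 * Returns true if the watchdog would have declared a lockup.
 */
static bool sim_console_write(size_t nchars, unsigned ticks_per_char,
                              unsigned limit, bool touch)
{
    struct sim_watchdog wd = { 0, limit, false };

    for (size_t i = 0; i < nchars; i++) {
        if (touch)
            sim_touch(&wd);
        for (unsigned t = 0; t < ticks_per_char; t++)
            sim_tick(&wd);
    }
    return wd.tripped;
}
```

With a limit of 500 ticks, a 1000-character write at one tick per character trips the simulated watchdog when it never touches it. It does not trip when it touches it before each character. A short 100-character write stays under the limit either way.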

Comment 2 Andrew Morton 2005-08-30 02:26:00 UTC
Created attachment 5813
serial console: touch NMI watchdog

Comment 3 Danny ter Haar 2005-08-30 03:46:58 UTC
Yup, I did capture _some_ of the adaptec output.
Kernel log & configuration @ http://newsgate.newsserver.nl/kernel/
There is also an overview of how long each kernel lasted before crashing
(I tested quite a few).

OK, 2.6.13+nmi-patch is running.
It will crash within 24 hours... so let's wait ;-)

Danny

Comment 4 Diego Calleja 2006-07-30 09:16:35 UTC
*** Bug 5140 has been marked as a duplicate of this bug. ***

Comment 5 Diego Calleja 2006-07-30 09:17:47 UTC
It seems this bug can be closed because:

o According to #5140 (which is a duplicate of this bug, so this bug shouldn't
depend on it), the problem has been fixed in current kernels, and 2.6.15.6
onwards should work OK.

o akpm's proposed patch was merged months ago
(78512ece148992a5c00c63fbf4404f3cde635016).

o akpm's patch was merged on Nov 7. This means it was merged for 2.6.15, which
is probably why the reporter of #5140 fixed his problem by updating his kernel
to 2.6.15.6.

So, unless someone has a reason to reopen it, I'm closing this and #5140.