Bug 5140

Summary: none of the 2.6.13-rc? kernels survives longer than 4 days.
Product: IO/Storage
Reporter: Danny ter Haar (osdl)
Component: SCSI
Assignee: Diego Calleja (diegocg)
Status: REJECTED DUPLICATE    
Severity: high    
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.13-rc7-git1
Subsystem:
Regression: ---
Bisected commit-id:

Description Danny ter Haar 2005-08-27 11:08:31 UTC
Most recent kernel where this bug did not occur: 2.6.12-mm1
Distribution: debian-amd64
Hardware Environment: Tyan TA26 B2882
http://www.tyan.com/products/html/ta26b2882.html
gig-E (FO) coupled with internet (frontend)
gig-E (onboard copper) coupled with backend
Software Environment: usenet gateway
Problem Description:
The kernel crashed after 20 hours of usage.
Steps to reproduce:

just run it and wait

More info is on:

http://newsgate.newsserver.nl/kernel/
The config file is there, as is the kernel output from the serial console
when things went wrong.

Comment 1 Martin J. Bligh 2005-08-27 15:31:54 UTC
DevQ(0:1:0): NMI Watchdog detected LOCKUP on CPU0CPU 0
Modules linked in: rawfs rtc evdev hw_random i2c_amd8111 tg3 e100 mii w83627hf
eeprom lm85 i2c_sensor i2c_isa i2c_amd756 i2c_core psmouse
Pid: 168, comm: scsi_eh_0 Not tainted 2.6.13-rc7-git1
RIP: 0010:[<ffffffff802644f9>] <ffffffff802644f9>{serial_in+105}
RSP: 0018:ffff81007fc17b80  EFLAGS: 00000002
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 00000000000003fd RSI: 0000000000000005 RDI: ffffffff80473a40
RBP: 0000000000002705 R08: 0000000000000020 R09: 0000000000007930
R10: 0000000000000034 R11: 000000000000000a R12: ffffffff80473a40
R13: ffffffff8045f6fe R14: 000000000000000d R15: 000000000000000d
FS:  00002aaaab3cbe90(0000) GS:ffffffff80485800(0000) knlGS:00000000556ada40
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000515970 CR3: 000000007dc27000 CR4: 00000000000006e0
Process scsi_eh_0 (pid: 168, threadinfo ffff81007fc16000, task ffff8100033607c0)
Stack: ffffffff8026682d 0000000500000002 ffffffff803ebc60 0000000000007931
       000000000000000d 0000000000000096 0000000000000010 0000000000000046
       ffffffff8012ed9c 000000000000793e
Call Trace:<ffffffff8026682d>{serial8250_console_write+413}
<ffffffff8012ed9c>{__call_console_drivers+76}
       <ffffffff8012f053>{release_console_sem+339} <ffffffff8012fbc9>{vprintk+601}
       <ffffffff8012fbc9>{vprintk+601} <ffffffff8012fc3e>{printk+78}
       <ffffffff80325a40>{thread_return+0} <ffffffff8012fc3e>{printk+78}
       <ffffffff8028c235>{ahd_print_register+261}
<ffffffff802abc34>{ahd_platform_dump_card_state+100}
       <ffffffff80296b0d>{ahd_dump_card_state+8973}
<ffffffff802ad320>{ahd_linux_abort+624}
       <ffffffff802aa590>{ahd_linux_sem_timeout+0}
<ffffffff80284f5c>{scsi_error_handler+1324}
       <ffffffff8010e396>{child_rip+8} <ffffffff80284a30>{scsi_error_handler+0}
       <ffffffff8010e38e>{child_rip+0}

Comment 2 Patrick BURNAND 2006-03-18 03:31:05 UTC
Hello to all,

I was a victim of the same problem on my newly acquired dual-core AMD 64.  I'm
using SuSE 10.  The kernel was stable when I installed it.  Suddenly, after a
kernel update from SuSE using YaST, my system crashed every day, sometimes even
several times a day.  Nothing was found in the logs.

It usually happened like this:

After some minutes or hours, a hard disk switched off, then the hard disk LED
switched on and stayed on.  Judging by the fans running at maximum speed, I
guess that this triggered an infinite loop.  Everything was frozen and only a
hardware reboot worked.

My first thought was that this could come from an nVidia driver bug, so I
started the system without the X Window System, but it still crashed.

I also removed acpi, apm, microcode, mce, anything that I could think of.  It
continued to crash daily.

I then suspected an unfinished/beta/experimental driver.  So I recompiled the
kernel to remove unnecessary code.  I recompiled it 20 times, each time removing
unnecessary drivers or features in the hope of finding out which part fails and
reporting it.  I noticed that the kernel was faster and smaller, but this didn't
fix anything.  It continued to crash daily or even several times a day.

Since the xfs filesystem doesn't compile cleanly and the freezes usually
involved a hard disk switching off and remaining blocked, I thought xfs might be
corrupting kernel memory.  So I converted all my partitions to reiserfs and
finally removed xfs from the kernel.  It continued to crash.  (By the way, it's
not easy to make reliable backups when the kernel crashes randomly...  I had to
use find and cmp to make sure all files were backed up and that they were really
identical.)

Finally, I decided to update my kernel to the latest stable version, although
it's apparently not supported by SuSE.  I compiled and installed version
2.6.15.6 using my old configuration and it works perfectly now...

So to sum up:
A bug was introduced with the kernel provided in the latest SuSE 10
security update and is fixed in 2.6.15.6.

Comment 3 Diego Calleja 2006-07-30 09:16:30 UTC

*** This bug has been marked as a duplicate of 5154 ***