Bug 5700

Summary: Panic: Fatal exception in interrupt w/ Intel AHCI (repeatable)
Product: IO/Storage
Reporter: Lasse K (tronic+539x)
Component: Serial ATA
Assignee: Jeff Garzik (jgarzik)
Status: RESOLVED CODE_FIX    
Severity: high    
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.15
Subsystem:
Regression: ---
Bisected commit-id:

Description Lasse K 2005-12-04 12:58:25 UTC
This is possibly a hardware failure, but even so it should not crash the system.

Hardware: A Maxtor MaxLine II Plus, connected to Intel AHCI controller (Asus
P5AD2-E Premium / i925XE). Three other HDDs connected to the same controller,
many more HDDs on other controllers.

Steps to reproduce: read data from the faulty drive (when it is in its faulty
state). Hardly any data gets read (it's very slow) and then a kernel panic
occurs (after maybe 20 seconds). After a reset the crash is repeatable. After a
poweroff and touching the cables the disk can once again be read and written
fine, until it fails again at a later time.

This hasn't happened enough times for me to try guessing whether a poweroff or
replugging the loose SATA cables was what fixed it. I could use the HDD fine for
several days until it failed again.

None of the other HDDs are showing any signs of failure, while this particular
HDD has failed numerous times.

This is what I gathered of the panic message (numerical data omitted):
handle_IRQ_event
__do_IRQ
__wake_up_common
do_IRQ
common_interrupt
scsi_request_fn
blk_run_queue
scsi_next_command
scsi_end_request
scsi_io_completion
sd_rw_intr
ata_scsi_qc_complete
ata_qc_complete
ahci_eng_timeout
scsi_error_handler
ata_scsi_error
scsi_error_handler
kthread
kthread
kernel_thread_helper
<0>Kernel panic - not syncing. Fatal exception in interrupt

Comment 1 Lasse K 2006-01-20 09:22:14 UTC
Still happens with 2.6.15. Could someone kick Jeff Garzik? I'd really appreciate
a reply. I would post a full error message, but netconsole stops working when
init is started (it prints messages fine until "Freeing unused kernel memory",
but nothing after that line).

Comment 2 Lasse K 2006-01-20 09:38:59 UTC
Got netconsole working (by using /bin/sh as init). Here's the log from when the
crash occurs.

I'm reading data off the disk here, the reading stalls, and after a couple of
seconds this gets printed:

ata4: handling error/timeout
ata4: port reset, p_is 4000000 is 0 pis 4000000 cmd 4017 tf d0 ss 113 se 400000
ata4: status=0x50 { DriveReady SeekComplete }
sdd: Current: sense key=0x0
    ASC=0x0 ASCQ=0x0
Assertion failed! qc->flags & ATA_QCFLAG_ACTIVE,drivers/scsi/libata-core.c,ata_qc_complete,line=3513
------------[ cut here ]------------
kernel BUG at drivers/scsi/scsi.c:295!
invalid operand: 0000 [#1]
Modules linked in:
CPU:    0
EIP:    0060:[<c037dcaf>]    Not tainted VLI
EFLAGS: 00010046   (2.6.15-gentoo)
EIP is at scsi_put_command+0x8b/0x95
eax: f7ea7990   ebx: f7ed9b00   ecx: f7ed9b0c   edx: f7ed9b0c
esi: f7e9f000   edi: 00000282   ebp: f7ea7800   esp: f7ebdc80
ds: 007b   es: 007b   ss: 0068
Process scsi_eh_3 (pid: 950, threadinfo=f7ebc000 task=f7e82030)
Stack: f7ea79f8 c026301f f7ed9b00 f7ea7990 f7e71808 f7e71808 c0382684 f7ed9b00
       f73ebd94 f7ed9b00 00000286 c038279d f7ed9b00 00000001 00000000 f73ebd94
       00000000 00000000 f7ed9b00 c0382a8a f7ed9b00 00000001 00000000 00000001
Call Trace:
 [<c026301f>] kobject_get+0x17/0x1e
 [<c0382684>] scsi_next_command+0x2f/0x4f
 [<c038279d>] scsi_end_request+0xc3/0xe7
 [<c0382a8a>] scsi_io_completion+0x137/0x4d5
 [<c03936c9>] sd_rw_intr+0x13d/0x256
 [<c037e378>] scsi_finish_command+0x24/0xa4
 [<c038d9ca>] ata_scsi_qc_complete+0x5b/0xaf
 [<c038af41>] ata_qc_complete+0x3a/0xb4
 [<c038f681>] ahci_interrupt+0xe0/0x20a
 [<c01049dd>] do_IRQ+0x1e/0x24
 [<c010309a>] common_interrupt+0x1a/0x20
 [<c01391b0>] handle_IRQ_event+0x39/0x6d
 [<c0139243>] __do_IRQ+0x5f/0xc0
 [<c011a3db>] __wake_up_common+0x38/0x57
 [<c01049d8>] do_IRQ+0x19/0x24
 [<c010309a>] common_interrupt+0x1a/0x20
 [<c03835a7>] scsi_request_fn+0x1b1/0x2e2
 [<c02565eb>] blk_run_queue+0x3a/0x3c
 [<c038268c>] scsi_next_command+0x37/0x4f
 [<c038279d>] scsi_end_request+0xc3/0xe7
 [<c0382a8a>] scsi_io_completion+0x137/0x4d5
 [<c03936c9>] sd_rw_intr+0x13d/0x256
 [<c038da02>] ata_scsi_qc_complete+0x93/0xaf
 [<c038af41>] ata_qc_complete+0x3a/0xb4
 [<c038f575>] ahci_eng_timeout+0x83/0xae
 [<c0381b54>] scsi_error_handler+0x0/0xa0
 [<c038d175>] ata_scsi_error+0x17/0x2b
 [<c0381bd8>] scsi_error_handler+0x84/0xa0
 [<c038d175>] ata_scsi_error+0x17/0x2b
 [<c0381bd8>] scsi_error_handler+0x84/0xa0
 [<c012fb4c>] kthread+0xb4/0xea
 [<c012fa98>] kthread+0x0/0xea
 [<c0101329>] kernel_thread_helper+0x5/0xb
Code: 5c 24 08 8b 74 24 0c 89 44 24 1c 8b 7c 24 10 8b 6c 24 14 83 c4 18 e9 a8 11 f6 ff 89 43 0c 31 db 89 48 04 89 4e 14 89 51 04 eb b7 <0f> 0b 27 01 f6 27 5e c0 eb 95 57 56 53 83 ec 28 8b 74 24 38 8d
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 <6>SysRq : Terminate All Tasks
SysRq : Terminate All Tasks
SysRq : Kill All Tasks
SysRq : Emergency Remount R/O
SysRq : Emergency Sync
SysRq : Power Off
SysRq : Power Off
SysRq : Power Off

The system is dead.
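
The call trace above is long, and a short script can make its nesting easier to see. The following is a minimal Python sketch that parses an abridged excerpt of that trace and lists the functions that appear more than once. The helper names `Frame`, `parse_frames` and `repeated_functions` are hypothetical and introduced only for illustration. Traces from kernels of this era may include stale stack entries, so a repeated name is a hint of re-entry, not proof.

```python
import re
from collections import Counter
from typing import List, NamedTuple

# Abridged excerpt of the call trace from the log above (most recent frame first).
TRACE = """\
 [<c038d9ca>] ata_scsi_qc_complete+0x5b/0xaf
 [<c038af41>] ata_qc_complete+0x3a/0xb4
 [<c038f681>] ahci_interrupt+0xe0/0x20a
 [<c01049dd>] do_IRQ+0x1e/0x24
 [<c010309a>] common_interrupt+0x1a/0x20
 [<c03835a7>] scsi_request_fn+0x1b1/0x2e2
 [<c02565eb>] blk_run_queue+0x3a/0x3c
 [<c038268c>] scsi_next_command+0x37/0x4f
 [<c038da02>] ata_scsi_qc_complete+0x93/0xaf
 [<c038af41>] ata_qc_complete+0x3a/0xb4
 [<c038f575>] ahci_eng_timeout+0x83/0xae
"""

# Matches lines of the form " [<address>] symbol+0xoffset/0xsize".
FRAME_RE = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


class Frame(NamedTuple):
    address: int  # return address as printed in the trace
    func: str     # symbol name
    offset: int   # offset into the function


def parse_frames(text: str) -> List[Frame]:
    """Extract the frames from a 2.6-style i386 call trace, in printed order."""
    return [
        Frame(int(m.group(1), 16), m.group(2), int(m.group(3), 16))
        for m in FRAME_RE.finditer(text)
    ]


def repeated_functions(frames: List[Frame]) -> List[str]:
    """Return the sorted names of functions that occur more than once."""
    counts = Counter(frame.func for frame in frames)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    frames = parse_frames(TRACE)
    for name in repeated_functions(frames):
        offsets = [hex(f.offset) for f in frames if f.func == name]
        print(name, "appears at offsets", offsets)
```

On this excerpt the script reports `ata_qc_complete` and `ata_scsi_qc_complete`, each appearing once under `ahci_eng_timeout` and once under `ahci_interrupt`. That is consistent with the failed `ATA_QCFLAG_ACTIVE` assertion printed just before the BUG.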

Comment 3 Alan 2007-06-18 08:07:46 UTC
libata now has full error handling. Please re-open if this is still seen.