Bug 14515

Summary: ATA controller losing interrupt, system stall
Product: IO/Storage
Reporter: Lloyd Weehuizen (lloyd)
Component: IDE
Assignee: io_ide (io_ide)
Status: CLOSED DOCUMENTED
Severity: normal
CC: 21cnbao, alan, andrewnz.simpson, hancockrwd, tj
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.31
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: Kernel Log, PCI Devices

Description Lloyd Weehuizen 2009-10-31 01:19:37 UTC
Created attachment 23602 [details]
Kernel Log

I've just encountered a problem after moving from Ubuntu 9.04 to 9.10. The system stalls for long periods every now and again, and the kernel log shows lost interrupts on the ATA interface. This did not happen with the previous 2.6.28 kernel.

Here's a sample from the log.

[   86.816503] ata1: lost interrupt (Status 0x58)
[   86.820070] ata1: drained 32768 bytes to clear DRQ.
[   86.928979] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[   86.929068] ata1.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
[   86.929070]          cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
[   86.929072]          res 58/00:01:00:00:00/00:00:00:00:00/b0 Emask 0x2 (HSM violation)
[   86.929277] ata1.01: status: { DRDY DRQ }
[   86.929353] ata1: soft resetting link
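
For readers unfamiliar with these messages, the status byte can be decoded by hand. The following is a minimal Python sketch, not part of the original report, assuming the standard ATA status register bit layout; the names `ATA_STATUS_BITS` and `decode_status` are hypothetical. It lists only the bits that libata prints in its "status: { ... }" line, so other bits are ignored.

```python
# Status-register bits that libata names in its "status: { ... }" line.
# Other bits (such as 0x10) are deliberately left out of this table.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x08, "DRQ"),
    (0x01, "ERR"),
]


def decode_status(status):
    """Return the names of the listed status bits set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


# 0x58 is the status reported in the log sample above.
print(decode_status(0x58))  # ['DRDY', 'DRQ']
```

This agrees with the "status: { DRDY DRQ }" line in the log: the device is ready and still asserting DRQ, that is, it still expects a data transfer when the interrupt goes missing.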

This is on an Acer Aspire laptop with a PIIX chipset and an Intel Mobile 945 GMA.

The hard disk is a PATA WD600UE.

See attached kernel log and pci dump.
Comment 1 Lloyd Weehuizen 2009-10-31 01:20:11 UTC
Created attachment 23603 [details]
PCI Devices
Comment 2 Barry Song 2009-10-31 02:01:11 UTC
Yes. I've encountered the same "lost interrupt" problem as Lloyd Weehuizen after upgrading the kernel to 2.6.31. I am using an HP laptop.
I think recent commits must have broken support for some ATA controllers.
Comment 3 Lloyd Weehuizen 2009-10-31 23:44:12 UTC
Rolling back to kernel 2.6.30 reduces the occurrence of this issue, but it still occurs.
Comment 4 Andrew Simpson 2009-11-03 06:21:05 UTC
Ubuntu bug #445852

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/445852

Reported as affecting Acer Aspire One, Asus EEE and Dell Mini 9 (all SSD models). More prevalent with SSDs that have been upgraded by the user to newer models, but also reported on the stock 8GB SSD on the AA1.
Comment 5 Tejun Heo 2009-11-09 14:47:12 UTC
This has nothing to do with the hard drive.  It's the optical drive choking up on commands (TEST_UNIT_READY) used for polling for media presence events.  As it's PATA and the hard drive shares the bus with the optical drive, while the optical drive is choking, the hard drive can't access the bus, so hard drive access also stalls.
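
One way to check a kernel log for this pattern is sketched below in Python. It is only an illustration under stated assumptions, not a tool from this bug: it relies on the libata error format shown in the description, where a failed "cmd a0/..." line (0xa0 is the ATA PACKET command) is followed by a "cdb" line whose first byte is the SCSI opcode (0x00 is TEST UNIT READY). The names `CMD_RE`, `CDB_RE` and `find_failed_tur` are hypothetical.

```python
import re

# Matches e.g. "[   86.929068] ata1.01: cmd a0/00:00:..."; the timestamp is optional.
CMD_RE = re.compile(r"(ata\d+\.\d+): cmd ([0-9a-f]{2})/")
# Matches the first CDB byte on the following "cdb 00 00 ..." line.
CDB_RE = re.compile(r"\bcdb ([0-9a-f]{2})\b")


def find_failed_tur(log_text):
    """Return the devices whose failed command was PACKET with CDB opcode 0x00."""
    hits = []
    pending = None  # device named on the most recent PACKET "cmd" line
    for line in log_text.splitlines():
        m = CMD_RE.search(line)
        if m:
            # Only PACKET (0xa0) commands carry a CDB worth inspecting.
            pending = m.group(1) if m.group(2) == "a0" else None
            continue
        m = CDB_RE.search(line)
        if m and pending:
            if m.group(1) == "00":  # opcode 0x00 = TEST UNIT READY
                hits.append(pending)
            pending = None
    return hits
```

Run against the log sample in the description, this would report ata1.01, the second device on the first port, rather than the hard disk.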

Disabling media presence polling should work around the problem for now.  Long term, I think we shouldn't use TUR for media presence polling.  Windows doesn't use it, and devices with crappy firmware which choke on repeated TUR aren't too rare.  That's in the pipeline but unfortunately I don't think it will be ready for this release cycle.  It's gonna be a pretty pervasive change.  :-(

So, for now, "hal-disable-polling --device /dev/sr0" seems like the only solution.

Thanks.
Comment 6 Robert Hancock 2009-11-09 23:58:02 UTC
AFAIK the cdrom code normally uses GET EVENT STATUS NOTIFICATION for this rather than TEST UNIT READY. It would be interesting to know where exactly the TUR is coming from..
Comment 7 Tejun Heo 2009-11-10 00:41:38 UTC
The TUR is coming from open.  I have patches to implement in-kernel polling so that polling doesn't have to go through open, but they depend on workqueue patches which are yet to be merged.  I think I can pull it for 2.6.34 but I might be too optimistic.  :-P

Thanks.
Comment 8 Andrew Simpson 2009-11-10 06:38:36 UTC
O.K., it would appear then that this is not related to Ubuntu bug #445852, since the machines in this bug report don't have cdrom drives.

Trying "hal-disable-polling --device /dev/sr0" just gives an error.

Should I open a different bug report?
Comment 9 Tejun Heo 2009-11-10 09:30:57 UTC
Andrew, yes, please do.  Please attach a kernel log with printk timestamps turned on, covering both boot and the failures, along with the output of "lspci -nn".

Thanks.