Bug 14171 - ata exception and hang when booting, at T=n.81606 s
Summary: ata exception and hang when booting, at T=n.81606 s
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Jeff Garzik
URL: https://bugs.edge.launchpad.net/bugs/...
Keywords:
Depends on:
Blocks:
 
Reported: 2009-09-13 11:00 UTC by Tormod Volden
Modified: 2012-06-13 17:01 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.31
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg from the case (2) where boot continues after 30s (46.45 KB, text/plain)
2009-09-13 11:00 UTC, Tormod Volden
Details
lspci -v output (7.67 KB, text/plain)
2009-09-14 20:48 UTC, Tormod Volden
Details
netconsole log (54.01 KB, text/plain)
2009-09-14 20:49 UTC, Tormod Volden
Details
/var/log/messages from 2 boots (165.21 KB, text/plain)
2009-09-16 08:03 UTC, Tormod Volden
Details
bootchart diagram with 40s benign hang (606.86 KB, image/png)
2009-09-16 21:34 UTC, Tormod Volden
Details

Description Tormod Volden 2009-09-13 11:00:36 UTC
Created attachment 23084 [details]
dmesg from the case (2) where boot continues after 30s

I have seen some "ata exception" hangs for several months but it has much worse after 2.6.28. I have found many similar bug reports and understand that many of them are hardware issues. However in my case I never have problems once the system has successfully booted. Moreover, the exceptions all show a peculiar timestamp with the same sub-second value:

Aug 21 23:35:34 acer-tormod kernel: [ 51.816054] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 24 23:55:50 acer-tormod kernel: [ 53.816050] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 25 19:14:45 acer-tormod kernel: [ 44.816064] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 4 20:26:07 acer-tormod kernel: [ 46.816045] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 5 11:42:30 acer-tormod kernel: [ 52.816068] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 5 11:48:56 acer-tormod kernel: [ 52.816056] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 5 12:10:56 acer-tormod kernel: [ 52.816056] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 5 12:16:47 acer-tormod kernel: [ 52.816056] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

I am curious to know if the freeze is triggered at specific times or if it is the freeze detection that is done at 1 second intervals.

Currently I see two cases:
1) Hang when udev starts up and a handful of drivers are loaded. This is most of the time fatal, the bus resets can not recover the access to the HD and I have to sysrq-B.
2) Hang when X is starting up. At the same time I know that hdparm is run, but I am not sure I can isolate it to this. Most often, after 30s hang, the bus recovers and the boot continues as normal.

I wonder if the fatality of (1) is due to the firmware loading of ipw2200 that makes some kind of deadlock or just stresses the bus so that it can not recover as it would do in (2).

I saw some comments in other reports about interrupts sharing but I haven't seen a difference in /proc/interrupts between successful boots and those in (2).

As can be seen in the original Ubuntu bug, I have tried blacklisting different modules and noapic, nolapic, acpi=noirq, acpi_irq_nobalance, acpi_irq_balance without success. The fact that it is an intermittent issue makes this debugging long and difficult of course. I see when booting that the kernel modules loaded when udev starts are not always in the same order, and maybe this accounts for the randomness.

I am not sure which component this should be filed against but since the message is from ata and SATA is used for this IDE disk I start here. Please shuffle it around if needed.
Comment 1 Tejun Heo 2009-09-14 09:17:37 UTC
The failures probably don't have much to do with other activities.  I suspect the only difference between 1) and 2) would be that ubuntu initrd isn't ready to handle such timeout during boot.  This is PATA drive connected over PATA-SATA bridge, right?  Does libata.force=1:UDMA/66 make any difference?
Comment 2 Tormod Volden 2009-09-14 20:46:27 UTC
> I suspect the only difference between 1) and 2) would be that ubuntu initrd
> isn't ready to handle such timeout during boot.

Thanks for your comment! No, in fact this after the initrd: I have to correct myself, it is not when udev is started for the first time, it is when "udevadmin trigger" is run in /etc/rcS.d/S10udev. The user-space tries to continue after some unsuccessful bus resets, but fails with I/O errors cause the HD is hosed.

> This is PATA drive connected over PATA-SATA bridge, right?

The IDE interface is a "SATA Controller" (see attached lspci output).

> Does libata.force=1:UDMA/66 make any difference?

No, then it hangs quite consistently while gdm is starting, but unless (2) above, this is fatal like (1). I also tried /33.

Do you have any comment on the regularity of the timestamps? I haven't seen this in other "ata exception" reports.
Comment 3 Tormod Volden 2009-09-14 20:48:18 UTC
Created attachment 23093 [details]
lspci -v output
Comment 4 Tormod Volden 2009-09-14 20:49:20 UTC
Created attachment 23094 [details]
netconsole log
Comment 5 Tejun Heo 2009-09-16 02:02:04 UTC
Tormod, the controller is SATA but the drive is PATA and libata detects and applies bridge limit.  I was trying to confirm this actually was the case.

[    1.564820] ata_piix 0000:00:1f.2: MAP [ P0 P2 IDE IDE ]
                                            ^^^^^ ^^^^^^^^
                          two sata ports on primary ide, slave is PATA

[    1.730989] ata1.00: ATA-8: SAMSUNG HM160HC, LQ100-10, max UDMA/100

HM160HC is a PATA drive.

http://www.samsung.com/global/business/hdd/productmodel.do?type=62&subtype=67&model_cd=294

[    1.731040] ata1.00: 312581808 sectors, multi 16: LBA48 
[    1.731102] ata1.00: applying bridge limits

and libata detects that and applies bridge limit.  I still think the most likely culprit is the bridge and in your last log libata.force parameter isn't being applied can you please try that again?  libata will print out messages indicating that it's forcing the specified transfer rate.  If UDMA/66 doesn't work, please try UDMA/33 too.

Thanks.
Comment 6 Tejun Heo 2009-09-16 02:03:14 UTC
Oh.. and the time stamp.  I wouldn't pay too much attention to it.  Early boot procedure can be pretty deterministic and kernel tries to round up timeout delays so that many of them fall on the same tick to save power.
Comment 7 Tormod Volden 2009-09-16 08:03:19 UTC
Created attachment 23099 [details]
/var/log/messages from 2 boots

As I said in comment 2, it would hang as much or worse with UDMA/66 and UDMA/33, see attached log (from disk, so you won't see the exception errors). The netconsole.log from comment 4 was an older one.

I have since switched to upstart, so the boot runs different, and I could now boot successfully with UDMA/66 (or with default value) but that doesn't say much since the issue is intermittent. I have the feeling small timing differences at boot makes for the hangs.
Comment 8 Tejun Heo 2009-09-16 10:59:25 UTC
Ah... okay, so UDMA/66 or 33 doesn't make that much of difference.  Tormod, to me, it just seems the bridge is malfunctioning from time to time.  The initial failure itself could be happening for a variety of reasons including power issue or whatever (does having the power cord plugged in or not make any difference?), but libata EH should be able to recover it by issuing resets.  Most SATA bridge chips don't behave too well and it's not too uncommon for them to behave erratically after a failure and your controller doesn't have access for SCRs so the driver can't issue hardreset.  It's an unfortunate combination but I doubt there's anything driver can do to fix or work around the problem.  :-(
Comment 9 Tormod Volden 2009-09-16 19:08:06 UTC
Ok. But I still wonder why this only happens at boot, and only in recent kernels... After booting up I can fsck, copy big files between partitions, compile, etc as much as I want without hangs.

BTW, it is the same with/without AC power. I have used this laptop for four years but replaced the HD one year ago. But this hangs became an issue only after 2.6.28. So I guess it is a combination of controller, disk, kernel, and early userspace.
Comment 10 Tormod Volden 2009-09-16 21:34:09 UTC
Created attachment 23105 [details]
bootchart diagram with 40s benign hang

Here is seems like it got stuck when many blkid and modprobe were running, also firmware.sh was running over this time, but from the bootchart colour it was sleeping and not waiting for i/o.
Comment 11 Tejun Heo 2009-09-17 09:56:20 UTC
Hmmm... So if you keep the current userland identical and use 2.6.27, the occasional hang go away?
Comment 12 Tormod Volden 2009-09-17 11:22:08 UTC
Thanks for the suggestion. I will try it out for a while. I think it got worse in Ubuntu 9.04 after I traded the stock 2.6.28 kernel with 2.6.30 or 2.6.31. So to start with I will try the 2.6.28 kernel in my Ubuntu 9.10 installation.

Since I try out so many kernels and xorg drivers I am not sure exactly when this became a problem. I remember seeing I/O errors once in February and thought the disk was dying, but maybe that was a hang like here. And for a while I guess I would just see the machine locked up and reboot it, while it was just hanging and the "ata exception" messages were hidden by X or usplash.

I should mention the HD LED indicator: When the disk hangs, it is constantly lit. After the 30 s timeout the light goes off, and stays off in the "fatal" case.
Comment 13 Tejun Heo 2009-09-17 15:50:23 UTC
I'm not entirely sure how that LED is wired but it's most likely coming from the controller and is controlled by whether BSY bit is set or not.  Sending SRST clears it so the behavior isn't unexpected.  Please keep me posted on how different versions of kernel work.  Thanks.
Comment 14 Tormod Volden 2009-09-19 08:42:25 UTC
I tried the 2.6.28-15-generic kernel (from Ubuntu 9.04) and it hung on the first try. Then 2.6.27-02062731-generic (a "mainline" build) which I booted twice without problems. I noticed that there was no radeon in /proc/interrupts so I got suspicious. You would think that after all the xorg testing I would be quick to blame drm modules... So now I am running 2.6.31 again but with the radeon.ko module removed (blacklisting was not working) and I have so far rebooted a dozen times without hangs. This is significant but I can not be 100% sure the hangs are gone.

Is there a way I can force radeon to use its own interrupt? I can also try hacking the module to not enable interrupts.
Comment 15 Tejun Heo 2009-09-25 04:14:27 UTC
I don't think there's a way to do that from the kernel.  You can try to put it in a different slot or tweak BIOS setting to change IRQ routing.  Do your test results still indicate that radeon.ko is the culprit?
Comment 16 Lloyd Weehuizen 2009-10-29 22:25:56 UTC
I've just encountered this bug when moving from Ubuntu 9.04 to 9.10. There are large stalls in the bootup process and I get similar errors.

[   86.816503] ata1: lost interrupt (Status 0x58)
[   86.820070] ata1: drained 32768 bytes to clear DRQ.
[   86.928979] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[   86.929068] ata1.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
[   86.929070]          cdb 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
[   86.929072]          res 58/00:01:00:00:00/00:00:00:00:00/b0 Emask 0x2 (HSM violation)
[   86.929277] ata1.01: status: { DRDY DRQ }
[   86.929353] ata1: soft resetting link

This is on an Acer Aspire laptop with a PIIX chipset and an Intel Mobile 945 GMA. So its likely not radeon related?

The hard disk is a PATA WD600UE.

I do not encounter this issue when booting the old 9.04 kernel (2.6.28).
Comment 17 Tormod Volden 2009-10-29 23:19:47 UTC
I have found that my freezes seem to happen at the same times as hdparm is run. Karmic used to run hdparm excessively but it is not cleaned up quite a bit. Lloyd, try booting with the "nohdparm" boot option. See the linked Ubuntu bug for more info.
Comment 18 Lloyd Weehuizen 2009-10-29 23:31:28 UTC
Thanks for the tip Tormod. Unfortunately that makes no difference, I even tried renaming hdparm with no luck.
Comment 19 Tejun Heo 2009-10-30 00:42:55 UTC
Lloyd, your problem doesn't have much to do with this one.  Can you please file a separate bug report and attach the output of "lspci -nn" and full kernel log including booting messages?

Tormod, what's the offending hdparm command?

Thanks.
Comment 20 Tormod Volden 2009-10-30 12:15:05 UTC
These hdparm scripts call first "hdparm -i /dev/sda" to grep for APM capabilities, then "hdparm -B 254 /dev/sda" to set head load/unload frequency.

I _think_ the first one is as dangerous as the second, but I haven't tried yet to let only one run.
Comment 21 Tejun Heo 2009-10-30 16:32:42 UTC
"hdparm -i" isn't dangerous but doing "hdparm -B 254" blindly definitely can be dangerous.
Comment 22 Tormod Volden 2009-10-30 17:43:18 UTC
I just meant dangerous as in triggering this hang. Once the system is booted, there is no danger in running "hdparm -B 254" on this drive.
Comment 23 Tejun Heo 2009-10-31 13:37:07 UTC
Okay, but your problem was caused by radeon driver, right?  I'm getting a bit confused here.

Thanks.
Comment 24 Tormod Volden 2009-10-31 23:27:18 UTC
This issue has always been confusing to me :) I often thought I had narrowed it down to one thing, and then all of a sudden it reappears anyway. But now I think it only happens in conjunction with hdparm, and running hdparm at a "bad" moment. Just running hdparm when the system is idle is harmless.

It just happened again now on thawing from hibernation. Unfortunately the /proc/uptime numbers (which I log in my hdparm wrapper) seem to be skewed compared to the kernel timestamps in dmesg (with the ATA exception), so I can not "prove" it is hdparm. Is there a way I can access the time that the kernel is using? Otherwise I guess I can trigger some kernel message to get in sync and back-calculate.
Comment 25 Tejun Heo 2009-11-01 09:25:04 UTC
Turning on CONFIG_PRINTK_TIME and CONFIG_MAGIC_SYSRQ should do the trick.  Every kernel message will be time stamped and you can trigger a message using sysrq.  So, the problem can be reproduced without radeon.ko?  Hmmm....
Comment 26 Tormod Volden 2009-11-01 11:42:25 UTC
I have CONFIG_PRINTK_TIME=y (and forgot I could simply use sysrq to get a sync point :) ) but I wonder if there is a way to get the same time as the one in printk_time through /proc, for my hdparm wrapper? For now I will just add echo p > /proc/sysrq-trigger to it.
 
A few times I have had these hangs before X is starting ("udevadmin trigger" like in comment 2, or comment 10), but I guess radeon.ko is already loaded at that point. I could blacklist radeon.ko but at the current rate it would take weeks before I can say if that helps.
Comment 27 Tormod Volden 2010-06-27 10:13:12 UTC
Just want to let you know that this is still happening with 2.6.35-rc3. I see it more often now, but again this can be user-space changes which trigger it more easily. (BTW, although it doesn't mean much as explained in comment #6, the time stamp is not always n.81606 any longer).
Comment 28 Tormod Volden 2010-10-02 09:53:29 UTC
After fighting with graphics related lock-ups in https://bugs.freedesktop.org/show_bug.cgi?id=28474 I am now using the patch from https://bugs.freedesktop.org/show_bug.cgi?id=29389 to disable drm polling. Since then I have only seen these ata hangs if I boot an unpatched kernel or without the drm_kms_helper.poll=0 option. So my theory is that the polling easily triggers this ata condition.
Comment 29 Tejun Heo 2010-10-03 15:32:11 UTC
Thanks for keeping track of the issue.  It definitely looks like the graphics controller is causing a havoc on interrupt delivery.  Given that this is an isolated case (among ATA bug reports), at this point, I don't believe there is much to be done from libata.

Note You need to log in before you can comment on or make changes to this bug.