Bug 9393
Summary: | SError: { DevExch } occuring and causing disruption | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Avuton Olrich (avuton) |
Component: | Serial ATA | Assignee: | Tejun Heo (htejun) |
Status: | REJECTED INSUFFICIENT_DATA | ||
Severity: | normal | CC: | bunk, kristen.c.accardi |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | v2.6.24-rc2-g6e800af2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 9243 | ||
Attachments: |
dmesg showing offenders
config The config used for this system. lspci -vvv output smartctl -a New dmesg with times dmesg as requested. bug13596-clear-ehi_serror.patch dmesg-112107 dmesg-112207 dmesg-113007 dmesg-120707 |
Description
Avuton Olrich
2007-11-15 22:39:40 UTC
Created attachment 13573 [details]
dmesg showing offenders
Created attachment 13574 [details]
config The config used for this system.
Created attachment 13575 [details]
lspci -vvv output
A few additional things, it usually takes about 1 1/2 minutes to reset(?) (appx). Reply-To: akpm@linux-foundation.org (switched to email - please reply via emailed reply-to-all, not via the bugzilla web interface) On Thu, 15 Nov 2007 22:39:41 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=9393 > > Summary: SError: { DevExch } occuring and causing disruption > Product: IO/Storage > Version: 2.5 > KernelVersion: v2.6.24-rc2-g6e800af2 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Serial ATA > AssignedTo: jgarzik@pobox.com > ReportedBy: avuton@gmail.com > > > Most recent kernel where this bug did not occur: v2.6.23 Rafael, one for the regression list please. > Distribution: Gentoo > > Since running in the latest kernels I have been getting the following, but > not > often. When it happens the computer grinds to a complete halt. Cursor will > move > a small small amount every 20 seconds or so, making things pretty painful > until > they complete. > > v2.6.24-rc2-g6e800af2 was the first place I saw it, I didn't see it in > v2.6.24-rc1, but I was only on it briefly due to unrelated issues. > > This hasn't completely been a problem too often, it happens once every day, > maybe two even, and I would bisect it but this occurs so infrequently I'm > sure > I wouldn't get this bug report out until v2.6.24 :). > > Now that I'm looking over my dmesg I also notice: > atl1 0000:03:00.0: tx busy > which could also be the problem > > Any way I can help please let me know. > Yes, that's an ATA problem. (there is dmesg output and a few other things at the bugzilla URL). Hello, ata1.00: exception Emask 0x2 SAct 0x3fffb SErr 0x0 action 0x2 frozen ata1.00: spurious completions during NCQ issue=0x0 SAct=0x3fffb FIS=004040a1:00000004 ata1.00: cmd 61/00:00:ea:78:a8/04:00:0b:00:00/40 tag 0 cdb 0x0 data 524288 out res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation) ata1.00: status: { DRDY } ata1.00: cmd 61/00:08:9a:e3:a8/04:00:0b:00:00/40 tag 1 cdb 0x0 data 524288 out res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation) ata1.00: status: { DRDY } ata1.00: cmd 61/18:18:7a:61:a7/00:00:0b:00:00/40 tag 3 cdb 0x0 data 12288 out res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation) This one we know and are trying to determine what's going on. Do you happen to have smartd running? ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) ata1.00: configured for UDMA/133 ata1: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xb t4 ata1: irq_stat 0x00400040, connection status changed ata1: SError: { PHYRdyChg DevExch } This one very likely indicates hardware problem (e.g. dying power supply or air conditioner kicking in). Can you please turn on CONFIG_PRINTK_TIME and report the failing dmesg? Let's see what's happening when. Also, please post the result of "smartctl -a /dev/sda". Thanks. (In reply to comment #6) > Hello, > > ata1.00: exception Emask 0x2 SAct 0x3fffb SErr 0x0 action 0x2 frozen > ata1.00: spurious completions during NCQ issue=0x0 SAct=0x3fffb > FIS=004040a1:00000004 > ata1.00: cmd 61/00:00:ea:78:a8/04:00:0b:00:00/40 tag 0 cdb 0x0 data 524288 > out > res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation) > ata1.00: status: { DRDY } > ata1.00: cmd 61/00:08:9a:e3:a8/04:00:0b:00:00/40 tag 1 cdb 0x0 data 524288 > out > res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation) > ata1.00: status: { DRDY } > ata1.00: cmd 61/18:18:7a:61:a7/00:00:0b:00:00/40 tag 3 cdb 0x0 data 12288 out > res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation) > > This one we know and are trying to determine what's going on. Do you > happen to have smartd running? No. > ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > ata1.00: configured for UDMA/133 > ata1: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xb t4 > ata1: irq_stat 0x00400040, connection status changed > ata1: SError: { PHYRdyChg DevExch } > > This one very likely indicates hardware problem (e.g. dying power > supply or air conditioner kicking in). Can you please turn on > CONFIG_PRINTK_TIME and report the failing dmesg? Let's see what's > happening when. Also, please post the result of "smartctl -a > /dev/sda". I'm weary when I'm told things are 'hardware' problems, though I'm sure they do exist, I was told that before, went out bought a new mobo and a year later found out it was a kernel bug :). I will turn on CONFIG_PRINTK_TIME and dump the dmesg here next time it occurs. Also, will attach smartctl-a output now. > Thanks. Thanks for your time. Created attachment 13578 [details]
smartctl -a
(In reply to comment #7) > (In reply to comment #6) > > This one we know and are trying to determine what's going on. Do you > > happen to have smartd running? > > No. Hmmm... > > ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > > ata1.00: configured for UDMA/133 > > ata1: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xb t4 > > ata1: irq_stat 0x00400040, connection status changed > > ata1: SError: { PHYRdyChg DevExch } > > > > This one very likely indicates hardware problem (e.g. dying power > > supply or air conditioner kicking in). Can you please turn on > > CONFIG_PRINTK_TIME and report the failing dmesg? Let's see what's > > happening when. Also, please post the result of "smartctl -a > > /dev/sda". > > I'm weary when I'm told things are 'hardware' problems, though I'm sure they > do > exist, I was told that before, went out bought a new mobo and a year later > found out it was a kernel bug :). I will turn on CONFIG_PRINTK_TIME and dump > the dmesg here next time it occurs. Also, will attach smartctl-a output now. PHY ready changed and device exchanged bits are in SError register which indicates the state of the SATA link as seen from the host side. PHY ready changed means link on/offline status has changed and device exchanged means link came online with new device. If this happens while the drive is connected, it means link went offline briefly and came back up. Really, not much software involved there and symptoms like this are pretty typical. The most likely cause is unstable power. SATA links are very susceptible to power fluctuation and interferences. If your power supply isn't stable enough for whatever reason, SATA links will be the first ones to fail. The best way to verify this is to connect harddrives to a separate power supply. Buy a al-cheapo power supply, wire it such that it powers up without motherboard attached and connect only harddrives to it. Rather than fussing with hardware upgrades just to demonstrate a bug, you could instead just back up to an older kernel which you believe did not have this problem. Prove that it still works fine there, and you've eliminated hardware as a primary cause. Cheers (In reply to comment #10) > Rather than fussing with hardware upgrades just to demonstrate a bug, you > could > instead just back up to an older kernel which you believe did not have this > problem. Prove that it still works fine there, and you've eliminated > hardware > as a primary cause. Sounds like a plan, I'll keep the computer on this kernel until it happens again to give more info to Tejun as requested, then I'll roll back afterwards (if it reproduces) and again wait for it to reproduce. Thanks Created attachment 13582 [details]
New dmesg with times
Here's a new dmesg with times and this stalls things for small slices of time, very noticeable when watching video. I mentioned it before, but here are more of them:
atl1 0000:03:00.0: tx busy
If this is a separate issue that needs to be dealt with in a separate bug please inform me and I will file a new bug. Thanks again.
bugme-daemon@bugzilla.kernel.org wrote: > Created an attachment (id=13582) > --> (http://bugzilla.kernel.org/attachment.cgi?id=13582&action=view) > New dmesg with times > > Here's a new dmesg with times and this stalls things for small slices of > time, > very noticeable when watching video. I mentioned it before, but here are more > of them: I don't see any libata related messages in the log other than probing messages? > atl1 0000:03:00.0: tx busy > > If this is a separate issue that needs to be dealt with in a separate bug > please inform me and I will file a new bug. Thanks again. Yes, it's a separate issue. Created attachment 13596 [details]
dmesg as requested.
And the new dmesg as requested. This one has more status DRDY stuff I will revert back to 2.6.23 and I'll report go/nogo in a few days.
Ah.. okay, so those device exchanged exceptions occur after resuming from hibernation. That now makes much more sense. PHY status changed and device exchanged are kind of expected during resume w/ controller being reset and all. EH is supposed to keep quiet. I'll look into what's going on && there's nothing wrong with your setup. It's just libata whining a bit too much when it shouldn't. Thanks. Created attachment 13613 [details] bug13596-clear-ehi_serror.patch Does this patch fix the problem? OK, I am testing. No reason not to close this as resolved/fixed and I'll reopen it if it occurs again, otherwise I'll give the word in a few days that everything is good/not good. I'll leave it open till it's verified. Thanks. Created attachment 13666 [details]
dmesg-112107
New dmesg, after applying the earlier patch this one contains the DevExch happening many more times than before, but it's not noticable to the user now.
Now I'm confused. Do you have device attached to the first port of that ahci controller? The kernel isn't detecting any but it's raising a lot of phy status changed events. Can you please tell us hardware configuration and post untruncated dmesg? Thanks. Oh, I underst(In reply to comment #20) > Now I'm confused. Do you have device attached to the first port of that ahci > controller? The kernel isn't detecting any but it's raising a lot of phy > status changed events. Can you please tell us hardware configuration and > post > untruncated dmesg? > > Thanks. Oh, it may seem strange that I didn't understand the root of why this was happening but for a year or more (not necessarily with this motherboard), I have done this and it was not a problem with either the IDE drives (when I had all IDE drives) or the libata drivers. This computer has 5 hard drives, 4 of which I unplug the power, but the SATA cable stays plugged in merely for convenience when they're not needed (I backup once every few weeks from various computers). I guess even though not plugged in one or more of these drives communicate without hard drive power? Ah.. okay. So, those DevExch events are happening on drives which are powered off but still connected to the SATA port. The original report was about DevExch occurring during resume from hibernation, right? Can you please first verify that's gone with the patch applied? (In reply to comment #22) > Ah.. okay. So, those DevExch events are happening on drives which are > powered > off but still connected to the SATA port. > > The original report was about DevExch occurring during resume from > hibernation, > right? Can you please first verify that's gone with the patch applied? Sure, I will disconnect all sata cables that aren't actually running for now so I can stop the dmesg flood and I'll get back to testing that. Also- Do you want a separate bug for the other issue? Thanks I don't think we can do much about DevExch when power cable is unplugged. It's strictly hardware and I don't think controller guys test for such use cases. I recommend using "hdparm -Y /dev/sdX" instead of physically yanking the power cable out. Created attachment 13693 [details]
dmesg-112207
OK, I'm still getting this DevExch. ata1, it appears, was one of the ones that was connected without power, I've disconnected all those that had no power connected which left this computer with only 1 hard drive that is physically hooked up. Now I'm getting a constant loop of DevExch messages with
sbh@rocket /usr/src/linux $ git-rev-parse master
2ffbb8377c7a0713baf6644e285adc27a5654582
sbh@rocket /usr/src/linux $ git-diff
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 8189803..fd32c55 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3988,6 +3988,7 @@ void ata_std_postreset(struct ata_link *link, unsigned int *classes)
/* clear SError */
if (sata_scr_read(link, SCR_ERROR, &serror) == 0)
sata_scr_write(link, SCR_ERROR, serror);
+ link->eh_info.serror = 0;
/* is double-select really necessary? */
if (classes[0] != ATA_DEV_NONE)
Argh, Just figured out what ata7 is, it's a external drive, turned off but when on running running over esata. Obviously the power gets turned off on that drive but some kind of communication continues :( Okay, Hmmm.... It might be that you don't even need that patch. Can you please try without the patch? Well, I haven't experienced this in two days now, so basically what's being said here is that >2.6.23-rc1 leaving hard drive sata cables plugged in but not having the power plugged in, internal or external, is a no no? :) err, 2.6.24-rc1 I mean That probably doesn't have anything to do with kernel version. Can you please check again old kernel doesn't have this problem? It probably has more to do with hardware configuration change. And the no-no doesn't come from driver, the hardware just isn't designed with such use case in mind and it generates hotplugging event repeatedly. I dunno why it's like that. It could be the controller's fault or the drive's. Anyways, nothing in the spec mentions such setup, so nothing can be really out-of-spec. :-) I don't know where it came from but I was never able to make the old kernel have the problem. I've plugged the sata back in alone and I haven't gotten the problem yet. That's weird. Please confirm that old kernel is okay and then verify again that switching to new kernel triggers the problem. Thanks. Created attachment 13809 [details]
dmesg-113007
I have reconfirmed that this does not occur with 2.6.23.8 and it definitely does with the latest git master (2ffbb8377c7a0713baf6644e285adc27a5654582). Attaching the latest git master dmesg.
In the posted dmesg, the only error messages I can see are spurious completions. What we're trying to debug here is DevExch's, right? Oh, please excuse me, I'm thinking they're somehow related. I will try the latest kernel again. Ping? Yeah, it doesn't reproduce now :/ Not sure why or what, but it won't reproduce at all. Both are hooked in, on -rc4+. I think it may have been caused by one that was without power but the sata cable wasn't secure into the hard drive. Created attachment 13913 [details]
dmesg-120707
Needless to say, despite what I just said a few hours ago, it popped back up again.
This is from the harddisk which is connected but not powered and the kernel is -rc4, right? And how long did it take to reproduce the problem? (In reply to comment #39) > This is from the harddisk which is connected but not powered and the kernel > is > -rc4, right? That is correct, I left the hard drive connected but without the power connection, and the kernel was post -rc4 git. I'm not running it right now due to hibernate not working on a regular basis on that kernel. > And how long did it take to reproduce the problem? Well, I don't remember exactly, I had just got done rebooting it during this session if I remember correctly. The best reference I have for time are the kernel time marks there >> [ 72.892153] vboxdrv: Successfully done. [ 2111.219739] ata7: exception Emask 0x10 SAct 0x0 SErr 0x4000000 action 0xa frozen [ 2111.219744] ata7: irq_stat 0x00000040, connection status changed [ 2111.219747] ata7: SError: { DevExch } The problem here is that it still isn't clear whether the behavior was introduced by change in the driver code or is happening regardless of driver code changes. Please keep paying attention to kernel messages and verify that it really doesn't happen with previous kernels. Thanks. I'll be on 2.6.23.8 for at least another week, if I don't see it in that week, that's proof positive enough for me that it doesn't happen in this version. Yeap, I'll wait for your response. Thanks. I've been keeping a good eye out for this bug (checking dmesg at least daily) and it's pretty safe to say it doesn't reproduce on the 2.6.23.x. Thanks. That means we're doing something differently on idle PHY. I wonder whether it's related to recent PHY power save changes. Kristen, can you please have a look? Thanks. Kristen? |