Bug 9393

Summary: SError: { DevExch } occuring and causing disruption
Product: IO/Storage Reporter: Avuton Olrich (avuton)
Component: Serial ATAAssignee: Tejun Heo (htejun)
Status: REJECTED INSUFFICIENT_DATA    
Severity: normal CC: bunk, kristen.c.accardi
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: v2.6.24-rc2-g6e800af2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 9243    
Attachments: dmesg showing offenders
config The config used for this system.
lspci -vvv output
smartctl -a
New dmesg with times
dmesg as requested.
bug13596-clear-ehi_serror.patch
dmesg-112107
dmesg-112207
dmesg-113007
dmesg-120707

Description Avuton Olrich 2007-11-15 22:39:40 UTC
Most recent kernel where this bug did not occur: v2.6.23
Distribution: Gentoo

Since running in the latest kernels I have been getting the following, but not often. When it happens the computer grinds to a complete halt. Cursor will move a small small amount every 20 seconds or so, making things pretty painful until they complete.

v2.6.24-rc2-g6e800af2 was the first place I saw it, I didn't see it in v2.6.24-rc1, but I was only on it briefly due to unrelated issues.

This hasn't completely been a problem too often, it happens once every day, maybe two even, and I would bisect it but this occurs so infrequently I'm sure I wouldn't get this bug report out until v2.6.24 :).

Now that I'm looking over my dmesg I also notice:
atl1 0000:03:00.0: tx busy
which could also be the problem

Any way I can help please let me know.
Comment 1 Avuton Olrich 2007-11-15 22:40:19 UTC
Created attachment 13573 [details]
dmesg showing offenders
Comment 2 Avuton Olrich 2007-11-15 22:41:19 UTC
Created attachment 13574 [details]
config The config used for this system.
Comment 3 Avuton Olrich 2007-11-15 22:42:37 UTC
Created attachment 13575 [details]
lspci -vvv output
Comment 4 Avuton Olrich 2007-11-15 22:46:18 UTC
A few additional things, it usually takes about 1 1/2 minutes to reset(?) (appx). 
Comment 5 Anonymous Emailer 2007-11-15 23:14:09 UTC
Reply-To: akpm@linux-foundation.org


(switched to email - please reply via emailed reply-to-all, not via the
bugzilla web interface)

On Thu, 15 Nov 2007 22:39:41 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9393
> 
>            Summary: SError: { DevExch } occuring and causing disruption
>            Product: IO/Storage
>            Version: 2.5
>      KernelVersion: v2.6.24-rc2-g6e800af2
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Serial ATA
>         AssignedTo: jgarzik@pobox.com
>         ReportedBy: avuton@gmail.com
> 
> 
> Most recent kernel where this bug did not occur: v2.6.23

Rafael, one for the regression list please.

> Distribution: Gentoo
> 
> Since running in the latest kernels I have been getting the following, but
> not
> often. When it happens the computer grinds to a complete halt. Cursor will
> move
> a small small amount every 20 seconds or so, making things pretty painful
> until
> they complete.
> 
> v2.6.24-rc2-g6e800af2 was the first place I saw it, I didn't see it in
> v2.6.24-rc1, but I was only on it briefly due to unrelated issues.
> 
> This hasn't completely been a problem too often, it happens once every day,
> maybe two even, and I would bisect it but this occurs so infrequently I'm
> sure
> I wouldn't get this bug report out until v2.6.24 :).
> 
> Now that I'm looking over my dmesg I also notice:
> atl1 0000:03:00.0: tx busy
> which could also be the problem
> 
> Any way I can help please let me know.
> 

Yes, that's an ATA problem.

(there is dmesg output and a few other things at the bugzilla URL).
Comment 6 Tejun Heo 2007-11-16 01:18:39 UTC
Hello,

ata1.00: exception Emask 0x2 SAct 0x3fffb SErr 0x0 action 0x2 frozen
ata1.00: spurious completions during NCQ issue=0x0 SAct=0x3fffb FIS=004040a1:00000004
ata1.00: cmd 61/00:00:ea:78:a8/04:00:0b:00:00/40 tag 0 cdb 0x0 data 524288 out
         res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
ata1.00: status: { DRDY }
ata1.00: cmd 61/00:08:9a:e3:a8/04:00:0b:00:00/40 tag 1 cdb 0x0 data 524288 out
         res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
ata1.00: status: { DRDY }
ata1.00: cmd 61/18:18:7a:61:a7/00:00:0b:00:00/40 tag 3 cdb 0x0 data 12288 out
         res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation)

This one we know and are trying to determine what's going on.  Do you
happen to have smartd running?

ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xb t4
ata1: irq_stat 0x00400040, connection status changed
ata1: SError: { PHYRdyChg DevExch }

This one very likely indicates hardware problem (e.g. dying power
supply or air conditioner kicking in).  Can you please turn on
CONFIG_PRINTK_TIME and report the failing dmesg?  Let's see what's
happening when.  Also, please post the result of "smartctl -a
/dev/sda".

Thanks.
Comment 7 Avuton Olrich 2007-11-16 04:45:29 UTC
(In reply to comment #6)
> Hello,
> 
> ata1.00: exception Emask 0x2 SAct 0x3fffb SErr 0x0 action 0x2 frozen
> ata1.00: spurious completions during NCQ issue=0x0 SAct=0x3fffb
> FIS=004040a1:00000004
> ata1.00: cmd 61/00:00:ea:78:a8/04:00:0b:00:00/40 tag 0 cdb 0x0 data 524288
> out
>          res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
> ata1.00: status: { DRDY }
> ata1.00: cmd 61/00:08:9a:e3:a8/04:00:0b:00:00/40 tag 1 cdb 0x0 data 524288
> out
>          res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
> ata1.00: status: { DRDY }
> ata1.00: cmd 61/18:18:7a:61:a7/00:00:0b:00:00/40 tag 3 cdb 0x0 data 12288 out
>          res 40/00:28:ca:62:a7/00:00:0b:00:00/40 Emask 0x2 (HSM violation)
> 
> This one we know and are trying to determine what's going on.  Do you
> happen to have smartd running?

No.

> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata1.00: configured for UDMA/133
> ata1: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xb t4
> ata1: irq_stat 0x00400040, connection status changed
> ata1: SError: { PHYRdyChg DevExch }
> 
> This one very likely indicates hardware problem (e.g. dying power
> supply or air conditioner kicking in).  Can you please turn on
> CONFIG_PRINTK_TIME and report the failing dmesg?  Let's see what's
> happening when.  Also, please post the result of "smartctl -a
> /dev/sda".

I'm weary when I'm told things are 'hardware' problems, though I'm sure they do exist, I was told that before, went out bought a new mobo and a year later found out it was a kernel bug :). I will turn on CONFIG_PRINTK_TIME and dump the dmesg here next time it occurs. Also, will attach smartctl-a output now.

> Thanks.

Thanks for your time.
Comment 8 Avuton Olrich 2007-11-16 04:45:58 UTC
Created attachment 13578 [details]
smartctl -a
Comment 9 Tejun Heo 2007-11-16 07:13:57 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > This one we know and are trying to determine what's going on.  Do you
> > happen to have smartd running?
> 
> No.

Hmmm...

> > ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > ata1.00: configured for UDMA/133
> > ata1: exception Emask 0x10 SAct 0x0 SErr 0x4010000 action 0xb t4
> > ata1: irq_stat 0x00400040, connection status changed
> > ata1: SError: { PHYRdyChg DevExch }
> > 
> > This one very likely indicates hardware problem (e.g. dying power
> > supply or air conditioner kicking in).  Can you please turn on
> > CONFIG_PRINTK_TIME and report the failing dmesg?  Let's see what's
> > happening when.  Also, please post the result of "smartctl -a
> > /dev/sda".
> 
> I'm weary when I'm told things are 'hardware' problems, though I'm sure they
> do
> exist, I was told that before, went out bought a new mobo and a year later
> found out it was a kernel bug :). I will turn on CONFIG_PRINTK_TIME and dump
> the dmesg here next time it occurs. Also, will attach smartctl-a output now.

PHY ready changed and device exchanged bits are in SError register which indicates the state of the SATA link as seen from the host side.  PHY ready changed means link on/offline status has changed and device exchanged means link came online with new device.  If this happens while the drive is connected, it means link went offline briefly and came back up.  Really, not much software involved there and symptoms like this are pretty typical.

The most likely cause is unstable power.  SATA links are very susceptible to power fluctuation and interferences.  If your power supply isn't stable enough for whatever reason, SATA links will be the first ones to fail.  The best way to verify this is to connect harddrives to a separate power supply.  Buy a al-cheapo power supply, wire it such that it powers up without motherboard attached and connect only harddrives to it.
Comment 10 Mark Lord 2007-11-16 07:30:58 UTC
Rather than fussing with hardware upgrades just to demonstrate a bug, you could instead just back up to an older kernel which you believe did not have this problem.  Prove that it still works fine there, and you've eliminated hardware as a primary cause.

Cheers
Comment 11 Avuton Olrich 2007-11-16 07:37:57 UTC
(In reply to comment #10)
> Rather than fussing with hardware upgrades just to demonstrate a bug, you
> could
> instead just back up to an older kernel which you believe did not have this
> problem.  Prove that it still works fine there, and you've eliminated
> hardware
> as a primary cause.

Sounds like a plan, I'll keep the computer on this kernel until it happens again to give more info to Tejun as requested, then I'll roll back afterwards (if it reproduces) and again wait for it to reproduce.

Thanks
Comment 12 Avuton Olrich 2007-11-16 15:33:36 UTC
Created attachment 13582 [details]
New dmesg with times

Here's a new dmesg with times and this stalls things for small slices of time, very noticeable when watching video. I mentioned it before, but here are more of them:

atl1 0000:03:00.0: tx busy

If this is a separate issue that needs to be dealt with in a separate bug please inform me and I will file a new bug. Thanks again.
Comment 13 Tejun Heo 2007-11-16 17:02:08 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> Created an attachment (id=13582)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=13582&action=view)
> New dmesg with times
> 
> Here's a new dmesg with times and this stalls things for small slices of
> time,
> very noticeable when watching video. I mentioned it before, but here are more
> of them:

I don't see any libata related messages in the log other than probing
messages?

> atl1 0000:03:00.0: tx busy
> 
> If this is a separate issue that needs to be dealt with in a separate bug
> please inform me and I will file a new bug. Thanks again.

Yes, it's a separate issue.
Comment 14 Avuton Olrich 2007-11-17 21:37:20 UTC
Created attachment 13596 [details]
dmesg as requested.

And the new dmesg as requested. This one has more status DRDY stuff I will revert back to 2.6.23 and I'll report go/nogo in a few days.
Comment 15 Tejun Heo 2007-11-18 16:22:11 UTC
Ah.. okay, so those device exchanged exceptions occur after resuming from hibernation.  That now makes much more sense.  PHY status changed and device exchanged are kind of expected during resume w/ controller being reset and all.  EH is supposed to keep quiet.  I'll look into what's going on && there's nothing wrong with your setup.  It's just libata whining a bit too much when it shouldn't.

Thanks.
Comment 16 Tejun Heo 2007-11-19 07:14:09 UTC
Created attachment 13613 [details]
bug13596-clear-ehi_serror.patch

Does this patch fix the problem?
Comment 17 Avuton Olrich 2007-11-19 17:33:53 UTC
OK, I am testing. No reason not to close this as resolved/fixed and I'll reopen it if it occurs again, otherwise I'll give the word in a few days that everything is good/not good.
Comment 18 Tejun Heo 2007-11-19 17:39:57 UTC
I'll leave it open till it's verified.  Thanks.
Comment 19 Avuton Olrich 2007-11-21 03:04:17 UTC
Created attachment 13666 [details]
dmesg-112107

New dmesg, after applying the earlier patch this one contains the DevExch happening many more times than before, but it's not noticable to the user now.
Comment 20 Tejun Heo 2007-11-21 16:58:08 UTC
Now I'm confused.  Do you have device attached to the first port of that ahci controller?  The kernel isn't detecting any but it's raising a lot of phy status changed events.  Can you please tell us hardware configuration and post untruncated dmesg?

Thanks.
Comment 21 Avuton Olrich 2007-11-21 18:37:02 UTC
Oh, I underst(In reply to comment #20)
> Now I'm confused.  Do you have device attached to the first port of that ahci
> controller?  The kernel isn't detecting any but it's raising a lot of phy
> status changed events.  Can you please tell us hardware configuration and
> post
> untruncated dmesg?
> 
> Thanks.

Oh, it may seem strange that I didn't understand the root of why this was happening but for a year or more (not necessarily with this motherboard), I have done this and it was not a problem with either the IDE drives (when I had all IDE drives) or the libata drivers.

This computer has 5 hard drives, 4 of which I unplug the power, but the SATA cable stays plugged in merely for convenience when they're not needed (I backup once every few weeks from various computers). I guess even though not plugged in one or more of these drives communicate without hard drive power?
Comment 22 Tejun Heo 2007-11-21 18:42:31 UTC
Ah.. okay.  So, those DevExch events are happening on drives which are powered off but still connected to the SATA port.

The original report was about DevExch occurring during resume from hibernation, right?  Can you please first verify that's gone with the patch applied?
Comment 23 Avuton Olrich 2007-11-21 18:44:46 UTC
(In reply to comment #22)
> Ah.. okay.  So, those DevExch events are happening on drives which are
> powered
> off but still connected to the SATA port.
> 
> The original report was about DevExch occurring during resume from
> hibernation,
> right?  Can you please first verify that's gone with the patch applied?

Sure, I will disconnect all sata cables that aren't actually running for now so I can stop the dmesg flood and I'll get back to testing that. Also- Do you want a separate bug for the other issue?

Thanks
Comment 24 Tejun Heo 2007-11-21 18:51:37 UTC
I don't think we can do much about DevExch when power cable is unplugged.  It's strictly hardware and I don't think controller guys test for such use cases.  I recommend using "hdparm -Y /dev/sdX" instead of physically yanking the power cable out.
Comment 25 Avuton Olrich 2007-11-22 04:00:12 UTC
Created attachment 13693 [details]
dmesg-112207

OK, I'm still getting this DevExch. ata1, it appears, was one of the ones that was connected without power, I've disconnected all those that had no power connected which left this computer with only 1 hard drive that is physically hooked up. Now I'm getting a constant loop of DevExch messages with 

sbh@rocket /usr/src/linux $ git-rev-parse master
2ffbb8377c7a0713baf6644e285adc27a5654582
sbh@rocket /usr/src/linux $ git-diff
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 8189803..fd32c55 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3988,6 +3988,7 @@ void ata_std_postreset(struct ata_link *link, unsigned int *classes)
        /* clear SError */
        if (sata_scr_read(link, SCR_ERROR, &serror) == 0)
                sata_scr_write(link, SCR_ERROR, serror);
+       link->eh_info.serror = 0;

        /* is double-select really necessary? */
        if (classes[0] != ATA_DEV_NONE)
Comment 26 Avuton Olrich 2007-11-22 08:04:24 UTC
Argh, Just figured out what ata7 is, it's a external drive, turned off but when on running running over esata. Obviously the power gets turned off on that drive but some kind of communication continues :(
Comment 27 Tejun Heo 2007-11-22 17:12:53 UTC
Okay, Hmmm.... It might be that you don't even need that patch.  Can you please try without the patch?
Comment 28 Avuton Olrich 2007-11-24 07:59:42 UTC
Well, I haven't experienced this in two days now, so basically what's being said here is that >2.6.23-rc1 leaving hard drive sata cables plugged in but not having the power plugged in, internal or external, is a no no? :)
Comment 29 Avuton Olrich 2007-11-24 08:24:35 UTC
err, 2.6.24-rc1 I mean
Comment 30 Tejun Heo 2007-11-24 19:13:24 UTC
That probably doesn't have anything to do with kernel version.  Can you please check again old kernel doesn't have this problem?  It probably has more to do with hardware configuration change.  And the no-no doesn't come from driver, the hardware just isn't designed with such use case in mind and it generates hotplugging event repeatedly.  I dunno why it's like that.  It could be the controller's fault or the drive's.  Anyways, nothing in the spec mentions such setup, so nothing can be really out-of-spec.  :-)
Comment 31 Avuton Olrich 2007-11-28 05:51:09 UTC
I don't know where it came from but I was never able to make the old kernel have the problem. I've plugged the sata back in alone and I haven't gotten the problem yet.
Comment 32 Tejun Heo 2007-11-28 05:54:40 UTC
That's weird.  Please confirm that old kernel is okay and then verify again that switching to new kernel triggers the problem.  Thanks.
Comment 33 Avuton Olrich 2007-11-30 16:39:28 UTC
Created attachment 13809 [details]
dmesg-113007

I have reconfirmed that this does not occur with 2.6.23.8 and it definitely does with the latest git master (2ffbb8377c7a0713baf6644e285adc27a5654582). Attaching the latest git master dmesg.
Comment 34 Tejun Heo 2007-12-03 18:56:36 UTC
In the posted dmesg, the only error messages I can see are spurious completions.  What we're trying to debug here is DevExch's, right?
Comment 35 Avuton Olrich 2007-12-03 19:28:10 UTC
Oh, please excuse me, I'm thinking they're somehow related. I will try the latest kernel again.
Comment 36 Rafael J. Wysocki 2007-12-07 17:29:57 UTC
Ping?
Comment 37 Avuton Olrich 2007-12-07 20:39:20 UTC
Yeah, it doesn't reproduce now :/ Not sure why or what, but it won't reproduce at all. Both are hooked in, on -rc4+. I think it may have been caused by one that was without power but the sata cable wasn't secure into the hard drive.
Comment 38 Avuton Olrich 2007-12-07 22:50:42 UTC
Created attachment 13913 [details]
dmesg-120707

Needless to say, despite what I just said a few hours ago, it popped back up again.
Comment 39 Tejun Heo 2007-12-08 21:57:42 UTC
This is from the harddisk which is connected but not powered and the kernel is -rc4, right?  And how long did it take to reproduce the problem?
Comment 40 Avuton Olrich 2007-12-08 22:30:07 UTC
(In reply to comment #39)
> This is from the harddisk which is connected but not powered and the kernel
> is
> -rc4, right?  

That is correct, I left the hard drive connected but without the power connection, and the kernel was post -rc4 git. I'm not running it right now due to hibernate not working on a regular basis on that kernel.

> And how long did it take to reproduce the problem?

Well, I don't remember exactly, I had just got done rebooting it during this session if I remember correctly. The best reference I have for time are the kernel time marks there >>

[   72.892153] vboxdrv: Successfully done.
[ 2111.219739] ata7: exception Emask 0x10 SAct 0x0 SErr 0x4000000 action 0xa frozen
[ 2111.219744] ata7: irq_stat 0x00000040, connection status changed
[ 2111.219747] ata7: SError: { DevExch }
Comment 41 Tejun Heo 2007-12-08 22:41:27 UTC
The problem here is that it still isn't clear whether the behavior was introduced by change in the driver code or is happening regardless of driver code changes.  Please keep paying attention to kernel messages and verify that it really doesn't happen with previous kernels.  Thanks.
Comment 42 Avuton Olrich 2007-12-09 07:18:12 UTC
I'll be on 2.6.23.8 for at least another week, if I don't see it in that week, that's proof positive enough for me that it doesn't happen in this version.
Comment 43 Tejun Heo 2007-12-09 07:25:33 UTC
Yeap, I'll wait for your response.  Thanks.
Comment 44 Avuton Olrich 2007-12-16 16:26:10 UTC
I've been keeping a good eye out for this bug (checking dmesg at least daily) and it's pretty safe to say it doesn't reproduce on the 2.6.23.x.
Comment 45 Tejun Heo 2007-12-17 18:52:45 UTC
Thanks.  That means we're doing something differently on idle PHY.  I wonder whether it's related to recent PHY power save changes.  Kristen, can you please have a look?  Thanks.
Comment 46 Rafael J. Wysocki 2008-01-01 12:16:45 UTC
Kristen?