Bug 14583

Summary: SSD system stall, HSM Violation
Product: IO/Storage
Reporter: Andrew Simpson (andrewnz.simpson)
Component: IDE
Assignee: io_ide (io_ide)
Status: RESOLVED WILL_NOT_FIX
Severity: normal
CC: alan, alan, andrewsquire+kernel, hancockrwd, horacioh, jvdneste, kay, mzxreary, scott, slesru, tj, zeuthen
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.31
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: dmesg
kernel log
lspci -nn
dmesg with libata.dma=0
dmesg with libata.force=udma/33
dmesg with 2.6.28
dmesg with irqpoll
hdparm -I /dev/sda
dmesg with 2.6.32

Description Andrew Simpson 2009-11-11 06:23:58 UTC
Created attachment 23737 [details]
dmesg

Bug occurring in Ubuntu 9.10 with SSD units, mainly reported in Acer Aspire One, EEE and Dell Mini 9.  Occurs always on boot and randomly during use.

Mainly reported by users that have physically upgraded the SSD unit to a newer/larger unit.  Also reported by some users with original factory SSD. 

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/445852


[  112.816122] ata2: lost interrupt (Status 0x58)
[  112.820088] ata2: drained 2048 bytes to clear DRQ.
[  112.824067] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  112.824084] ata2.00: BMDMA stat 0x4
[  112.824117] ata2.00: cmd c8/00:08:87:d4:8e/00:00:00:00:00/e1 tag 0 dma 4096 in
[  112.824123]          res 58/00:08:87:d4:8e/00:00:00:00:00/e1 Emask 0x2 (HSM violation)
[  112.824138] ata2.00: status: { DRDY DRQ }
[  112.824200] ata2: soft resetting link
[  112.992538] ata2.00: configured for UDMA/100
[  112.992581] ata2: EH complete

The bug is typified by command C8h with response 58h as above.
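
For anyone reading the logs, here is a minimal sketch of how the status byte 0x58 breaks down into individual bits. The bit names are the conventional ATA status register names; the helper `decode_ata_status` is made up for illustration and is not part of any kernel or tool.

```python
# Conventional names for the bits of the ATA status register.
# Bits 0x10, 0x04 and 0x02 are legacy names and are obsolete or
# command-specific in later ATA revisions.
ATA_STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORR",
    0x02: "IDX",
    0x01: "ERR",
}


def decode_ata_status(status):
    """Return the names of the bits set in an ATA status byte, highest bit first."""
    return [
        name
        for bit, name in sorted(ATA_STATUS_BITS.items(), reverse=True)
        if status & bit
    ]


print(decode_ata_status(0x58))  # ['DRDY', 'DSC', 'DRQ']
```

The kernel's own summary line shows only { DRDY DRQ } for the same byte, so it evidently does not print every bit that is set.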

No reports of problem with Ubuntu 9.04 (2.6.28) on same hardware.

I have found that booting with libata.dma=0 forces mode PIO4 and the bug does not show.  Booting with libata.force=udma/33 reduces bus speed, but bug still occurs.
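
As a rough illustration of the two boot options, the sketch below pulls `libata.*` options out of a kernel command line and decodes the `libata.dma` bitmask. The bit meanings (1 = ATA disks, 2 = ATAPI, 4 = CompactFlash) follow my reading of the kernel parameter documentation; the function names and the sample command line are hypothetical.

```python
# Bit meanings of libata.dma as I understand them from the kernel
# parameter documentation; treat this table as an assumption.
LIBATA_DMA_BITS = {1: "ATA disks", 2: "ATAPI", 4: "CompactFlash"}


def parse_libata_options(cmdline):
    """Collect libata.* options from a kernel command line string."""
    options = {}
    for token in cmdline.split():
        if token.startswith("libata.") and "=" in token:
            key, value = token.split("=", 1)
            options[key] = value
    return options


def dma_enabled_for(mask):
    """List the device classes for which DMA stays enabled under this mask."""
    return [name for bit, name in LIBATA_DMA_BITS.items() if mask & bit]


# Hypothetical command line; on a running system it could be read
# from /proc/cmdline instead.
sample = "ro quiet splash libata.dma=0"
opts = parse_libata_options(sample)
print(opts)
print(dma_enabled_for(int(opts["libata.dma"])))
```

With `libata.dma=0` the list comes back empty, which matches the observation that every device falls back to PIO.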

The attached files are from booting LiveCD (on a USB stick):

dmesg.txt (dmesg output)
kern.log
lspci.txt (lspci -nn)

dmesg_libata.dma-0.txt (dmesg with libata.dma=0 )
dmesg_libata.force-udma33.txt (dmesg with libata.force=udma/33)

The following file is from booting Ubuntu 9.04 on the SSD unit:

dmesg_2.6.28-kernel.txt
Comment 1 Andrew Simpson 2009-11-11 06:24:50 UTC
Created attachment 23738 [details]
kernel log
Comment 2 Andrew Simpson 2009-11-11 06:25:27 UTC
Created attachment 23739 [details]
lspci -nn
Comment 3 Andrew Simpson 2009-11-11 06:26:22 UTC
Created attachment 23740 [details]
dmesg with libata.dma=0
Comment 4 Andrew Simpson 2009-11-11 06:27:21 UTC
Created attachment 23741 [details]
dmesg with libata.force=udma/33
Comment 5 Andrew Simpson 2009-11-11 06:28:02 UTC
Created attachment 23742 [details]
dmesg with 2.6.28
Comment 6 Tejun Heo 2009-11-11 08:59:33 UTC
Hmmm... it's a CF device which forgets to send interrupt from time to time.  Does "irqpoll" help?
Comment 7 Andrew Simpson 2009-11-12 04:46:40 UTC
Thanks. "irqpoll" seems to make little difference.

dmesg is attached...
Comment 8 Andrew Simpson 2009-11-12 04:47:25 UTC
Created attachment 23756 [details]
dmesg with irqpoll
Comment 9 Tejun Heo 2009-11-12 04:59:15 UTC
But the system doesn't stall for 30secs, does it?
Comment 10 Andrew Simpson 2009-11-12 05:03:58 UTC
>But the system doesn't stall for 30secs, does it?

Umm, yes, you're right on that.  I also tried a 'torture test' of gparted and whereas previously it would be tied up in knots for a long time, it did work promptly (with the same errors in the log).
Comment 11 Tejun Heo 2009-11-12 20:11:02 UTC
It looks like the device is buggy and fails from time to time without raising interrupt.  libata waits till the timeout comes whenever that happens and thus the 30sec long stutters.  This is basically caused by crappy hardware.  The only workaround is to poll for the failure as irqpoll does.  For now, I think you'll have to live with irqpoll.  In the long run, we'll probably need to add automatic polling execution for the hardware.  Can you please attach the output of "hdparm -I /dev/sda"?

Thanks.
Comment 12 Andrew Simpson 2009-11-13 05:58:32 UTC
O.K., hdparm output attached below.

Is there any reason that the SSD should work fine with 2.6.28 (Ubuntu 9.04), but not with the later 2.6.31?

Thanks.
Comment 13 Andrew Simpson 2009-11-13 05:59:33 UTC
Created attachment 23762 [details]
hdparm -I /dev/sda
Comment 14 Tejun Heo 2009-11-18 05:24:08 UTC
Hmmm... I can't think of any reason why 2.6.31 would behave worse than 2.6.28.  Are you sure that the error doesn't happen with 2.6.28?  Can you please check one more time (no need to install whole distro, just installing the old kernel should be enough)?

Cc'ing Alan.  Alan, a CF device is failing without triggering IRQ thus causing timeouts.  Reportedly, this behavior doesn't happen with 2.6.28 but does with 2.6.31.  Any ideas?

Thanks.
Comment 15 Andrew Simpson 2009-11-18 06:48:59 UTC
Ummm... a case of bad timing.  I just spent the last hour wiping the Ubuntu 9.04 install and putting on Mandriva 2010.0

Yes, I have previously inspected Ubuntu 9.04 very carefully - and more than once.  Also others in the referenced Ubuntu bug report are finding the same as me.  I will do a LiveCD (on usb stick) and confirm again for you shortly.

BTW - there is a 2.6.28 boot dmesg that I attached previously in the attachments. 

For what it's worth:  Mandriva 2010.0 (2.6.31) shows no sign of this bug (yet).  Fedora 12 Beta (again LiveCD on USB stick) showed the same bug in dmesg (I didn't investigate any further than this).  No bug reports in Fedora bugzilla that I could find.
Comment 16 Andrew Simpson 2009-11-18 07:59:07 UTC
O.K. Just confirming: Running 2.6.28 (Ubuntu 9.04 LiveCD on USB stick) gives no bug messages in dmesg.  Running gparted (which generates plenty of messages in 2.6.31) and checking dmesg shows no change.

Happy to provide more on request.

Thanks.
Comment 17 slesru 2009-11-25 10:28:22 UTC
I'd like to add that this problem exists on hdd drive, ubuntu 9.10, dell 120l :

[ 6702.000206] ata1: lost interrupt (Status 0x58)
[ 6702.004018] ata1: drained 32768 bytes to clear DRQ.
[ 6702.093725] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 6702.093739] ata1.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
[ 6702.093740] cdb 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 6702.093742] res 58/00:01:00:00:00/00:00:00:00:00/b0 Emask 0x2 (HSM violation)
[ 6702.093746] ata1.01: status: { DRDY DRQ }
[ 6702.093859] ata1: soft resetting link
[ 6702.340796] ata1.00: configured for UDMA/100
[ 6702.372409] ata1.01: configured for UDMA/33
[ 6702.391202] ata1: EH complete
Comment 18 Alan 2009-11-25 10:57:42 UTC
No obvious thought. The interesting thing to me is that the newer kernels double check for a lost IRQ but are not seeing one in this case. Some of the SSD devices are very fast responding and I wonder if in fact somewhere between 2.6.28-31 we've introduced a case where we can clear the IRQ in error if it is raised extremely fast ?
Comment 19 Alan 2009-11-25 10:59:04 UTC
And trying 2.6.29 and 2.6.30 might be useful just to pin down the problem more closely.

The Mandriva/Fedora behaviour difference is odd but probably important.
Comment 20 Andrew Simpson 2009-11-26 04:23:55 UTC
The comment on SSD device speed being part of the bug is interesting:

1. Looking through the bug reports on Ubuntu, most of the problems are with the supposedly 'faster' devices.  The slower devices - such as older factory fitted devices - don't seem to figure.

2. I have an early Aspire One that is factory fitted with an Intel 8Gb SSD.  This is not a fast SSD unit, and I believe that Acer later discontinued using the Intel in favour of faster SSD units.  This machine has been 2.6.31 (Ubuntu 9.10) since the Beta, and without any problems.
Comment 21 Andrew Simpson 2009-11-26 04:33:48 UTC
I will ask on the Ubuntu bug report whether anyone has tried 2.6.29 and/or 2.6.30.

I have already been comparing Mandriva and Ubuntu kernels.  The patches that Mandriva apply to the 2.6.31 kernel source are, I believe, here:

http://svn.mandriva.com/svn/packages/cooker/kernel/current/PATCHES/

I have also diff-ed the Ubuntu and Mandriva kernel configs for comparison.  I have yet to see any 'interesting differences'.
Comment 22 Alan Pope 2009-12-06 23:18:13 UTC
I have commented on the Ubuntu bug, but thought I'd leave a comment here too.

I've got two Eee 900's and the bug exhibits itself on both. I have gone back and forth from Jaunty (kernel 2.6.28) and Karmic (kernel 2.6.31). This kind of error message appears only under 2.6.31 and 2.6.32. I don't get it on 2.6.28.

[ 113.816054] ata2: lost interrupt (Status 0x58)
[ 113.820008] ata2: drained 8192 bytes to clear DRQ.
[ 113.835302] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 113.835310] ata2.00: BMDMA stat 0x64
[ 113.835317] ata2.00: failed command: READ DMA
[ 113.835332] ata2.00: cmd c8/00:20:a7:8f:09/00:00:00:00:00/e0 tag 0 dma 16384 in
[ 113.835335] res 58/00:20:a7:8f:09/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 113.835343] ata2.00: status: { DRDY DRQ }
[ 113.835393] ata2: soft resetting link
Comment 23 Alan Pope 2009-12-06 23:18:55 UTC
Created attachment 24064 [details]
dmesg with 2.6.32
Comment 24 Andrew Simpson 2009-12-08 06:27:11 UTC
Comment from Ubuntu bug report:

"I can confirm kernel 2.6.30 works fine with my netbook (Acer Aspire One with 8 GB SSD) using a Jaunty userland. I am running 2.6.30-02063009-generic.
I am somewhat reluctant to try a karmic kernel since once I started getting the errors after installing karmic to get rid of them in any distro/kernel combination I had to write all zeros to the SSD. If it's truly necessary I will try."

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/445852 with comment #66

Other comments show that the bug is still present in 2.6.32.
Comment 25 Robert Hancock 2009-12-09 02:40:49 UTC
It looks like most of these bug reports indicate BMDMA status 0x04 or 0x64 when the timeout occurs, indicating that the controller should be asserting an interrupt and the BMDMA transfer is not or is no longer active. But the device registers indicate DRQ is asserted, indicating the device wants to transfer more data. It looks like this is why libata reports an HSM violation.
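
To make the register values concrete, here is a small sketch that decodes the BMDMA status byte and checks for the combination described above (interrupt flagged, transfer not active, DRQ still set). The bit layout is the standard bus-master IDE status register layout as I understand it; the names and the helper `matches_reported_pattern` are illustrative only and are not libata's actual check.

```python
# Standard bus-master IDE status register bits (assumed layout).
BMDMA_STATUS_BITS = {
    0x01: "ACTIVE",
    0x02: "ERROR",
    0x04: "INTR",
    0x20: "DRIVE0_DMA_CAPABLE",
    0x40: "DRIVE1_DMA_CAPABLE",
}

ATA_DRQ = 0x08  # DRQ bit of the ATA status register


def decode_bmdma_status(value):
    """Return the names of the bits set in a BMDMA status byte, lowest bit first."""
    return [name for bit, name in sorted(BMDMA_STATUS_BITS.items()) if value & bit]


def matches_reported_pattern(bmdma, ata_status):
    """True when the controller flags an interrupt, DMA is not active,
    and the device still asserts DRQ."""
    intr = bool(bmdma & 0x04)
    active = bool(bmdma & 0x01)
    drq = bool(ata_status & ATA_DRQ)
    return intr and not active and drq


for bmdma in (0x04, 0x64):
    print(hex(bmdma), decode_bmdma_status(bmdma), matches_reported_pattern(bmdma, 0x58))
```

Both 0x04 and 0x64 decode to the same state for this purpose; the extra bits in 0x64 only say which drives are DMA capable.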

The amount of data drained indicates that no data was actually transferred (I notice there's a bug that causes the amount of data reported as drained to be half the actual amount, I'll submit a patch for that). I don't see a reason for the device to generate an interrupt at this point. Thus I tend to suspect that the cause is a bug in the device.
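
The arithmetic can be checked against the two READ DMA reports in this bug. The sketch below assumes the `cmd` string is laid out as command/feature:nsect:lbal:lbam:lbah/(five high-order bytes)/device, which is my reading of the libata error format, and that sectors are 512 bytes. The helper names are hypothetical.

```python
def parse_taskfile(tf):
    """Split a libata 'cmd' string into named byte fields (assumed layout)."""
    cmd, low, _hob, device = tf.split("/")
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    return {
        "command": int(cmd, 16),
        "feature": feature,
        "nsect": nsect,
        "lbal": lbal,
        "lbam": lbam,
        "lbah": lbah,
        "device": int(device, 16),
    }


def lba28(fields):
    """Assemble the 28-bit LBA from the device and LBA bytes."""
    return (
        ((fields["device"] & 0x0F) << 24)
        | (fields["lbah"] << 16)
        | (fields["lbam"] << 8)
        | fields["lbal"]
    )


def requested_bytes(fields, sector_size=512):
    """Transfer size; for 28-bit commands a sector count of 0 means 256."""
    return (fields["nsect"] or 256) * sector_size


# (cmd string, "dma N in" size, "drained N bytes") from the description
# and from comment 22.
REPORTS = [
    ("c8/00:08:87:d4:8e/00:00:00:00:00/e1", 4096, 2048),
    ("c8/00:20:a7:8f:09/00:00:00:00:00/e0", 16384, 8192),
]

for tf, dma_bytes, drained in REPORTS:
    fields = parse_taskfile(tf)
    print(
        hex(fields["command"]),
        hex(lba28(fields)),
        requested_bytes(fields) == dma_bytes,
        2 * drained == dma_bytes,
    )
```

In both reports the sector count matches the logged DMA size, and twice the logged drain figure equals the full request, which is consistent with the whole transfer still being pending when the timeout hit.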

As far as why an interrupt is shown as generated but doesn't actually get received, I'm not sure. Maybe the controller doesn't propagate the interrupt if the state machine doesn't say it should be getting one? DRQ + interrupt asserted certainly doesn't seem like a combination that should happen.
Comment 26 Andrew Simpson 2009-12-09 06:29:53 UTC
Thanks. I'm not a kernel developer, but I get the general gist of that.  However, there are several reasons that I don't think this is necessarily a bug in the SSD device.

Firstly, it affects a range of SSD devices from different manufacturers and different production dates.  I guess they could share common firmware (and hence the same bug), but that has to be a rather remote possibility.

Secondly, the SSD devices are performing without issue on 2.6.28 (and one report of 2.6.30), but the same machines fail on 2.6.31/2.6.32.  This suggests something changed in the kernel that affects the SSD between 2.6.30 and 2.6.31.

Thirdly, these machines were often sold with factory installed Linux.  O.K., so it was an earlier kernel, but the SSD has clearly worked well enough to meet the manufacturer's satisfaction.

Fourthly, (and this is where I show I'm not a developer), we are only seeing the end result of the transaction (the HSM violation).  Could there have been several erroneous - but not out-of-band - command/response exchanges leading up to the HSM violation?
Comment 27 Robert Hancock 2009-12-10 00:50:02 UTC
Well, it is quite possible that a number of SSD devices would have the same bug. There are only a few different manufacturers of the actual controller that talks on the ATA bus, regardless of who makes the flash chips or who prints their name on the device.

If 2.6.28 is indeed OK (not just that the problem happens less frequently) then likely the easiest way to track it down would be for someone to try bisecting. That's assuming the problem is caused by a single bad commit, though. If the problem isn't actually due to any change in behavior but just a difference in timing or something like that, then it's possible the bisect may not come up with anything useful. It'd be an important troubleshooting step nonetheless, however.

If there were any really "erroneous" exchanges before that we would have seen error reports about them, unless I misunderstand what you mean..
Comment 28 Andrew Simpson 2009-12-10 06:05:01 UTC
You could be right that all the devices have common firmware (and bug).  It just feels remote in this case.

More importantly, some interesting research on the Ubuntu bug report by a number of users:

Ubuntu 9.10 (Karmic) shows the bug with 2.6.28, 2.6.29, 2.6.31, 2.6.32.

Ubuntu 9.04 (Jaunty) shows NO bug with 2.6.28, 2.6.30, 2.6.31, 2.6.32

This has been confirmed by three different users.  Ext4 has been ruled out (same results with ext2).

Two questions:  Is this a bug outside the kernel space?  And how can a userspace bug affect the disk operations?
Comment 29 Robert Hancock 2009-12-10 14:38:15 UTC
A userspace bug shouldn't be able to affect disk operations at that level. Unless it is just some kind of timing difference that's triggering the error.

Assuming that pattern is indeed consistent, presumably those reports are mainly with Ubuntu-patched kernels.. are there any Ubuntu kernel patches in Karmic that wouldn't have been in Jaunty that could explain this?
Comment 30 Tejun Heo 2009-12-10 14:44:03 UTC
One thing which changed with recent distros is the use of devicekit storage stuff which seems to issue slightly different SMART commands than smartd, so there's a slight chance that it might be causing some problem.  It would be interesting to try new kernel on older distro and see whether the problem goes away.

Thanks.
Comment 31 Robert Hancock 2009-12-10 23:55:03 UTC
Yes, that could potentially be a factor. Could also try disabling that service (however exactly one does this) on the newer distro and see if that helps..
Comment 32 Johan Van den Neste 2009-12-14 16:51:20 UTC
Comment #28 points out that indeed the bug does not show on older distros with a new kernel. The devicekit suggestion looks to be spot on, indicated by comments 107 and 108 in the ubuntu bug tracker (referenced here in comment #24).
Comment 33 Scott James Remnant 2009-12-14 18:38:47 UTC
I've been tracking a bug we've linked to this on Ubuntu, the most recent replies received from affected users do indicate that it's the devkit-disks-probe-ata-smart command that triggers the HSM Violations and death of their SSD.
Comment 34 David Zeuthen 2009-12-14 20:09:58 UTC
(In reply to comment #33)
> I've been tracking a bug we've linked to this on Ubuntu, the most recent
> replies received from affected users do indicate that it's the
> devkit-disks-probe-ata-smart command that triggers the HSM Violations and
> death
> of their SSD.

This program is a user of libatasmart. You want to look at

 http://git.0pointer.de/?p=libatasmart.git;a=summary

Here's the libatasmart bug tracker

 http://bugs.freedesktop.org/enter_bug.cgi?product=libatasmart
Comment 35 Bartlomiej Zolnierkiewicz 2009-12-15 00:34:20 UTC
On Monday 14 December 2009 09:10:00 pm bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14583
> --- Comment #34 from David Zeuthen <zeuthen@gmail.com>  2009-12-14 20:09:58
> ---
> (In reply to comment #33)
> > I've been tracking a bug we've linked to this on Ubuntu, the most recent
> > replies received from affected users do indicate that it's the
> > devkit-disks-probe-ata-smart command that triggers the HSM Violations and
> death
> > of their SSD.
> 
> This program is a user of libatasmart. You want to look at
> 
>  http://git.0pointer.de/?p=libatasmart.git;a=summary

Slightly off-topic but WTF is this?  NIH smartmontools implementation?

Oh, I see.. 

	http://0pointer.de/blog/projects/being-smart.html

Why, oh, why...  smartmontools were working just fine and could have
_easily_ been ported to C from C++ or enhanced into libsmartmon..

"Please note that I certainly don't plan to replace smartmontools.
 libatasmart will always implement only a subset of S.M.A.R.T. If you
 want the full set of functionality then please refer to smartmontools."

I somehow find it suspicious given how (only some) USB/ATA bridges
support were added later to *libata*smart project...

To make things worse Ubuntu is actually shipping it.  What a mess...

First my audio, now my disks.. :(
--
Bartlomiej Zolnierkiewicz
Comment 36 Tejun Heo 2009-12-15 00:58:06 UTC
cc'ing Kay Sievers as he knows one or two things about how these things are going.  Bartlomiej, yeah, it would have been nice if the code base were shared but well nobody did that and it's good to have desktop integration.  It's probably *lib*atasmart - it doesn't have anything to do with libata.  Anyways, deviation from the command sequence used by smartmontools seems to cause two problems till now.

* HSM violations on certain firmwares.

* Spinning up drives in standby mode (smartmontools uses CHK_POWER to check power state before issuing SMART commands).
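
A minimal sketch of the second point, assuming the usual CHECK POWER MODE (E5h) result values in the sector count field (0x00 standby, 0x80 idle, 0xFF active or idle). This is not the code of smartmontools or libatasmart; `should_query_smart` and its default policy are hypothetical, and actually issuing the command to a drive is left out.

```python
# Result values of CHECK POWER MODE (E5h), returned in the sector count field.
POWER_MODES = {
    0x00: "standby",
    0x80: "idle",
    0xFF: "active or idle",
}


def power_mode_name(value):
    """Map a CHECK POWER MODE result to a readable name."""
    return POWER_MODES.get(value, "unknown")


def should_query_smart(power_mode_value, skip_modes=("standby",)):
    """Decide whether to send SMART commands given the drive's power state.

    Hypothetical policy: skip the SMART query when the drive reports a mode
    in skip_modes, so that polling does not spin it up. Unknown values are
    treated as awake, which is an assumption of this sketch.
    """
    return power_mode_name(power_mode_value) not in skip_modes


for value in (0x00, 0x80, 0xFF):
    print(hex(value), power_mode_name(value), should_query_smart(value))
```

The point of checking first is that CHECK POWER MODE itself is not supposed to wake the drive, whereas a SMART read may.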

If someone is gonna report this to the upstream developer, please include both.

Thanks.
Comment 37 Kay Sievers 2009-12-15 11:27:48 UTC
On Tue, Dec 15, 2009 at 01:58,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14583

> It's probably *lib*atasmart - it doesn't have anything to do with libata. 
> Anyways,
> deviation from the command sequence used by smartmontools seems to cause two
> problems till now.
>
> * HSM violations on certain firmwares.
>
> * Spinning up drives in standby mode (smartmontools uses CHK_POWER to check
> power state before issuing SMART commands).
>
> If someone is gonna report this to the upstream developer, please include
> both.

Pinged Lennart with a pointer to this bug. Thanks!
Comment 38 Bartlomiej Zolnierkiewicz 2009-12-15 14:39:35 UTC
On Tuesday 15 December 2009 01:58:08 am bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14583
> --- Comment #36 from Tejun Heo <tj@kernel.org>  2009-12-15 00:58:06 ---
> cc'ing Kay Sievers as he knows one or two things about how these things are
> going.  Bartlomiej, yeah, it would have been nice if the code base were
> shared
> but well nobody did that and it's good to have desktop integration.  It's

After reading project's homepage it is clear that it was done in order
for desktop integration.  However I couldn't find any previous discussions
about the best way to solve the problem on the net and it just seems like
there wasn't even any attempt of involving smartmontools developers in the
loop (I didn't look very carefully though).

While this is not a lot of code it is a very sensitive area and bugs
involving S.M.A.R.T. support (or firmware issues in general) have been
always real hassle to debug and workaround/fix.

Sharing the low-level code base with smartmontools would be great as it is not
a Linux-only thing (which in this case helps a lot with additional testing
and verification on other platforms) and actually has people with a lot
of the needed ATA background/experience behind the project.
Comment 39 Scott James Remnant 2009-12-16 15:05:07 UTC
I have filed https://bugs.freedesktop.org/show_bug.cgi?id=25673 on libatasmart
Comment 40 Lennart Poettering 2009-12-18 08:22:37 UTC
(In reply to comment #35)

> Why, oh, why...  smartmontools were working just fine and could have
> _easily_ been ported to C from C++ or enhanced into libsmartmon..

libatasmart is basically a port from C++ to C. But it adds a couple of things and drops others. It's also kinda tiny, instead of the huge beast that smartmontools is.
 
> "Please note that I certainly don't plan to replace smartmontools.
>  libatasmart will always implement only a subset of S.M.A.R.T. If you
>  want the full set of functionality then please refer to smartmontools."
> 
> I somehow find it suspicious given how (only some) USB/ATA bridges
> support were added later to *libata*smart project...

Uh?

> First my audio, now my disks.. :(

Yes, I am out to eat your children the next time.
Comment 41 Lennart Poettering 2009-12-18 08:24:10 UTC
(In reply to comment #36)

> * Spinning up drives in standby mode (smartmontools uses CHK_POWER to check
> power state before issuing SMART commands).

We do that too.
Comment 42 Tejun Heo 2009-12-18 08:45:17 UTC
(In reply to comment #41)
> (In reply to comment #36)
> 
> > * Spinning up drives in standby mode (smartmontools uses CHK_POWER to check
> > power state before issuing SMART commands).
> 
> We do that too.

Strange.  If I put my WD green drives into powersave mode, the devkit smart polling always wakes it up.  On openSUSE 11.1, I had smartmontools running and it never happened.  I'll try to find out what the difference is.

Thanks.
Comment 43 Lennart Poettering 2009-12-18 09:28:06 UTC
(In reply to comment #42)
> (In reply to comment #41)
> > (In reply to comment #36)
> > 
> > > * Spinning up drives in standby mode (smartmontools uses CHK_POWER to
> check
> > > power state before issuing SMART commands).
> > 
> > We do that too.
> 
> Strange.  If I put my WD green drives into powersave mode, the devkit smart
> polling always wakes it up.  On openSUSE 11.1, I had smartmontools running
> and
> it never happened.  I'll try to find out what the difference is.

Which version of libatasmart is that? We had some changes recently (Oct) there in the initialization order, because some commands apparently could cause a wakeup on some drives which didn't cause one on others. You should have .17 for this to work flawlessly.
Comment 44 Tejun Heo 2009-12-18 09:42:36 UTC
It says libatasmart4-0.14-3.2.x86_64 which has been shipped with openSUSE 11.2.  So, too old by three minor versions.  I'll try newer one.

BTW, I just tried to reproduce the problem and it's a bit strange.  If I put the hard drive into explicit standby using 'hdparm -y', it stays suspended but if the standby timer puts the drive into standby mode, it gets woken up by devkit smart refresh.

Anyways, will try newer one.

Thanks.
Comment 45 Andrew Simpson 2010-02-03 06:29:28 UTC
Closing bug report, since it's not actually kernel related.  Thanks all.

Open bug report for this problem is on freedesktop.org bugzilla:

 https://bugs.freedesktop.org/show_bug.cgi?id=25673