Bug 14583
Summary: | SSD system stall, HSM Violation | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Andrew Simpson (andrewnz.simpson) |
Component: | IDE | Assignee: | io_ide (io_ide) |
Status: | RESOLVED WILL_NOT_FIX | ||
Severity: | normal | CC: | alan, alan, andrewsquire+kernel, hancockrwd, horacioh, jvdneste, kay, mzxreary, scott, slesru, tj, zeuthen |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.31 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | dmesg, kernel log, lspci -nn, dmesg with libata.dma=0, dmesg with libata.force=udma/33, dmesg with 2.6.28, dmesg with irqpoll, hdparm -I /dev/sda, dmesg with 2.6.32 | ||
Description
Andrew Simpson
2009-11-11 06:23:58 UTC
Created attachment 23738 [details]
kernel log
Created attachment 23739 [details]
lspci -nn
Created attachment 23740 [details]
dmesg with libata.dma=0
Created attachment 23741 [details]
dmesg with libata.force=udma/33
Created attachment 23742 [details]
dmesg with 2.6.28
Hmmm... it's a CF device which forgets to send interrupt from time to time. Does "irqpoll" help? Thanks.

"irqpoll" seems to make little difference. dmesg is attached...

Created attachment 23756 [details]
dmesg with irqpoll
But the system doesn't stall for 30secs, does it?

> But the system doesn't stall for 30secs, does it?
Umm, yes, you're right on that. I also tried a 'torture test' of gparted and whereas previously it would be tied up in knots for a long time, it did work promptly (with the same errors in the log).
It looks like the device is buggy and fails from time to time without raising an interrupt. Whenever that happens, libata waits until the timeout expires, hence the 30sec long stutters. This is basically caused by crappy hardware. The only workaround is to poll for the failure as irqpoll does. For now, I think you'll have to live with irqpoll. In the long run, we'll probably need to add automatic polling execution for the hardware. Can you please attach the output of "hdparm -I /dev/sda"? Thanks.

O.K., hdparm output attached below. Is there any reason that the SSD should work fine with 2.6.28 (Ubuntu 9.04), but not with the later 2.6.31? Thanks.

Created attachment 23762 [details]
hdparm -I /dev/sda
Hmmm... I can't think of any reason why 2.6.31 would behave worse than 2.6.28. Are you sure that the error doesn't happen with 2.6.28? Can you please check one more time (no need to install whole distro, just installing the old kernel should be enough)?

Cc'ing Alan. Alan, a CF device is failing without triggering IRQ thus causing timeouts. Reportedly, this behavior doesn't happen with 2.6.28 but does with 2.6.31. Any ideas? Thanks.

Ummm... a case of bad timing. I just spent the last hour wiping the Ubuntu 9.04 install and putting on Mandriva 2010.0 Yes, I have previously inspected Ubuntu 9.04 very carefully - and more than once. Also others in the referenced Ubuntu bug report are finding the same as me. I will do a LiveCD (on usb stick) and confirm again for you shortly. BTW - there is a 2.6.28 boot dmesg that I attached previously in the attachments.

For what it's worth: Mandriva 2010.0 (2.6.31) shows no sign of this bug (yet). Fedora 12 Beta (again LiveCD on USB stick) showed the same bug in dmesg (I didn't investigate any further than this). No bug reports in Fedora bugzilla that I could find.

O.K. Just confirming: Running 2.6.28 (Ubuntu 9.04 LiveCD on USB stick) gives no bug messages in dmesg. Running gparted (which generates plenty of messages in 2.6.31) and checking dmesg shows no change. Happy to provide more on request. Thanks.

I'd like to add that this problem exists on hdd drive, ubuntu 9.10, dell 120l :

[ 6702.000206] ata1: lost interrupt (Status 0x58)
[ 6702.004018] ata1: drained 32768 bytes to clear DRQ.
[ 6702.093725] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 6702.093739] ata1.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
[ 6702.093740] cdb 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 6702.093742] res 58/00:01:00:00:00/00:00:00:00:00/b0 Emask 0x2 (HSM violation)
[ 6702.093746] ata1.01: status: { DRDY DRQ }
[ 6702.093859] ata1: soft resetting link
[ 6702.340796] ata1.00: configured for UDMA/100
[ 6702.372409] ata1.01: configured for UDMA/33
[ 6702.391202] ata1: EH complete

No obvious thought. The interesting thing to me is that the newer kernels double check for a lost IRQ but are not seeing one in this case. Some of the SSD devices are very fast responding and I wonder if in fact somewhere between 2.6.28-31 we've introduced a case where we can clear the IRQ in error if it is raised extremely fast ? And trying 2.6.29 and 2.6.30 might be useful just to pin down the problem more closely. The Mandriva/Fedora behaviour difference is odd but probably important.

The comment on SSD device speed being part of the bug is interesting:

1. Looking through the bug reports on Ubuntu, most of the problems are with the supposedly 'faster' devices. The slower devices - such as older factory fitted devices - don't seem to figure.
2. I have an early Aspire One that is factory fitted with an Intel 8Gb SSD. This is not a fast SSD unit, and I believe that Acer later discontinued using the Intel in favour of faster SSD units. This machine has been 2.6.31 (Ubuntu 9.10) since the Beta, and without any problems.

I will ask on the Ubuntu bug report whether anyone has tried 2.6.29 and/or 2.6.30. I have already been comparing Mandriva and Ubuntu kernels. The patches that Mandriva apply to the 2.6.31 kernel source are, I believe, here: http://svn.mandriva.com/svn/packages/cooker/kernel/current/PATCHES/ I have also diff-ed the Ubuntu and Mandriva kernel configs for comparison. I have yet to see any 'interesting differences'.
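When comparing dmesg captures across kernels and distros, as discussed above, it can help to count the relevant libata messages instead of eyeballing them. The following is a minimal sketch, not a tool used by anyone in this report: the function name `summarize_ata_errors` and the counter keys are made up for illustration, and the patterns are modelled only on the message formats quoted in this thread, so they may miss variants printed by other kernel versions.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in this report (assumption:
# other kernel versions may word these lines differently).
LOST_IRQ = re.compile(r"ata\d+: lost interrupt \(Status 0x[0-9a-fA-F]+\)")
HSM_VIOLATION = re.compile(r"\(HSM violation\)")
SOFT_RESET = re.compile(r"ata\d+: soft resetting link")


def summarize_ata_errors(log_text):
    """Count the libata error messages of interest in a dmesg capture."""
    counts = Counter()
    for line in log_text.splitlines():
        if LOST_IRQ.search(line):
            counts["lost_interrupt"] += 1
        if HSM_VIOLATION.search(line):
            counts["hsm_violation"] += 1
        if SOFT_RESET.search(line):
            counts["soft_reset"] += 1
    return counts


# A few lines taken from the excerpt quoted in this report.
SAMPLE = """\
[ 6702.000206] ata1: lost interrupt (Status 0x58)
[ 6702.004018] ata1: drained 32768 bytes to clear DRQ.
[ 6702.093742] res 58/00:01:00:00:00/00:00:00:00:00/b0 Emask 0x2 (HSM violation)
[ 6702.093859] ata1: soft resetting link
[ 6702.391202] ata1: EH complete
"""

print(dict(summarize_ata_errors(SAMPLE)))
```

Running it over one capture per kernel (for example the 2.6.28 and 2.6.31 attachments) gives comparable counts, though a count of zero only shows that the messages did not appear during that particular boot.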
I have commented on the Ubuntu bug, but thought I'd leave a comment here too. I've got two Eee 900's and the bug exhibits itself on both. I have gone back and forth from Jaunty (kernel 2.6.28) and Karmic (kernel 2.6.31). This kind of error message appears only under 2.6.31 and 2.6.32. I don't get it on 2.6.28.

[ 113.816054] ata2: lost interrupt (Status 0x58)
[ 113.820008] ata2: drained 8192 bytes to clear DRQ.
[ 113.835302] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 113.835310] ata2.00: BMDMA stat 0x64
[ 113.835317] ata2.00: failed command: READ DMA
[ 113.835332] ata2.00: cmd c8/00:20:a7:8f:09/00:00:00:00:00/e0 tag 0 dma 16384 in
[ 113.835335] res 58/00:20:a7:8f:09/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 113.835343] ata2.00: status: { DRDY DRQ }
[ 113.835393] ata2: soft resetting link

Created attachment 24064 [details]
dmesg with 2.6.32
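The status bytes in the excerpts above (`Status 0x58`, `BMDMA stat 0x64`) are bit fields. The sketch below decodes them, under two assumptions that are not stated in this report: the ATA status table lists only the bits that libata names in its `status: { ... }` summary line (other bits set in 0x58 are ignored here), and the BMDMA table follows the usual bus-master IDE status layout (bit 0 active, bit 1 error, bit 2 interrupt, bits 5 and 6 per-drive DMA capable). The helper names are hypothetical.

```python
# ATA status register bits that libata names in its "status: { ... }" line.
# Assumption: any other bits in the byte are deliberately not decoded here.
ATA_STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x08, "DRQ"),
    (0x01, "ERR"),
]

# Bus-master IDE (BMDMA) status register bits, assuming the usual layout.
BMDMA_STATUS_BITS = [
    (0x01, "ACTIVE"),
    (0x02, "ERROR"),
    (0x04, "INTR"),
    (0x20, "DRV0_DMA_CAPABLE"),
    (0x40, "DRV1_DMA_CAPABLE"),
]


def decode_bits(value, table):
    """Return the names of the bits in `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


def matches_reported_pattern(ata_status, bmdma_status):
    """True when BMDMA flags an interrupt with the DMA engine idle
    while the device still asserts DRQ."""
    intr = bool(bmdma_status & 0x04)
    active = bool(bmdma_status & 0x01)
    drq = bool(ata_status & 0x08)
    return intr and not active and drq


print(decode_bits(0x58, ATA_STATUS_BITS))
print(decode_bits(0x64, BMDMA_STATUS_BITS))
print(matches_reported_pattern(0x58, 0x64))
```

The sketch only decodes the bytes. Whether that combination counts as an HSM violation is libata's judgment, not something this code determines.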
Comment from Ubuntu bug report:

"I can confirm kernel 2.6.30 works fine with my netbook (Acer Aspire One with 8 GB SSD) using a Jaunty userland. I am running 2.6.30-02063009-generic. I am somewhat reluctant to try a karmic kernel since once I started getting the errors after installing karmic to get rid of them in any distro/kernel combination I had to write all zeros to the SSD. If it's truly necessary I will try."

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/445852 with comment #66

Other comments show that the bug is still present in 2.6.32.

It looks like most of these bug reports indicate BMDMA status 0x04 or 0x64 when the timeout occurs, indicating that the controller should be asserting an interrupt and the BMDMA transfer is not or is no longer active. But the device registers indicate DRQ is asserted, indicating the device wants to transfer more data. It looks like this is why libata reports an HSM violation. The amount of data drained indicates that no data was actually transferred (I notice there's a bug that causes the amount of data reported as drained to be half the actual amount, I'll submit a patch for that).

I don't see a reason for the device to generate an interrupt at this point. Thus I tend to suspect that the cause is a bug in the device. As far as why an interrupt is shown as generated but doesn't actually get received, I'm not sure. Maybe the controller doesn't propagate the interrupt if the state machine doesn't say it should be getting one? DRQ + interrupt asserted certainly doesn't seem like a combination that should happen.

Thanks. I'm not a kernel developer, but I get the general gist of that. However, there are several reasons that I don't think this is necessarily a bug in the SSD device.

Firstly, it affects a range of SSD devices from different manufacturers and different production dates. I guess they could share common firmware (and hence the same bug), but that has to be a rather remote possibility.
Secondly, the SSD devices are performing without issue on 2.6.28 (and one report of 2.6.30), but the same machines fail on 2.6.31/2.6.32. This suggests something changed in the kernel that affects the SSD between 2.6.30 and 2.6.31.

Thirdly, these machines were often sold with factory installed Linux. O.K., so it was an earlier kernel, but the SSD has clearly worked well enough to meet the manufacturer's satisfaction.

Fourthly, (and this is where I show I'm not a developer), we are only seeing the end result of the transaction (the HSM violation). Could there have been several erroneous - but not out-of-band - command/response exchanges leading up to the HSM violation?

Well, it is quite possible that a number of SSD devices would have the same bug. There are only a few different manufacturers of the actual controller that talks on the ATA bus, regardless of who makes the flash chips or who prints their name on the device.

If 2.6.28 is indeed OK (not just that the problem happens less frequently) then likely the easiest way to track it down would be for someone to try bisecting. That's assuming the problem is caused by a single bad commit, though. If the problem isn't actually due to any change in behavior but just a difference in timing or something like that, then it's possible the bisect may not come up with anything useful. It'd be an important troubleshooting step nonetheless, however.

If there were any really "erroneous" exchanges before that we would have seen error reports about them, unless I misunderstand what you mean..

You could be right that all the devices have common firmware (and bug). It just feels remote in this case. More importantly, some interesting research on the Ubuntu bug report by a number of users:

Ubuntu 9.10 (Karmic) shows the bug with 2.6.28, 2.6.29, 2.6.31, 2.6.32.
Ubuntu 9.04 (Jaunty) shows NO bug with 2.6.28, 2.6.30, 2.6.31, 2.6.32.

This has been confirmed by three different users. Ext4 has been ruled out (same results with ext2).
Two questions: Is this a bug outside the kernel space? And how can a userspace bug affect the disk operations?

A userspace bug shouldn't be able to affect disk operations at that level. Unless it is just some kind of timing difference that's triggering the error. Assuming that pattern is indeed consistent, presumably those reports are mainly with Ubuntu-patched kernels.. are there any Ubuntu kernel patches in Karmic that wouldn't have been in Jaunty that could explain this?

One thing which changed with recent distros is the use of devicekit storage stuff which seems to issue slightly different SMART commands than smartd, so there's a slight chance that it might be causing some problem. It would be interesting to try new kernel on older distro and see whether the problem goes away. Thanks.

Yes, that could potentially be a factor. Could also try disabling that service (however exactly one does this) on the newer distro and see if that helps..

Comment #28 points out that indeed the bug does not show on older distros with a new kernel. The devicekit suggestion looks to be spot on, indicated by comments 107 and 108 in the ubuntu bug tracker (referenced here in comment #24).

I've been tracking a bug we've linked to this on Ubuntu, the most recent replies received from affected users do indicate that it's the devkit-disks-probe-ata-smart command that triggers the HSM Violations and death of their SSD.

(In reply to comment #33)
> I've been tracking a bug we've linked to this on Ubuntu, the most recent
> replies received from affected users do indicate that it's the
> devkit-disks-probe-ata-smart command that triggers the HSM Violations and
> death of their SSD.

This program is a user of libatasmart.
You want to look at http://git.0pointer.de/?p=libatasmart.git;a=summary

Here's the libatasmart bug tracker http://bugs.freedesktop.org/enter_bug.cgi?product=libatasmart

On Monday 14 December 2009 09:10:00 pm bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14583
>
> David Zeuthen <zeuthen@gmail.com> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |zeuthen@gmail.com
>
> --- Comment #34 from David Zeuthen <zeuthen@gmail.com> 2009-12-14 20:09:58 ---
> (In reply to comment #33)
> > I've been tracking a bug we've linked to this on Ubuntu, the most recent
> > replies received from affected users do indicate that it's the
> > devkit-disks-probe-ata-smart command that triggers the HSM Violations and
> > death of their SSD.
>
> This program is a user of libatasmart. You want to look at
>
> http://git.0pointer.de/?p=libatasmart.git;a=summary

Slightly off-topic but WTF is this? NIH smartmontools implementation?

Oh, I see..

http://0pointer.de/blog/projects/being-smart.html

Why, oh, why... smartmontools were working just fine and could have _easily_ been ported to C from C++ or enhanced into libsmartmon..

"Please note that I certainly don't plan to replace smartmontools. libatasmart will always implement only a subset of S.M.A.R.T. If you want the full set of functionality then please refer to smartmontools."

I somehow find it suspicious given how (only some) USB/ATA bridges support were added later to *libata*smart project...

To make things worse Ubuntu is actually shipping it. What a mess...

First my audio, now my disks.. :(

-- Bartlomiej Zolnierkiewicz

cc'ing Kay Sievers as he knows one or two things about how these things are going. Bartlomiej, yeah, it would have been nice if the code base were shared but well nobody did that and it's good to have desktop integration. It's probably *lib*atasmart - it doesn't have anything to do with libata.
Anyways, deviation from the command sequence used by smartmontools seems to cause two problems till now.

* HSM violations on certain firmwares.
* Spinning up drives in standby mode (smartmontools uses CHK_POWER to check power state before issuing SMART commands).

If someone is gonna report this to the upstream developer, please include both. Thanks.

On Tue, Dec 15, 2009 at 01:58, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14583
> It's probably *lib*atasmart - it doesn't have anything to do with libata.
> Anyways, deviation from the command sequence used by smartmontools seems to
> cause two problems till now.
>
> * HSM violations on certain firmwares.
>
> * Spinning up drives in standby mode (smartmontools uses CHK_POWER to check
> power state before issuing SMART commands).
>
> If someone is gonna report this to the upstream developer, please include
> both.

Pinged Lennart with a pointer to this bug. Thanks!

On Tuesday 15 December 2009 01:58:08 am bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14583
>
> Tejun Heo <tj@kernel.org> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |kay.sievers@vrfy.org
>
> --- Comment #36 from Tejun Heo <tj@kernel.org> 2009-12-15 00:58:06 ---
> cc'ing Kay Sievers as he knows one or two things about how these things are
> going. Bartlomiej, yeah, it would have been nice if the code base were
> shared but well nobody did that and it's good to have desktop integration. It's

After reading project's homepage it is clear that it was done in order for desktop integration. However I couldn't find any previous discussions about the best way to solve the problem on the net and it just seems like there wasn't even any attempt of involving smartmontools developers in the loop (I didn't look very carefully though).
While this is not a lot of code it is a very sensitive area and bugs involving S.M.A.R.T. support (or firmware issues in general) have always been a real hassle to debug and work around or fix. Sharing the low-level code base with smartmontools would be great, as it is not a Linux-only thing (which in this case helps a lot with additional testing and verification on other platforms) and it actually has people with a lot of the needed ATA background and experience behind the project.

I have filed https://bugs.freedesktop.org/show_bug.cgi?id=25673 on libatasmart

(In reply to comment #35)
> Why, oh, why... smartmontools were working just fine and could have
> _easily_ been ported to C from C++ or enhanced into libsmartmon..

libatasmart is basically a port from C++ to C. But it adds a couple of things and drops others. It's also kinda tiny, instead of the huge beast that smartmontools is.

> "Please note that I certainly don't plan to replace smartmontools.
> libatasmart will always implement only a subset of S.M.A.R.T. If you
> want the full set of functionality then please refer to smartmontools."
>
> I somehow find it suspicious given how (only some) USB/ATA bridges
> support were added later to *libata*smart project...

Uh?

> First my audio, now my disks.. :(

Yes, I am out to eat your children the next time.

(In reply to comment #36)
> * Spinning up drives in standby mode (smartmontools uses CHK_POWER to check
> power state before issuing SMART commands).

We do that too.

(In reply to comment #41)
> (In reply to comment #36)
>
> > * Spinning up drives in standby mode (smartmontools uses CHK_POWER to check
> > power state before issuing SMART commands).
>
> We do that too.

Strange. If I put my WD green drives into powersave mode, the devkit smart polling always wakes it up. On openSUSE 11.1, I had smartmontools running and it never happened. I'll try to find out what the difference is. Thanks.
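The power-state check discussed above can be tried from userspace with smartmontools itself. This is a minimal sketch, assuming `smartctl` is installed and that its `-n standby` option (skip the query when the drive reports standby or a lower power mode) behaves as documented in its man page. The wrapper names `smart_health_command` and `query_health` are hypothetical, and this is not how libatasmart or devkit-disks issue their commands.

```python
import subprocess


def smart_health_command(device, skip_if="standby"):
    """Build a smartctl call that checks the power mode first.

    `-n standby` asks smartctl not to query a drive that is in standby,
    so a sleeping disk should not be spun up just for a health poll.
    """
    return ["smartctl", "-n", skip_if, "-H", device]


def query_health(device):
    """Run the command; needs smartmontools and usually root privileges."""
    return subprocess.run(
        smart_health_command(device),
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes for many conditions
    )
```

Calling `query_health("/dev/sda")` after the standby timer has put the drive to sleep, and again after `hdparm -y`, would be one way to compare the two cases described above against the devkit polling behaviour.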
(In reply to comment #42)
> (In reply to comment #41)
> > (In reply to comment #36)
> >
> > > * Spinning up drives in standby mode (smartmontools uses CHK_POWER to check
> > > power state before issuing SMART commands).
> >
> > We do that too.
>
> Strange. If I put my WD green drives into powersave mode, the devkit smart
> polling always wakes it up. On openSUSE 11.1, I had smartmontools running and
> it never happened. I'll try to find out what the difference is.

Which version of libatasmart is that? We had some changes recently (Oct) there in the initialization order, because some commands apparently could cause a wakeup on some drives which didn't cause one on others. You should have .17 for this to work flawlessly.

It says libatasmart4-0.14-3.2.x86_64 which has been shipped with openSUSE 11.2. So, too old by three minor versions. I'll try newer one. BTW, I just tried to reproduce the problem and it's a bit strange. If I put the hard drive into explicit standby using 'hdparm -y', it stays suspended but if the standby timer puts the drive into standby mode, it gets woken up by devkit smart refresh. Anyways, will try newer one. Thanks.

Closing bug report, since it's not actually kernel related. Thanks all.

Open bug report for this problem is on freedesktop.org bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=25673