Bug 13416 - SATA hiccup during suspend/resume on Samsung NC10
SATA hiccup during suspend/resume on Samsung NC10
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
All Linux
: P1 normal
Assigned To: Tejun Heo
: 14928 (view as bug list)
Depends on:
  Show dependency treegraph
Reported: 2009-05-31 15:35 UTC by Thilo-Alexander Ginkel
Modified: 2010-01-02 13:02 UTC (History)
10 users (show)

See Also:
Kernel Version: 2.6.28-11-generic
Tree: Mainline
Regression: No

dmesg output (39.40 KB, text/plain)
2009-05-31 15:46 UTC, Thilo-Alexander Ginkel
lspci -vvv output (17.08 KB, text/plain)
2009-05-31 15:47 UTC, Thilo-Alexander Ginkel
ata_piix-pcs-hardreset.patch (2.48 KB, patch)
2009-05-31 23:40 UTC, Tejun Heo
Details | Diff
kern.log (91.17 KB, text/plain)
2009-06-01 15:34 UTC, Thilo-Alexander Ginkel
kern.log (w/o patch) (89.86 KB, text/plain)
2009-06-02 06:54 UTC, Thilo-Alexander Ginkel
ata_piix-sig_mode-on-suspend.patch (4.43 KB, patch)
2009-06-03 03:09 UTC, Tejun Heo
Details | Diff
kern.log (w/ ata_piix-sig_mode-on-suspend.patch) (68.73 KB, text/plain)
2009-06-03 19:44 UTC, Thilo-Alexander Ginkel
dmesg output with kernel Linux 2.6.27-14 (8.50 KB, application/octet-stream)
2009-06-16 09:07 UTC, Nico Isenbeck
Ubuntu 2.6.27-14 suspend failure output (799 bytes, text/plain)
2009-06-17 20:13 UTC, Alec Wood
dmesg output for SATA trouble during regular bootup (42.19 KB, text/plain)
2009-07-12 22:11 UTC, Thilo-Alexander Ginkel
dmesg output w/ libata.force=udma33 enabled (59.81 KB, text/plain)
2009-08-17 20:48 UTC, Thilo-Alexander Ginkel
dmesg output w/ libata.force=udma33 enabled (88.91 KB, text/plain)
2009-08-19 07:03 UTC, Thilo-Alexander Ginkel
dmidecode output (8.83 KB, text/plain)
2009-08-19 07:34 UTC, Thilo-Alexander Ginkel
Screenshot of the PM settings dialog w/ the changed setting highlighted (154.97 KB, image/png)
2009-08-19 18:44 UTC, Thilo-Alexander Ginkel
ata-monitor-cmds.patch (1.27 KB, patch)
2009-08-27 07:19 UTC, Tejun Heo
Details | Diff
dmesg output w/ cmd tracing enabled (41.67 KB, text/plain)
2009-08-28 06:40 UTC, Thilo-Alexander Ginkel
filter-out-apm.patch (644 bytes, patch)
2009-08-28 07:24 UTC, Tejun Heo
Details | Diff
filter-out-apm-1.patch (586 bytes, patch)
2009-09-03 06:51 UTC, Tejun Heo
Details | Diff
kernel.log with libata.force=1.5Gbps (44.92 KB, application/octet-stream)
2009-10-14 12:56 UTC, Hans Werner
kernel.log with libata.force=1.5Gbps and monitor_cmds=1 (45.92 KB, text/plain)
2009-10-14 20:07 UTC, Hans Werner
kernel.log with libata.acpi_gtf_filter=0x1f (46.89 KB, text/plain)
2009-10-17 03:15 UTC, Hans Werner
acpidump for Samsung N140 BIOS 04CU .tar.bz2 (60.34 KB, application/octet-stream)
2009-10-19 23:27 UTC, Hans Werner
kernel.log.bz2 with ACPI debug_layer=0xFFFFFFFF debug_level=0x1F (21.82 KB, application/octet-stream)
2009-10-19 23:33 UTC, Hans Werner
kernel.log.bz2 with ACPI debug_layer=0xFFFFFFFF debug_level=0x21F (139.21 KB, application/octet-stream)
2009-10-19 23:41 UTC, Hans Werner
requested logs (104.54 KB, application/octet-stream)
2009-10-26 18:00 UTC, Jan

Description Thilo-Alexander Ginkel 2009-05-31 15:35:40 UTC
When using the suspend/resume feature (this applies to both STR and Hibernate) on my Samsung NC10 I have a 1:5 chance that the machine does not properly resume when woken up again. Once in a while any disk I/O times out after resume rendering the system unusable until it is power-cycled. While the error happens, the HDD LED stays on all the time.

The resulting error messages read:
[<timestamp>] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[<timestamp>] ata1.00: cmd ca/00:58:87:c1:63/00:00:00:00:00/e9 tag 0 dma 45056 out
[<timestamp>]          res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[<timestamp>] ata1: link is slow to respond, please be patent (ready=0)
[<timestamp>] ata1: device not ready (errno=-16), forcing hardreset
[<timestamp>] ata1: soft resetting link
[<timestamp>] ata1: link is slow to respond, please be patent (ready=0)
[<timestamp>] ata1: SRST failed (errno=-16)

And after some time:

[<timestamp>] end_request: I/O error, dev sda, sector <sector#>

An in-depth analysis by Manoj Iyer and Tejun Heo has been posted downstream at: https://bugs.launchpad.net/ubuntu/jaunty/+source/linux/+bug/340014/comments/11
Comment 1 Thilo-Alexander Ginkel 2009-05-31 15:46:47 UTC
Created attachment 21664 [details]
dmesg output
Comment 2 Thilo-Alexander Ginkel 2009-05-31 15:47:45 UTC
Created attachment 21665 [details]
lspci -vvv output
Comment 3 Tejun Heo 2009-05-31 23:40:01 UTC
Created attachment 21673 [details]

Can you please try this patch?  Thanks.
Comment 4 Thilo-Alexander Ginkel 2009-06-01 10:19:27 UTC
I tried the patch, but unfortunately, the I/O freezes still happen. The error message has slightly changed, though:

[<timestamp>] ata2: XXX: reset complete, classes=9,0
[<timestamp>] ata1: XXX: reset complete, classes=1,9
[<timestamp>] ata1.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out
[<timestamp>] ata1.00: ACPI cmd ef/03:45:00:00:00:a0 filtered out
[<timestamp>] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[<timestamp>] ata1.00: cmd ca/00:08:4f:ca:fd/00:00:00:00:00/e5 tag 0 dma 4096 out
[<timestamp>]          res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[<timestamp>] ata1.00: status: { DRDY }
[<timestamp>] ata1: hard resetting link
[<timestamp>] ata1: XXX: doing PCS hardreset, pcsmask=0x5
[<timestamp>] ata1: link is slow to respond, please be patent (ready=0)
[<timestamp>] ata1: PCS hardreset failed (errno=-16)
Comment 5 Tejun Heo 2009-06-01 11:37:52 UTC
Can you please attach full kernel log including boot messages?
Comment 6 Thilo-Alexander Ginkel 2009-06-01 15:34:09 UTC
Created attachment 21690 [details]

Sure. The log is truncated at the end, but the first couple of failed hard reset attempts are included.
Comment 7 Tejun Heo 2009-06-01 22:40:07 UTC
Can you please post full log w/o the patch?  The interesting thing is that during the last resume, the device was resumed correctly (ata1.00: configured for UDMA/133) but timeout occurred when an actual IO was issued to it.  After that, the device seemed to have completely shut down.  Hmmm... maybe we need to play with SIG_MODE.  Anyways, please post full log without the patch.  Thanks.
Comment 8 Thilo-Alexander Ginkel 2009-06-02 06:54:16 UTC
Created attachment 21702 [details]
kern.log (w/o patch)

Please find attached a kern.log created without the patch applied.

An observation I made after using Magic SysReq to reboot the machine is that the HDD (or controller) seems to lock up in such a way that even the BIOS is unable to recover it when performing a soft reboot (i.e., the BIOS hangs during POST). Only power-cycling helps. This is the case for the patched and the unpatched kernel version.
Comment 9 Tejun Heo 2009-06-03 03:09:15 UTC
Created attachment 21721 [details]

Can you please this patch and post kernel log?
Comment 10 Alan 2009-06-03 15:04:35 UTC
Can't duplicate this behaviour on my NC10
Comment 11 Thilo-Alexander Ginkel 2009-06-03 15:22:14 UTC
Alan, which HDD model is built into your NC10?
Comment 12 Alan 2009-06-03 16:23:23 UTC
Comment 13 Thilo-Alexander Ginkel 2009-06-03 19:44:04 UTC
Created attachment 21733 [details]
kern.log (w/ ata_piix-sig_mode-on-suspend.patch)

It took a little bit longer than expected, but finally I have been able to reproduce the issue with the newest patch set applied. Please find attached the resulting kernel log.

@Nico: Which HDD model is your NC10 based upon?
@Alan: I assume you are running a more recent kernel than 2.6.28, right?
Comment 14 Nico Isenbeck 2009-06-03 20:54:41 UTC
hdparm -i /dev/sda
 Model=SAMSUNG HM160HI , FwRev=HH100-06

uname -a
Linux nico-netbook 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009 i686 GNU/Linux
Comment 15 Alan 2009-06-03 21:04:08 UTC
Mine is usually running 2.6.27 but they are different drives and that may be relevant.
Comment 16 Tejun Heo 2009-06-10 07:31:25 UTC
Does 'irqpoll' make any difference?
Comment 17 Thilo-Alexander Ginkel 2009-06-10 15:14:05 UTC
Unfortunately, the problem still occurs with irqpoll enabled.
Comment 18 Alec Wood 2009-06-14 18:50:22 UTC
I can confirm the same bug on the Samsung N110 -- same hard drive and kernel as Nico above.
Comment 19 Thilo-Alexander Ginkel 2009-06-16 07:04:40 UTC
What I'd like to find out is whether the issue is new with 2.6.28 (as Alan reported he did not observe the issue with his NC10 running 2.6.27) or whether the different hard drive models are causing the different behavior (Alan's NC10 is using a WD drive whereas all others have Samsung HDDs).

I already tried booting a 2.6.27 kernel, which however panicked due to an issue with dmcrypt. Could somebody else (who also has a Samsung HDD in his NC10) try if the problem is reproducible with a 2.6.27 kernel? A recent 2.6.27 kernel for Ubuntu is available at: http://security.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-2.6.27-14-generic_2.6.27-14.34_i386.deb
Comment 20 Nico Isenbeck 2009-06-16 09:07:57 UTC
Created attachment 21931 [details]
dmesg output with kernel Linux 2.6.27-14

absolutely no problem with Ubuntu kernel 2.6.27-14 Thilo posted here. suspend/resume works fine!
but WLAN does not ;-)
Comment 21 Tejun Heo 2009-06-17 06:20:40 UTC
Strange.  2.6.28 introduced slave_link and the head parking thing but for PATA suspend/resume nothing really has changed.  The timeout occurs *after* libata resumes the device is very interesting.  It means polling data transfer works fine and also IRQ driven no-data commands work fine too.  Hmmm.... Don't have much clue at this point.  Alan, can you please try newer kernels on your machine?

Comment 22 Alec Wood 2009-06-17 12:40:33 UTC
I finally got a vanilla 2.6.30 running on my N110, and the problem is definitely still present in that kernel.
Comment 23 Alec Wood 2009-06-17 16:45:32 UTC
Interestingly, I do get the error using a hand-compiled vanilla kernel (wifi does work tho)... I'll give that Ubuntu a spin next.
Comment 24 Alec Wood 2009-06-17 20:11:01 UTC
I have the Ubuntu 2.6.27-14 kernel linked by Thilo booting.  In trying to reproduce the error (probably the third go round) I got a panic on suspend.  Transcription of panic output attached.
Comment 25 Alec Wood 2009-06-17 20:13:24 UTC
Created attachment 21973 [details]
Ubuntu 2.6.27-14 suspend failure output

Got this crash on or around 3rd attempt at suspend.
Comment 26 Tejun Heo 2009-06-18 01:24:53 UTC
That isn't a panic.  That's about the same problem happening on suspend.  The disk checked out while the machine tries to sleep (probably while trying to flush the request queue before the suspend commands get issued) and suspend operation failed.  The kernel correctly returned to proper operation but the disk is gone, so the system is not whole lot useful.

So, regardless of kernel, it looks like your drive likes to check out from time to time.  Can you please repeat the test and see whether the problem repeats itself?  Also, is there a firmeware update available for your drive?

Comment 27 Tejun Heo 2009-06-18 01:25:56 UTC
Also, does your system have smartd or hddtemp running?
Comment 28 Alec Wood 2009-06-18 04:13:03 UTC
Tejun: yes, I see your point.  Yes, I can reproduce the suspend failure with the 2.6.27 kernel.  I have also seen the resume failure with that same kernel (at least I think it was the same failure -- I never got any log output that time so I can't be sure).  However, I have never once seen the suspend failure with the 2.6.28 or 2.6.30 kernels -- maybe that's sampling error and maybe it isn't... 

As per firmware, there seem to be many firmware versions of this drive in the wild, but I have been unable to find an updater anywhere, from Samsung included.

And about smartd and hddtemp: no, I don't have either one installed.  They sound useful, though.
Comment 29 Tejun Heo 2009-06-18 04:54:39 UTC
The timeouts only occur while trying suspend or resume?  Can you please try to do some stress testing?  Repeatedly making multiple copies of a large file and checksumming them should do the trick.

Comment 30 Thilo-Alexander Ginkel 2009-06-18 06:24:12 UTC
@Alec: I also experienced the freezes on shutdown with more recent kernels, so this is most likely not a new symptom. IMHO they occur when there is not much activity between resume and suspend and the first disk access is triggered by the next suspend (as everything else could be read from the cache).

@Tejun: I personally have not seen the timeout to happen under any other circumstances, but will try your suggested stress test.

Apart from that my system is neither running hddtemp, nor smartd, but laptop-mode, which probably also re-configures the HDD during resume.

I am currently composing a message to Samsung support inquiring about the availability of a new firmware.
Comment 31 Thilo-Alexander Ginkel 2009-06-24 05:52:01 UTC
I put my system under load over night using three concurrent operations:

2x: while true; do dd if=/dev/zero of=test.dat bs=1024 count=10000000; done
1x: while true; do md5sum *; done # '*' matching ~ 8 GB of data

Nothing broke, so I would assume that high load alone does not trigger the issue.

Regarding my mail to Samsung support: They have acknowledged the receipt of my mail, but I have not received a personal answer yet.
Comment 32 Alec Wood 2009-06-28 23:54:47 UTC
I just experienced the problem (or something very like it) during boot up.  During the part of the boot sequence where the machine switches from text mode to graphics mode (but before any X graphics have appeared on screen) the drive access light came on and the machine just froze there.  I shut it down and restarted, and all is well.  Logs don't indicate anything interesting.

I'm starting to think it may be worth considering replacing the drive... maybe it's time to try out SSD...
Comment 33 Tejun Heo 2009-06-29 09:02:47 UTC
It seems like the problem has good chance of being a hardware issue but then again there are more than one reports, so I'm not sure either way at this point.  It would be great if you can rule in or out the drive as the cause of the problem.  I'd recommend borrowing one or buying something not too expensive to try it out but that's your call.

Comment 34 Thilo-Alexander Ginkel 2009-06-30 13:01:31 UTC
I just received a reply from Samsung stating that no firmware update is available for the drive... I'll try to convince them to send me a non-Samsung replacement drive.
Comment 35 Thilo-Alexander Ginkel 2009-07-12 22:11:52 UTC
Created attachment 22327 [details]
dmesg output for SATA trouble during regular bootup

I just experienced something very similar to what Alex described in comment #32, i.e. the HDD access hung during regular bootup when the system was going to switch to graphics mode (when starting X). The interesting thing is that this time the reset seems to have been successful (but that was always the case when the issue happened during a regular startup compared to a resume) and that the kernel log contains some details about what happened.
Comment 36 Tejun Heo 2009-07-15 04:16:04 UTC
Hmmm... if you specify "libata.force=udma33" at the kernel prompt, does it change anything?
Comment 37 Thilo-Alexander Ginkel 2009-07-15 21:27:21 UTC
Unfortunately, libata.force=udma33 did not help (if you are interested in the dmesg output, I can try to capture it)... Also, I upgraded to 2.6.31 as it was mentioned in the Ubuntu bug report above that this had solved the issue for one user, but I cannot confirm this.

I am wondering whether this might be a hardware issue. If the system encounters the error the HDD LED basically is on (and stays on) from the moment the system wake-up is triggered. However, I do not know anything about the Linux STR resume, so probably some actions (such as restoring PCI registers) happen very soon after triggering the wake-up...
Comment 38 Tejun Heo 2009-07-16 02:16:18 UTC
Yes, please attach the output.  Doesn't the machine suspend and resume fine on Windows?  There's likely some issue in the hardware or firmware but it's also likely the problem can be worked around with proper tweaks to the reset protocol.  Ergh...
Comment 39 Alec Wood 2009-07-18 02:58:09 UTC
As a followup to comment #37, problem still occurs for me under 2.6.31-rc3.
Comment 40 Thilo-Alexander Ginkel 2009-08-17 20:48:43 UTC
Created attachment 22759 [details]
dmesg output w/ libata.force=udma33 enabled

Sorry, it took a while... Please find attached the dmesg output with libata.force=udma33 enabled.

Meanwhile, I am seriously considering to replace the HDD with an Intel SSD. Did somebody already try this and can confirm that the issue goes away when switching to a different HDD model?
Comment 41 Tejun Heo 2009-08-19 01:58:18 UTC
libata.force=udma33 was not in effect in the posted dmesg.  Can you please check again?  The kernel command line doesn't show it and also libata should be printing about forcing UDMA/33.

Comment 42 Thilo-Alexander Ginkel 2009-08-19 07:03:02 UTC
Created attachment 22773 [details]
dmesg output w/ libata.force=udma33 enabled

Sorry, I guess I forgot to run update-grub after changing the kernel cmd line parameters. This should now be fixed, please refer to the attached log.
Comment 43 Tejun Heo 2009-08-19 07:20:29 UTC
Thanks.  Eh... I'm running out of ideas.  It looks like this will have to go up on the glorious ata wiki known issues page. :-(

Can you guys please attach the output of 'dmidecode'?  Also, is there a BIOS update available?

Comment 44 Thilo-Alexander Ginkel 2009-08-19 07:34:46 UTC
Created attachment 22774 [details]
dmidecode output

Please find attached the dmidecode output. A BIOS update is available, but I first need to figure out how to boot Windows from an USB stick as the update can only be applied under Windows.
Comment 45 Thilo-Alexander Ginkel 2009-08-19 13:01:54 UTC
I just made an (at least for me) very interesting observation: The HDD lockup issue was just triggered by a modification of the KDE power management options. No suspend/resume was involved at all.
Comment 46 Tejun Heo 2009-08-19 14:35:15 UTC
Hmmm.... interesting.  What settings exactly?  And how reliably can you reproduce it?
Comment 47 Thilo-Alexander Ginkel 2009-08-19 18:37:43 UTC
Nothing HDD related, I just toggled the "Dim display when idle for more than x mins" checkbox and hit "Apply".

I tried to reproduce the issue and succeeded after around ten attempts. So, while it is reproducible, I would not call that reliable.

IMHO, resume is most likely not the root cause, but more likely something related to power management. Is there a way to log which commands have been submitted to the HDD before the freeze happens (e.g., in a kind of ring buffer)?
Comment 48 Thilo-Alexander Ginkel 2009-08-19 18:44:40 UTC
Created attachment 22777 [details]
Screenshot of the PM settings dialog w/ the changed setting highlighted

[although I think which specific PM setting was actually changed does not matter]
Comment 49 Thilo-Alexander Ginkel 2009-08-21 19:50:22 UTC
I applied the most recent BIOS and MICOM update, but that unfortunately did not solve the issue.
Comment 50 Tejun Heo 2009-08-27 06:36:56 UTC
Once every ten times is pretty reliable, I think.  Yeah, it probably is issuing some ata related commands when changing the settings.  Hmmm... I can write a patch to print any non-IO commands out.  Just a moment.
Comment 51 Tejun Heo 2009-08-27 07:19:29 UTC
Created attachment 22873 [details]

Can you please apply the attached patch and report what the kernel says when you change the setting and trigger the failure?  It'll dump non-IO commands as they're being issued.  It's disabled by default.  You can turn it on by specifying kernel parameter libata.monitor_cmds=PORTNO or enable it during runtime with "echo PORTNO > /sys/module/libata/parameters/monitor_cmds".

Comment 52 Thilo-Alexander Ginkel 2009-08-28 06:40:10 UTC
Created attachment 22888 [details]
dmesg output w/ cmd tracing enabled

Thanks for providing the patch. It seems that the lockup is triggered by issuing the command used to set the HDD's APM level to 254 (= 0xfe). Saving the KDE Power Management settings issues that command multiple times in a row and the same also happens during resume.

Maybe the HDD does not like receiving this command multiple times in a row? I will probably try if I can reproduce this with hdparm -B.

The attached log shows the lockup occurring after hitting the "Apply" button for the first time during that session.
Comment 53 Tejun Heo 2009-08-28 07:24:11 UTC
Created attachment 22889 [details]

Can you please apply this patch and see whether the problem goes away?  Thanks.
Comment 54 Thilo-Alexander Ginkel 2009-09-02 22:22:25 UTC
The freezes still occur with the patch applied, but I am not sure whether the patch works as intended as I am not seeing any "XXX filtering out APM" log messages although I increased the severity to KERN_ERR.
Comment 55 Tejun Heo 2009-09-02 23:04:05 UTC
You can issue APM commands directly using "hdparm -B".  Does that trigger the filter?
Comment 56 Tejun Heo 2009-09-02 23:04:38 UTC
Also, you can apply patch from #51 together so that the kernel prints out all those special commands.
Comment 57 Thilo-Alexander Ginkel 2009-09-03 05:57:19 UTC
Neither hdparm -B nor any of the other triggers end up being caught by the filter. I am also certain that I am not just missing the log message because when I issue "hdparm -B 1 /dev/sda" the HDD spins down rather quickly and again reverts to not spinning down when I issue "hdparm -B 254 /dev/sda".

As I applied both patches together I can also see which commands are generated:

[  174.368158] ata1.00: cmd ef/05:fe:00:00:00/00:00:00:00:00/40 tag 0
[  215.019090] ata1.00: cmd ef/05:01:00:00:00/00:00:00:00:00/40 tag 0
[  236.158646] ata1.00: cmd ef/05:fe:00:00:00/00:00:00:00:00/40 tag 0

Any idea?
Comment 58 Tejun Heo 2009-09-03 06:51:55 UTC
Created attachment 22984 [details]

It was looking at tf before it was initialized.  This means that TPM command filtering has never really worked.  I'll fix that up.  The attached patch should filter correctly.  Thanks.
Comment 59 Thilo-Alexander Ginkel 2009-09-04 06:51:05 UTC
Good news: After applying the patch from #58 along with the "libata: unbreak TPM filtering by reorganizing ata_scsi_pass_thru()" patch you sent by e-mail APM filtering seems to work correctly and I can no longer reproduce the freezes.
Comment 60 Tejun Heo 2009-09-05 00:46:18 UTC
Great, well, I'm not yet sure how to proceed here.  We can filter it out from the kernel but it's also that the power management thing in KDE is being too aggressive.  ATA APM settings aren't defined very clearly and implementation varies greatly between different products.  On some, high values easily cause overheating and 254 is the highest value while 255 wraps to zero for some reason and lower values can cause too frequent head unloads leading to shortened lifespan of the device.  Plus, being a not-so-frequently feature, it's not too unexpected that some device will throw up when hit with the command repeatedly like you experienced.

In short, APM setting is not something you wanna mess with unless you specifically know which value means what and that it would work and filtering out APM commands from the kernel would also block other safer uses which could possibly be using different more well supported value.

If KDE power management wants to directly control APM setting, it will have to build a database of supported APM values on each mobile drive product line and which values mean what for each of them and whether it's safe to use.

Hmmmm... Can you please report this to KDE and ask them not to adjust APM settings?  It's just too dangerous.

I'm resolving this one as WILL_NOT_FIX for now but please feel free to add comments and/or reopen if you think there's more to be done from the kernel side.

Comment 61 Thilo-Alexander Ginkel 2009-09-05 06:44:28 UTC
First of all, thanks for all the help in analyzing this issue!

I have kind of mixed feelings with regards to attempting to solve this on the user level, but also understand that unconditional filtering on the Kernel level is a bad idea.

As I noticed that the bug does not seem to be KDE-specific (the original Ubuntu bug report was for Gnome) I digged a little bit deeper and it turned out that the acpi-support package is responsible for invoking hdparm -B way too often on various occasions (and with hardcoded APM values). I have updated the original Ubuntu bug report with the findings from this one and assigned it to the apm-support package.
Comment 62 Tejun Heo 2009-09-05 12:00:03 UTC
Ah... I thought it was done by kpowersaved or something similar as I never noticed it with my opensuse gnome installation.  Yeap, please report it to whichever package which is issuing the command.  APM setting is simply not something to be used that way.  Not yet at least.

Comment 63 Hans Werner 2009-10-13 18:40:29 UTC
I see this issue with the Samsung N140 (BIOS 04CU)
with *both* the
Samsung HM160HI drive (fw 100-06) 
and the 
Western Digital WD1600BEVT drive (fw 1101A11).

Kernels tried so far (Samsung) (Samsung and WD) (WD)

So it's clearly not specific to the Samsung HDD.

Easy to reproduce too -- just boot (or resume). At first dmesg looks OK.
Wait a few minutes and the machine will freeze and now dmesg shows in

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd ca/00:08:5a:05:cd/00:00:00:00:00/e5 tag 0 dma 4096 out
         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: link is slow to respond, please be patient (ready=0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting link
ata1.00: configured for UDMA/133
ata1.00: device reported invalid CHS sector 0
ata1: EH complete

The WD drive currently has a very bare Linux installation
with no KDE, or even Xorg, and nothing doing hdparm -B (afaik).

The Samsung drive currently has 28 errors like this.
There have been more than 28 freezes on this disk though.

$ smartctl -a /dev/sda:
Error 28 occurred at disk power-on lifetime: 277 hours (11 days + 13 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  -- -- -- -- -- -- --
  84 51 20 ba 36 91 e3  Error: ICRC, ABRT 32 sectors at LBA = 0x039136ba = 59848378

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 ba 36 91 e3 00      00:33:55.312  READ DMA
  ef 10 03 f9 82 d0 a0 00      00:31:28.125  SET FEATURES [Reserved for Serial ATA]
  ca 00 08 f2 82 d0 e2 00      00:31:24.375  WRITE DMA
  ca 00 08 2a c2 c0 e2 00      00:31:24.375  WRITE DMA
  ca 00 08 e2 a2 c0 e2 00      00:31:24.375  WRITE DMA
Comment 64 Tejun Heo 2009-10-14 02:33:47 UTC
Hans, does libata.force=1.5Gbps make any difference?  Please post kernel boot log with the parameter applied and note that you may need to modify initrd depending on your distribution to apply the parameter.

Comment 65 Hans Werner 2009-10-14 12:56:09 UTC
Created attachment 23407 [details]
kernel.log with libata.force=1.5Gbps

Here's the kernel log. Using libata.force=1.5Gbps didn't help.

Is there no message to confirm the force was picked up? I put libata in my
initrd, "options libata force=1.5Gbps" in modprobe.conf and rebuilt the
Comment 66 Thilo-Alexander Ginkel 2009-10-14 13:50:04 UTC
Well, in the log it states:

kernel: Unknown boot option `libata.force=1.5Gbps': ignoring

BTW, you stated that you are quite confident that no APM commands are issued. Could you try using the ata-monitor-cmds.patch to verify that this is actually the case?
Comment 67 Hans Werner 2009-10-14 15:28:16 UTC
Correct me if I am wrong, but I think the kernel throws out that message
because of the command line I have in grub, but the module in initrd is set
up correctly by modprobe.conf which is also in initrd. I did experiment with
the ata-monitor-cmds.patch a few days ago with the Samsung disk and KDE.
Will recheck. The WD disk has just the initial Arch installation -- just
enough to boot. But I will check.
Comment 68 Hans Werner 2009-10-14 20:07:17 UTC
Created attachment 23410 [details]
kernel.log with libata.force=1.5Gbps and monitor_cmds=1

Now with the monitoring patch applied. SETFEATURES_XFER is showing
up. Is that a problem?
Comment 69 Hans Werner 2009-10-14 20:29:56 UTC
Note there were no ef/05 (APM) commands in the log but there was a freeze.

If I type an hdparm -B command by hand I do see an ef/05 log entry of course.
It's set to 128 by default by the way.
Comment 70 Tejun Heo 2009-10-15 07:39:12 UTC
Ah... sorry about that.  Your controller doesn't have access to SControl register so it won't be able to force the link speed down to 1.5Gbps.

Looks like genuine transmission problem.  SMART is logging them as transmission errors too.  Any chance the hardware is faulty?
Comment 71 Hans Werner 2009-10-15 15:40:41 UTC
I think it's unlikely to be faulty hardware. I'd expect to see more 
randomness in the occurrence of the problem -- this is completely
reproducible, including timing, and with two different disks. The netbook
is new and hardly used let alone abused. I checked that the screws holding
the disk in place are tight.

Could the problem be in the area of power management?

The smartctl error messages I posted above show ef/10/03 which is 
related to DIPM Device Initiated Power Management. Could the problem
be something to do with power management of the link?

If the link gets into a low power mode (for any reason, device initiated
or otherwise) is libata/ata_piix failing to recognise this and bring it up
again in the right way?

Actually if the ICH7-M cant access SControl then is the online/offline
status of the link ever detected properly?
Comment 72 Tejun Heo 2009-10-16 02:39:08 UTC
Oh... DIPM can definitely do that.  Can you please try "libata.acpi_gtf_filter=0x1f" and report the boot log?
Comment 73 Hans Werner 2009-10-17 03:15:30 UTC
Created attachment 23441 [details]
kernel.log with libata.acpi_gtf_filter=0x1f

In the logs I posted above I already had acpi_gtf_filter set to 7 which
is the max (and default) in 2.6.31 and includes the DIPM bit. Here is the
log for 2.6.32-rc4 with libata.acpi_gtf_filter=0x1f.

I looked into this with powertop. After a few minutes idle, the machine
switches into a low power mode, stepping down from about 9W to 6W.
And it seems the freeze happens just after the machine switches into
that mode for the first time after booting. The disk spins down, but
perhaps more hardware gets turned off. Interesting that this does not
happen again subsequently -- the state after the hardreset is different
from the state just after boot. Could it be something like the vacant port
going from on-but-vacant to off-and-really-invisible?
Booting with acpi=off stops this switch to low power and stops the
freezing -- not acceptable as a workaround but tells us something.
With pci=noacpi or libata.noacpi=1 the freeze still occurs.
Comment 74 Tejun Heo 2009-10-18 03:28:07 UTC
(cc'd Thomas.)

This is getting ridiculous.  It looks like certain ACPI method used when entering low power mode is issuing ATA commands to enable DIPM and it's not the _GTF.  That ACPI method is issuing an ATA command behind ATA driver's back.  I guess this was to achieve power saving without OS knowing about it but for some reason, after initialized by ata_piix, the host controller doesn't behave well when the device initiates powersave mode (which isn't too unexpected, maybe windows driver doesn't do full controller reset so that the configuration done by BIOS remains intact?).  Anyways, we need to find out which one is responsible for this and not use the ACPI method as it's frigging dangerous for an ACPI method to issue ATA command directly without going through _GTF and the OS driver.

Comment 75 Hans Werner 2009-10-19 01:11:53 UTC
Ugh, that's horrible, and dangerous.

Has anyone got any ideas on how confirm this, i.e. to find which ACPI
method is causing the trouble? Unfortunately I have only a couple of
days' more debug time with the N140 (it's not mine).
Comment 76 Thomas Renninger 2009-10-19 08:19:13 UTC
The bug already is rather long...
Let me summarize things a bit (please correct):
  - The machine works fine under gnome and it may be related to kpowersave
    init/runtime settings?
  - It is eventually related to backlight settings, possibly kpowersave is
    trying to set disk's APM settings directly if it detects a laptop or similar?
Hmm, later in the bug it looks as if kpowersaved is out of the game?
The ata poking causing this happens at early boot already?

ACPI shouldn't be able to send ATA commands to the disk at all.
_GTF and friends providing a buffer/string with ATA commands how the OS has to set up the disk. I also wonder what in the output of comment #73 points to ACPI issuing something.

Please try:
acpi.debug_layer=0xFFFFFFFF acpi.debug_level=0x1F log_buf_len=32M

Do you see ACPI messages near the ATA failure?
Please attach acpidump and a dmesg output with above boot params.

You may want to increase the acpi.debug_level to 0x21F
Then you will get *a lot* output.

Hmm, possibly it's related to C-states?
From dmesg:
Monitor-Mwait will be used to enter C-2 state
ata1.00: failed command: WRITE DMA

Before you try above, can you try idle=poll, if this helps try idle=nomwait
You won't get C-states with acpi=off, possibly it's that.
Comment 77 Hans Werner 2009-10-19 23:27:06 UTC
Created attachment 23470 [details]
acpidump for Samsung N140 BIOS 04CU .tar.bz2

Here's the acpidump output.
Comment 78 Hans Werner 2009-10-19 23:33:34 UTC
Created attachment 23471 [details]
kernel.log.bz2 with ACPI debug_layer=0xFFFFFFFF debug_level=0x1F

Here's the kernel.log with
acpi.debug_layer=0xFFFFFFFF acpi.debug_level=0x1F log_buf_len=32M
Comment 79 Hans Werner 2009-10-19 23:41:58 UTC
Created attachment 23472 [details]
kernel.log.bz2 with ACPI debug_layer=0xFFFFFFFF debug_level=0x21F

And here's the more verbose kernel.log with 
acpi.debug_layer=0xFFFFFFFF acpi.debug_level=0x21F log_buf_len=32M

These logs show the ata disk "frozen" messages as before, plus ACPI

With idle=poll or idle=nomwait I again saw the freezes.
Comment 80 Thomas Renninger 2009-10-20 11:36:37 UTC
Sorry, I cannot see anything obvious.
Would libata.noacpi param make sense to test?
_GTM sata func is invoked, this is the only ACPI sata func I see processed, but I expect this is how it should work and there are no warnings. libata.noacpi should skip this one.
Comment 81 Tejun Heo 2009-10-20 13:30:54 UTC
Thomas, it's not _GTM or _GTF and libata.noacpi doesn't solve the issue.  If the drive is actually receiving DIPM commands, it's not going through libata.  I think one of the ACPI routines might be issuing the command directly to the controller.  Is there anything which look like it might do something like this?

Hans, can you please verify the drive is actually getting DIPM command when this happens?  You can compare the power-on hours and the timestamps of the error log to check this.

Comment 82 Thomas Renninger 2009-10-20 20:17:45 UTC
No idea what DIPM commands are...
If you could provide me with an IO/mem address to look for...
I checked that SATA related PCI config space declared there does not get touched anywhere.
Hmm, isn't the NC10 rather widespread? I already saw key code quirks.
Or it is related to this one:
[Bug 14314] Samsung n130: HDD lockup after 5 min after boot
Did you already try older/different kernels?
Comment 83 Thomas Renninger 2009-10-20 20:22:11 UTC
What distro do you use?
Best try a openSUSE liveCD/DVD, if it works it should be related to bug #14314.
Comment 84 Thilo-Alexander Ginkel 2009-10-20 20:32:55 UTC
Thomas, while the original issue popped up on a NC10 I think Hans reported an issue for an N140, which may use similar, but not identical hardware.
Comment 85 Hans Werner 2009-10-22 16:02:09 UTC
Thomas, thanks I had not seen Bug 14314 -- it does look like it may be the
same problem I see. The N130 and the N140 are almost identical I believe,
and the NC10 is very similar hardware, although perhaps the BIOS is
different (perhaps in power management since Samsung claim an improved
battery life, despite having almost the same hardware and the same battery
capacity). I don't know if we have two issues here or one with slightly
different symptoms.

SATA_DIPM) got to the disk but libata would/should filter this. If an
ACPI routine issues SATA commands to the controller we would not see
that directly in the log would we?

In my log from Comment #78 perhaps look between 23:43:48 and 23:47:29,
i.e. between end of boot and when the freeze occurred.

Wild guess: how about _Q5F, _Q66 or _Q70 ?

I tried blacklisting the battery module to no avail.

Tejun, before I posted Comment #63 I did spend some time trying to work out
if the DIPM commands I saw in the smartctl error logs were happening at the
same time as the freezes. My conclusion was that yes they were, as far as 
I could tell with a resolution of as long as it takes to type some commands
by hand. I didn't see a smartctl error log for every freeze though, and the
WD disk did not record any errors, but perhaps it can't for some reason.

Tejun, does SCR access via SIDPR in ata_piix.c work for ICH7-M? If not
then the libata-core routines ata_{phys,}_link_{off,on}line always *all*
return false don't they? Which ought to screw up the logic in libata-eh.c
which depends on them, no? Or perhaps I'm misunderstanding something.

Thomas, I tried kernels 2.6.28-32rc4, distro Arch Linux. I don't have
access to the N140 any more unfortunately. Perhaps briefly in a month.
Comment 86 Tejun Heo 2009-10-26 14:59:52 UTC
(In reply to comment #85)
> SATA_DIPM) got to the disk but libata would/should filter this. If an
> ACPI routine issues SATA commands to the controller we would not see
> that directly in the log would we?

That was my wild guess but of course I could be wrong.

> Tejun, before I posted Comment #63 I did spend some time trying to work out
> if the DIPM commands I saw in the smartctl error logs were happening at the
> same time as the freezes. My conclusion was that yes they were, as far as 
> I could tell with a resolution of as long as it takes to type some commands
> by hand. I didn't see a smartctl error log for every freeze though, and the
> WD disk did not record any errors, but perhaps it can't for some reason.

Yeah, for some reason, disks often do not log all the failures but if you see DIPM commands in the failure log which monitor_cmds doesn't report, something is bypassing the kernel and issuing command to the drive directly.

> Tejun, does SCR access via SIDPR in ata_piix.c work for ICH7-M? If not
> then the libata-core routines ata_{phys,}_link_{off,on}line always *all*
> return false don't they? Which ought to screw up the logic in libata-eh.c
> which depends on them, no? Or perhaps I'm misunderstanding something.

Unfortunately, SIDPR only works from ich8 on and both link_{off,on} are supposed to return false if link status is not available.  It basically tells the core that it doesn't know about the link status and libata core layer will behave correctly (all PATA controllers return false for both functions too).
Comment 87 Jan 2009-10-26 18:00:07 UTC
Created attachment 23534 [details]
requested logs

I can confim the bug on N140:(
Comment 88 Tejun Heo 2009-10-28 16:25:53 UTC
Thomas, any ideas?
Comment 89 Thomas Renninger 2009-10-28 21:55:32 UTC
DIPM, SIDPR,... doesn't tell me much.
I searched for "grep -i ef [SD]SDT*.dsl" and the matches were not much. All sane was in buffers created in _GTF funcs which have to be invoked by OS and the returned buffer has to be send to the disk by OS as well, right?

> Thomas, any ideas?
Unfortunately, no.
Comment 90 Tejun Heo 2009-10-29 10:44:26 UTC
Oh... DIPM or SIDPR doesn't matter at all.  The thing to look out for would be direct accesses to IO ports for the ATA controller (pci dev 1f.2, the ports should be 0x1f0).

Comment 91 Thomas Renninger 2009-10-29 12:08:22 UTC
I can't see how this is accessed anywhere by ACPI. I may have overseen something, but I am rather sure these ports are not touched by ACPI.
The last two params are address and length of a port or a memory mapped IO region which can later be accessed in functions:

grep "SystemIO" *
DSDT.dsl:    OperationRegion (PRT0, SystemIO, 0x80, 0x04)
DSDT.dsl:    OperationRegion (S_IO, SystemIO, 0x06B0, 0x11)
DSDT.dsl:    OperationRegion (IO_T, SystemIO, 0x0800, 0x10)
DSDT.dsl:    OperationRegion (PMIO, SystemIO, 0x1000, 0x80)
DSDT.dsl:    OperationRegion (GPIO, SystemIO, 0x1180, 0x3C)
DSDT.dsl:        OperationRegion (SECT, SystemIO, 0x5000, 0x10)
DSDT.dsl:        OperationRegion (SMI0, SystemIO, 0x0000FE00, 0x00000002)
DSDT.dsl:                OperationRegion (SMBI, SystemIO, 0x18A0, 0x10)

trenn@queen:~/dsdt/samsungN140_BIOS_04CU> grep "SystemMemory" *
DSDT.dsl:    OperationRegion (GNVS, SystemMemory, 0x3F6D1DBC, 0x0100)
DSDT.dsl:    OperationRegion (RCRB, SystemMemory, 0xFED1C000, 0x4000)
DSDT.dsl:        OperationRegion (SNVS, SystemMemory, 0x3F6D1CBD, 0x000000FF)
DSDT.dsl:        OperationRegion (TCG1, SystemMemory, 0x3F6D1CB6, 0x00000007)
DSDT.dsl:        OperationRegion (SMI1, SystemMemory, 0x3F6D1EBD, 0x00000090)

For Sata PCI config space there seem to be some bits and bytes declared as variables:
OperationRegion (SACS, PCI_Config, 0x40, 0xC0)
                Field (SACS, DWordAcc, NoLock, Preserve)
                    PRIT,   16,
But the declared variables are only used in _GTM/_STM/_GTF functions which are not invoked by ACPI code itself.
Comment 92 Tejun Heo 2009-11-09 04:00:34 UTC
Hmmm.... strange.  Those commands definitely are not going through libata but is being logged by the drive and disabling acpi makes them go away.  What could it be?  Puzzled...

Hans, can you please verify one more time with the ata-monitor-cmds patch and smart log?

Comment 93 Hans Werner 2009-11-10 22:04:30 UTC
Tejun, unfortunately I don't have any access to the N140 at the moment.
Perhaps Jan can help, or someone else who is watching this?
I may be able to do some brief tests in two weeks.

Perhaps some more trial and error tests could be used to try and narrow down the 
location of the problem. Thomas, can you suggest how we could selectively turn off parts of ACPI or individual routines in order to obtain some tighter limits on where the issue might be?
Comment 94 Thomas Renninger 2009-11-12 21:01:11 UTC
Overriding DSDT via initrd does not work with latest mainline kernels.
You have to compile it in:
iasl -tc DSDT.dsl
Then copy DSDT.hex to the kernel source dir: drivers/acpi
and compile with:

You can cluster code you think which could be affected with:
Store("Debug to show up in syslog", debug)
Store(variable_xy, debug)

Make sure that acpi.debug_layer= and acpi.debug_level is set to show this.
acpi.debug_layer=0xFFFFFFF acpi.debug_level=0xF
should work with a lot, but not insane amount of debug code spit out.
(Latest kernels define the second debug_level bit to see stores to the debug object):
ACPI_LV_INIT                    0x00000001 [ ]
ACPI_LV_DEBUG_OBJECT            0x00000002 [ ]
that changed recently...
Do you really want to go through this... You should look at some code that could affect this (I couldn't find anything according to your suggestions, but I don't know anything about disk HW...).
Comment 95 Tejun Heo 2010-01-01 01:09:42 UTC
*** Bug 14928 has been marked as a duplicate of this bug. ***
Comment 96 Hans Werner 2010-01-02 12:43:50 UTC
It has been reported that if one applies a patch (by Tejun) which removes
spurious IRQs then this problem is no longer seen. Please see my post here:
Comment 97 Hans Werner 2010-01-02 13:02:56 UTC
See also Bug 14314.

Note You need to log in before you can comment on or make changes to this bug.