When using the suspend/resume feature (this applies to both STR and Hibernate) on my Samsung NC10 I have a 1:5 chance that the machine does not properly resume when woken up again. Once in a while any disk I/O times out after resume rendering the system unusable until it is power-cycled. While the error happens, the HDD LED stays on all the time. The resulting error messages read: [<timestamp>] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen [<timestamp>] ata1.00: cmd ca/00:58:87:c1:63/00:00:00:00:00/e9 tag 0 dma 45056 out [<timestamp>] res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) [<timestamp>] ata1: link is slow to respond, please be patent (ready=0) [<timestamp>] ata1: device not ready (errno=-16), forcing hardreset [<timestamp>] ata1: soft resetting link [<timestamp>] ata1: link is slow to respond, please be patent (ready=0) [<timestamp>] ata1: SRST failed (errno=-16) And after some time: [<timestamp>] end_request: I/O error, dev sda, sector <sector#> An in-depth analysis by Manoj Iyer and Tejun Heo has been posted downstream at: https://bugs.launchpad.net/ubuntu/jaunty/+source/linux/+bug/340014/comments/11
Created attachment 21664 [details] dmesg output
Created attachment 21665 [details] lspci -vvv output
Created attachment 21673 [details] ata_piix-pcs-hardreset.patch Can you please try this patch? Thanks.
I tried the patch, but unfortunately, the I/O freezes still happen. The error message has slightly changed, though: [<timestamp>] ata2: XXX: reset complete, classes=9,0 [<timestamp>] ata1: XXX: reset complete, classes=1,9 [<timestamp>] ata1.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out [<timestamp>] ata1.00: ACPI cmd ef/03:45:00:00:00:a0 filtered out [...] [<timestamp>] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen [<timestamp>] ata1.00: cmd ca/00:08:4f:ca:fd/00:00:00:00:00/e5 tag 0 dma 4096 out [<timestamp>] res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) [<timestamp>] ata1.00: status: { DRDY } [<timestamp>] ata1: hard resetting link [<timestamp>] ata1: XXX: doing PCS hardreset, pcsmask=0x5 [<timestamp>] ata1: link is slow to respond, please be patent (ready=0) [<timestamp>] ata1: PCS hardreset failed (errno=-16)
Can you please attach full kernel log including boot messages?
Created attachment 21690 [details] kern.log Sure. The log is truncated at the end, but the first couple of failed hard reset attempts are included.
Can you please post full log w/o the patch? The interesting thing is that during the last resume, the device was resumed correctly (ata1.00: configured for UDMA/133) but timeout occurred when an actual IO was issued to it. After that, the device seemed to have completely shut down. Hmmm... maybe we need to play with SIG_MODE. Anyways, please post full log without the patch. Thanks.
Created attachment 21702 [details] kern.log (w/o patch) Please find attached a kern.log created without the patch applied. An observation I made after using Magic SysReq to reboot the machine is that the HDD (or controller) seems to lock up in such a way that even the BIOS is unable to recover it when performing a soft reboot (i.e., the BIOS hangs during POST). Only power-cycling helps. This is the case for the patched and the unpatched kernel version.
Created attachment 21721 [details] ata_piix-sig_mode-on-suspend.patch Can you please this patch and post kernel log?
Can't duplicate this behaviour on my NC10
Alan, which HDD model is built into your NC10?
WDC1600BEVT
Created attachment 21733 [details] kern.log (w/ ata_piix-sig_mode-on-suspend.patch) It took a little bit longer than expected, but finally I have been able to reproduce the issue with the newest patch set applied. Please find attached the resulting kernel log. @Nico: Which HDD model is your NC10 based upon? @Alan: I assume you are running a more recent kernel than 2.6.28, right?
hdparm -i /dev/sda /dev/sda: Model=SAMSUNG HM160HI , FwRev=HH100-06 uname -a Linux nico-netbook 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009 i686 GNU/Linux
Mine is usually running 2.6.27 but they are different drives and that may be relevant.
Does 'irqpoll' make any difference?
Unfortunately, the problem still occurs with irqpoll enabled.
I can confirm the same bug on the Samsung N110 -- same hard drive and kernel as Nico above.
What I'd like to find out is whether the issue is new with 2.6.28 (as Alan reported he did not observe the issue with his NC10 running 2.6.27) or whether the different hard drive models are causing the different behavior (Alan's NC10 is using a WD drive whereas all others have Samsung HDDs). I already tried booting a 2.6.27 kernel, which however panicked due to an issue with dmcrypt. Could somebody else (who also has a Samsung HDD in his NC10) try if the problem is reproducible with a 2.6.27 kernel? A recent 2.6.27 kernel for Ubuntu is available at: http://security.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-2.6.27-14-generic_2.6.27-14.34_i386.deb
Created attachment 21931 [details] dmesg output with kernel Linux 2.6.27-14 absolutely no problem with Ubuntu kernel 2.6.27-14 Thilo posted here. suspend/resume works fine! but WLAN does not ;-)
Strange. 2.6.28 introduced slave_link and the head parking thing but for PATA suspend/resume nothing really has changed. The timeout occurs *after* libata resumes the device is very interesting. It means polling data transfer works fine and also IRQ driven no-data commands work fine too. Hmmm.... Don't have much clue at this point. Alan, can you please try newer kernels on your machine? Thanks.
I finally got a vanilla 2.6.30 running on my N110, and the problem is definitely still present in that kernel.
Interestingly, I do get the error using a hand-compiled vanilla 2.6.27.14 kernel (wifi does work tho)... I'll give that Ubuntu 2.6.27.14 a spin next.
I have the Ubuntu 2.6.27-14 kernel linked by Thilo booting. In trying to reproduce the error (probably the third go round) I got a panic on suspend. Transcription of panic output attached.
Created attachment 21973 [details] Ubuntu 2.6.27-14 suspend failure output Got this crash on or around 3rd attempt at suspend.
That isn't a panic. That's about the same problem happening on suspend. The disk checked out while the machine tries to sleep (probably while trying to flush the request queue before the suspend commands get issued) and suspend operation failed. The kernel correctly returned to proper operation but the disk is gone, so the system is not whole lot useful. So, regardless of kernel, it looks like your drive likes to check out from time to time. Can you please repeat the test and see whether the problem repeats itself? Also, is there a firmeware update available for your drive? Thanks.
Also, does your system have smartd or hddtemp running?
Tejun: yes, I see your point. Yes, I can reproduce the suspend failure with the 2.6.27 kernel. I have also seen the resume failure with that same kernel (at least I think it was the same failure -- I never got any log output that time so I can't be sure). However, I have never once seen the suspend failure with the 2.6.28 or 2.6.30 kernels -- maybe that's sampling error and maybe it isn't... As per firmware, there seem to be many firmware versions of this drive in the wild, but I have been unable to find an updater anywhere, from Samsung included. And about smartd and hddtemp: no, I don't have either one installed. They sound useful, though.
The timeouts only occur while trying suspend or resume? Can you please try to do some stress testing? Repeatedly making multiple copies of a large file and checksumming them should do the trick. Thanks.
@Alec: I also experienced the freezes on shutdown with more recent kernels, so this is most likely not a new symptom. IMHO they occur when there is not much activity between resume and suspend and the first disk access is triggered by the next suspend (as everything else could be read from the cache). @Tejun: I personally have not seen the timeout to happen under any other circumstances, but will try your suggested stress test. Apart from that my system is neither running hddtemp, nor smartd, but laptop-mode, which probably also re-configures the HDD during resume. I am currently composing a message to Samsung support inquiring about the availability of a new firmware.
I put my system under load over night using three concurrent operations: 2x: while true; do dd if=/dev/zero of=test.dat bs=1024 count=10000000; done 1x: while true; do md5sum *; done # '*' matching ~ 8 GB of data Nothing broke, so I would assume that high load alone does not trigger the issue. Regarding my mail to Samsung support: They have acknowledged the receipt of my mail, but I have not received a personal answer yet.
I just experienced the problem (or something very like it) during boot up. During the part of the boot sequence where the machine switches from text mode to graphics mode (but before any X graphics have appeared on screen) the drive access light came on and the machine just froze there. I shut it down and restarted, and all is well. Logs don't indicate anything interesting. I'm starting to think it may be worth considering replacing the drive... maybe it's time to try out SSD...
It seems like the problem has good chance of being a hardware issue but then again there are more than one reports, so I'm not sure either way at this point. It would be great if you can rule in or out the drive as the cause of the problem. I'd recommend borrowing one or buying something not too expensive to try it out but that's your call. Thanks.
I just received a reply from Samsung stating that no firmware update is available for the drive... I'll try to convince them to send me a non-Samsung replacement drive.
Created attachment 22327 [details] dmesg output for SATA trouble during regular bootup I just experienced something very similar to what Alex described in comment #32, i.e. the HDD access hung during regular bootup when the system was going to switch to graphics mode (when starting X). The interesting thing is that this time the reset seems to have been successful (but that was always the case when the issue happened during a regular startup compared to a resume) and that the kernel log contains some details about what happened.
Hmmm... if you specify "libata.force=udma33" at the kernel prompt, does it change anything?
Unfortunately, libata.force=udma33 did not help (if you are interested in the dmesg output, I can try to capture it)... Also, I upgraded to 2.6.31 as it was mentioned in the Ubuntu bug report above that this had solved the issue for one user, but I cannot confirm this. I am wondering whether this might be a hardware issue. If the system encounters the error the HDD LED basically is on (and stays on) from the moment the system wake-up is triggered. However, I do not know anything about the Linux STR resume, so probably some actions (such as restoring PCI registers) happen very soon after triggering the wake-up...
Yes, please attach the output. Doesn't the machine suspend and resume fine on Windows? There's likely some issue in the hardware or firmware but it's also likely the problem can be worked around with proper tweaks to the reset protocol. Ergh...
As a followup to comment #37, problem still occurs for me under 2.6.31-rc3.
Created attachment 22759 [details] dmesg output w/ libata.force=udma33 enabled Sorry, it took a while... Please find attached the dmesg output with libata.force=udma33 enabled. Meanwhile, I am seriously considering to replace the HDD with an Intel SSD. Did somebody already try this and can confirm that the issue goes away when switching to a different HDD model?
libata.force=udma33 was not in effect in the posted dmesg. Can you please check again? The kernel command line doesn't show it and also libata should be printing about forcing UDMA/33. Thanks.
Created attachment 22773 [details] dmesg output w/ libata.force=udma33 enabled Sorry, I guess I forgot to run update-grub after changing the kernel cmd line parameters. This should now be fixed, please refer to the attached log.
Thanks. Eh... I'm running out of ideas. It looks like this will have to go up on the glorious ata wiki known issues page. :-( Can you guys please attach the output of 'dmidecode'? Also, is there a BIOS update available? Thanks.
Created attachment 22774 [details] dmidecode output Please find attached the dmidecode output. A BIOS update is available, but I first need to figure out how to boot Windows from an USB stick as the update can only be applied under Windows.
I just made an (at least for me) very interesting observation: The HDD lockup issue was just triggered by a modification of the KDE power management options. No suspend/resume was involved at all.
Hmmm.... interesting. What settings exactly? And how reliably can you reproduce it?
Nothing HDD related, I just toggled the "Dim display when idle for more than x mins" checkbox and hit "Apply". I tried to reproduce the issue and succeeded after around ten attempts. So, while it is reproducible, I would not call that reliable. IMHO, resume is most likely not the root cause, but more likely something related to power management. Is there a way to log which commands have been submitted to the HDD before the freeze happens (e.g., in a kind of ring buffer)?
Created attachment 22777 [details] Screenshot of the PM settings dialog w/ the changed setting highlighted [although I think which specific PM setting was actually changed does not matter]
I applied the most recent BIOS and MICOM update, but that unfortunately did not solve the issue.
Once every ten times is pretty reliable, I think. Yeah, it probably is issuing some ata related commands when changing the settings. Hmmm... I can write a patch to print any non-IO commands out. Just a moment.
Created attachment 22873 [details] ata-monitor-cmds.patch Can you please apply the attached patch and report what the kernel says when you change the setting and trigger the failure? It'll dump non-IO commands as they're being issued. It's disabled by default. You can turn it on by specifying kernel parameter libata.monitor_cmds=PORTNO or enable it during runtime with "echo PORTNO > /sys/module/libata/parameters/monitor_cmds". Thanks.
Created attachment 22888 [details] dmesg output w/ cmd tracing enabled Thanks for providing the patch. It seems that the lockup is triggered by issuing the command used to set the HDD's APM level to 254 (= 0xfe). Saving the KDE Power Management settings issues that command multiple times in a row and the same also happens during resume. Maybe the HDD does not like receiving this command multiple times in a row? I will probably try if I can reproduce this with hdparm -B. The attached log shows the lockup occurring after hitting the "Apply" button for the first time during that session.
Created attachment 22889 [details] filter-out-apm.patch Can you please apply this patch and see whether the problem goes away? Thanks.
The freezes still occur with the patch applied, but I am not sure whether the patch works as intended as I am not seeing any "XXX filtering out APM" log messages although I increased the severity to KERN_ERR.
You can issue APM commands directly using "hdparm -B". Does that trigger the filter?
Also, you can apply patch from #51 together so that the kernel prints out all those special commands.
Neither hdparm -B nor any of the other triggers end up being caught by the filter. I am also certain that I am not just missing the log message because when I issue "hdparm -B 1 /dev/sda" the HDD spins down rather quickly and again reverts to not spinning down when I issue "hdparm -B 254 /dev/sda". As I applied both patches together I can also see which commands are generated: [ 174.368158] ata1.00: cmd ef/05:fe:00:00:00/00:00:00:00:00/40 tag 0 [ 215.019090] ata1.00: cmd ef/05:01:00:00:00/00:00:00:00:00/40 tag 0 [ 236.158646] ata1.00: cmd ef/05:fe:00:00:00/00:00:00:00:00/40 tag 0 Any idea?
Created attachment 22984 [details] filter-out-apm-1.patch It was looking at tf before it was initialized. This means that TPM command filtering has never really worked. I'll fix that up. The attached patch should filter correctly. Thanks.
Good news: After applying the patch from #58 along with the "libata: unbreak TPM filtering by reorganizing ata_scsi_pass_thru()" patch you sent by e-mail APM filtering seems to work correctly and I can no longer reproduce the freezes.
Great, well, I'm not yet sure how to proceed here. We can filter it out from the kernel but it's also that the power management thing in KDE is being too aggressive. ATA APM settings aren't defined very clearly and implementation varies greatly between different products. On some, high values easily cause overheating and 254 is the highest value while 255 wraps to zero for some reason and lower values can cause too frequent head unloads leading to shortened lifespan of the device. Plus, being a not-so-frequently feature, it's not too unexpected that some device will throw up when hit with the command repeatedly like you experienced. In short, APM setting is not something you wanna mess with unless you specifically know which value means what and that it would work and filtering out APM commands from the kernel would also block other safer uses which could possibly be using different more well supported value. If KDE power management wants to directly control APM setting, it will have to build a database of supported APM values on each mobile drive product line and which values mean what for each of them and whether it's safe to use. Hmmmm... Can you please report this to KDE and ask them not to adjust APM settings? It's just too dangerous. I'm resolving this one as WILL_NOT_FIX for now but please feel free to add comments and/or reopen if you think there's more to be done from the kernel side. Thanks.
First of all, thanks for all the help in analyzing this issue! I have kind of mixed feelings with regards to attempting to solve this on the user level, but also understand that unconditional filtering on the Kernel level is a bad idea. As I noticed that the bug does not seem to be KDE-specific (the original Ubuntu bug report was for Gnome) I digged a little bit deeper and it turned out that the acpi-support package is responsible for invoking hdparm -B way too often on various occasions (and with hardcoded APM values). I have updated the original Ubuntu bug report with the findings from this one and assigned it to the apm-support package.
Ah... I thought it was done by kpowersaved or something similar as I never noticed it with my opensuse gnome installation. Yeap, please report it to whichever package which is issuing the command. APM setting is simply not something to be used that way. Not yet at least. Thanks.
I see this issue with the Samsung N140 (BIOS 04CU) with *both* the Samsung HM160HI drive (fw 100-06) and the Western Digital WD1600BEVT drive (fw 1101A11). Kernels tried so far 2.6.30.6 (Samsung) 2.6.31.3 (Samsung and WD) 2.6.31.4 (WD) So it's clearly not specific to the Samsung HDD. Easy to reproduce too -- just boot (or resume). At first dmesg looks OK. Wait a few minutes and the machine will freeze and now dmesg shows in addition: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen ata1.00: cmd ca/00:08:5a:05:cd/00:00:00:00:00/e5 tag 0 dma 4096 out res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) ata1.00: status: { DRDY } ata1: link is slow to respond, please be patient (ready=0) ata1: device not ready (errno=-16), forcing hardreset ata1: soft resetting link ata1.00: configured for UDMA/133 ata1.00: device reported invalid CHS sector 0 ata1: EH complete The WD drive currently has a very bare Linux installation with no KDE, or even Xorg, and nothing doing hdparm -B (afaik). The Samsung drive currently has 28 errors like this. There have been more than 28 freezes on this disk though. $ smartctl -a /dev/sda: Error 28 occurred at disk power-on lifetime: 277 hours (11 days + 13 hours) When the command that caused the error occurred, the device was in an unknown state. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 20 ba 36 91 e3 Error: ICRC, ABRT 32 sectors at LBA = 0x039136ba = 59848378 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 20 ba 36 91 e3 00 00:33:55.312 READ DMA ef 10 03 f9 82 d0 a0 00 00:31:28.125 SET FEATURES [Reserved for Serial ATA] ca 00 08 f2 82 d0 e2 00 00:31:24.375 WRITE DMA ca 00 08 2a c2 c0 e2 00 00:31:24.375 WRITE DMA ca 00 08 e2 a2 c0 e2 00 00:31:24.375 WRITE DMA
Hans, does libata.force=1.5Gbps make any difference? Please post kernel boot log with the parameter applied and note that you may need to modify initrd depending on your distribution to apply the parameter. Thanks.
Created attachment 23407 [details] kernel.log with libata.force=1.5Gbps Here's the kernel log. Using libata.force=1.5Gbps didn't help. Is there no message to confirm the force was picked up? I put libata in my initrd, "options libata force=1.5Gbps" in modprobe.conf and rebuilt the initrd.
Well, in the log it states: kernel: Unknown boot option `libata.force=1.5Gbps': ignoring BTW, you stated that you are quite confident that no APM commands are issued. Could you try using the ata-monitor-cmds.patch to verify that this is actually the case?
Correct me if I am wrong, but I think the kernel throws out that message because of the command line I have in grub, but the module in initrd is set up correctly by modprobe.conf which is also in initrd. I did experiment with the ata-monitor-cmds.patch a few days ago with the Samsung disk and KDE. Will recheck. The WD disk has just the initial Arch installation -- just enough to boot. But I will check.
Created attachment 23410 [details] kernel.log with libata.force=1.5Gbps and monitor_cmds=1 Now with the monitoring patch applied. SETFEATURES_XFER is showing up. Is that a problem?
Note there were no ef/05 (APM) commands in the log but there was a freeze. If I type an hdparm -B command by hand I do see an ef/05 log entry of course. It's set to 128 by default by the way.
Ah... sorry about that. Your controller doesn't have access to SControl register so it won't be able to force the link speed down to 1.5Gbps. Looks like genuine transmission problem. SMART is logging them as transmission errors too. Any chance the hardware is faulty?
I think it's unlikely to be faulty hardware. I'd expect to see more randomness in the occurrence of the problem -- this is completely reproducible, including timing, and with two different disks. The netbook is new and hardly used let alone abused. I checked that the screws holding the disk in place are tight. Could the problem be in the area of power management? The smartctl error messages I posted above show ef/10/03 which is related to DIPM Device Initiated Power Management. Could the problem be something to do with power management of the link? If the link gets into a low power mode (for any reason, device initiated or otherwise) is libata/ata_piix failing to recognise this and bring it up again in the right way? Actually if the ICH7-M cant access SControl then is the online/offline status of the link ever detected properly?
Oh... DIPM can definitely do that. Can you please try "libata.acpi_gtf_filter=0x1f" and report the boot log?
Created attachment 23441 [details] kernel.log with libata.acpi_gtf_filter=0x1f In the logs I posted above I already had acpi_gtf_filter set to 7 which is the max (and default) in 2.6.31 and includes the DIPM bit. Here is the log for 2.6.32-rc4 with libata.acpi_gtf_filter=0x1f. I looked into this with powertop. After a few minutes idle, the machine switches into a low power mode, stepping down from about 9W to 6W. And it seems the freeze happens just after the machine switches into that mode for the first time after booting. The disk spins down, but perhaps more hardware gets turned off. Interesting that this does not happen again subsequently -- the state after the hardreset is different from the state just after boot. Could it be something like the vacant port going from on-but-vacant to off-and-really-invisible? Booting with acpi=off stops this switch to low power and stops the freezing -- not acceptable as a workaround but tells us something. With pci=noacpi or libata.noacpi=1 the freeze still occurs.
(cc'd Thomas.) This is getting ridiculous. It looks like certain ACPI method used when entering low power mode is issuing ATA commands to enable DIPM and it's not the _GTF. That ACPI method is issuing an ATA command behind ATA driver's back. I guess this was to achieve power saving without OS knowing about it but for some reason, after initialized by ata_piix, the host controller doesn't behave well when the device initiates powersave mode (which isn't too unexpected, maybe windows driver doesn't do full controller reset so that the configuration done by BIOS remains intact?). Anyways, we need to find out which one is responsible for this and not use the ACPI method as it's frigging dangerous for an ACPI method to issue ATA command directly without going through _GTF and the OS driver. Thanks.
Ugh, that's horrible, and dangerous. Has anyone got any ideas on how confirm this, i.e. to find which ACPI method is causing the trouble? Unfortunately I have only a couple of days' more debug time with the N140 (it's not mine).
The bug already is rather long... Let me summarize things a bit (please correct): - The machine works fine under gnome and it may be related to kpowersave init/runtime settings? - It is eventually related to backlight settings, possibly kpowersave is trying to set disk's APM settings directly if it detects a laptop or similar? Hmm, later in the bug it looks as if kpowersaved is out of the game? The ata poking causing this happens at early boot already? ACPI shouldn't be able to send ATA commands to the disk at all. _GTF and friends providing a buffer/string with ATA commands how the OS has to set up the disk. I also wonder what in the output of comment #73 points to ACPI issuing something. Please try: acpi.debug_layer=0xFFFFFFFF acpi.debug_level=0x1F log_buf_len=32M Do you see ACPI messages near the ATA failure? Please attach acpidump and a dmesg output with above boot params. You may want to increase the acpi.debug_level to 0x21F Then you will get *a lot* output. Hmm, possibly it's related to C-states? From dmesg: Monitor-Mwait will be used to enter C-2 state ata1.00: failed command: WRITE DMA Before you try above, can you try idle=poll, if this helps try idle=nomwait You won't get C-states with acpi=off, possibly it's that.
Created attachment 23470 [details] acpidump for Samsung N140 BIOS 04CU .tar.bz2 Here's the acpidump output.
Created attachment 23471 [details] kernel.log.bz2 with ACPI debug_layer=0xFFFFFFFF debug_level=0x1F Here's the kernel.log with acpi.debug_layer=0xFFFFFFFF acpi.debug_level=0x1F log_buf_len=32M
Created attachment 23472 [details] kernel.log.bz2 with ACPI debug_layer=0xFFFFFFFF debug_level=0x21F And here's the more verbose kernel.log with acpi.debug_layer=0xFFFFFFFF acpi.debug_level=0x21F log_buf_len=32M These logs show the ata disk "frozen" messages as before, plus ACPI messages. With idle=poll or idle=nomwait I again saw the freezes.
Sorry, I cannot see anything obvious. Would libata.noacpi param make sense to test? _GTM sata func is invoked, this is the only ACPI sata func I see processed, but I expect this is how it should work and there are no warnings. libata.noacpi should skip this one.
Thomas, it's not _GTM or _GTF and libata.noacpi doesn't solve the issue. If the drive is actually receiving DIPM commands, it's not going through libata. I think one of the ACPI routines might be issuing the command directly to the controller. Is there anything which look like it might do something like this? Hans, can you please verify the drive is actually getting DIPM command when this happens? You can compare the power-on hours and the timestamps of the error log to check this. Thanks.
No idea what DIPM commands are... If you could provide me with an IO/mem address to look for... I checked that SATA related PCI config space declared there does not get touched anywhere. Hmm, isn't the NC10 rather widespread? I already saw key code quirks. Or it is related to this one: [Bug 14314] Samsung n130: HDD lockup after 5 min after boot http://bugzilla.kernel.org/show_bug.cgi?id=14314 Did you already try older/different kernels?
What distro do you use? Best try a openSUSE liveCD/DVD, if it works it should be related to bug #14314.
Thomas, while the original issue popped up on a NC10 I think Hans reported an issue for an N140, which may use similar, but not identical hardware.
Thomas, thanks I had not seen Bug 14314 -- it does look like it may be the same problem I see. The N130 and the N140 are almost identical I believe, and the NC10 is very similar hardware, although perhaps the BIOS is different (perhaps in power management since Samsung claim an improved battery life, despite having almost the same hardware and the same battery capacity). I don't know if we have two issues here or one with slightly different symptoms. Somehow "ef 10 03" (= ATA_CMD_SET_FEATURES + SETFEATURES_SATA_ENABLE + SATA_DIPM) got to the disk but libata would/should filter this. If an ACPI routine issues SATA commands to the controller we would not see that directly in the log would we? In my log from Comment #78 perhaps look between 23:43:48 and 23:47:29, i.e. between end of boot and when the freeze occurred. Wild guess: how about _Q5F, _Q66 or _Q70 ? I tried blacklisting the battery module to no avail. Tejun, before I posted Comment #63 I did spend some time trying to work out if the DIPM commands I saw in the smartctl error logs were happening at the same time as the freezes. My conclusion was that yes they were, as far as I could tell with a resolution of as long as it takes to type some commands by hand. I didn't see a smartctl error log for every freeze though, and the WD disk did not record any errors, but perhaps it can't for some reason. Tejun, does SCR access via SIDPR in ata_piix.c work for ICH7-M? If not then the libata-core routines ata_{phys,}_link_{off,on}line always *all* return false don't they? Which ought to screw up the logic in libata-eh.c which depends on them, no? Or perhaps I'm misunderstanding something. Thomas, I tried kernels 2.6.28-32rc4, distro Arch Linux. I don't have access to the N140 any more unfortunately. Perhaps briefly in a month.
(In reply to comment #85) > Somehow "ef 10 03" (= ATA_CMD_SET_FEATURES + SETFEATURES_SATA_ENABLE + > SATA_DIPM) got to the disk but libata would/should filter this. If an > ACPI routine issues SATA commands to the controller we would not see > that directly in the log would we? That was my wild guess but of course I could be wrong. > Tejun, before I posted Comment #63 I did spend some time trying to work out > if the DIPM commands I saw in the smartctl error logs were happening at the > same time as the freezes. My conclusion was that yes they were, as far as > I could tell with a resolution of as long as it takes to type some commands > by hand. I didn't see a smartctl error log for every freeze though, and the > WD disk did not record any errors, but perhaps it can't for some reason. Yeah, for some reason, disks often do not log all the failures but if you see DIPM commands in the failure log which monitor_cmds doesn't report, something is bypassing the kernel and issuing command to the drive directly. > Tejun, does SCR access via SIDPR in ata_piix.c work for ICH7-M? If not > then the libata-core routines ata_{phys,}_link_{off,on}line always *all* > return false don't they? Which ought to screw up the logic in libata-eh.c > which depends on them, no? Or perhaps I'm misunderstanding something. Unfortunately, SIDPR only works from ich8 on and both link_{off,on} are supposed to return false if link status is not available. It basically tells the core that it doesn't know about the link status and libata core layer will behave correctly (all PATA controllers return false for both functions too).
Created attachment 23534 [details] requested logs Hi, I can confim the bug on N140:(
Thomas, any ideas?
DIPM, SIDPR,... doesn't tell me much. I searched for "grep -i ef [SD]SDT*.dsl" and the matches were not much. All sane was in buffers created in _GTF funcs which have to be invoked by OS and the returned buffer has to be send to the disk by OS as well, right? > Thomas, any ideas? Unfortunately, no.
Oh... DIPM or SIDPR doesn't matter at all. The thing to look out for would be direct accesses to IO ports for the ATA controller (pci dev 1f.2, the ports should be 0x1f0). Thanks.
I can't see how this is accessed anywhere by ACPI. I may have overseen something, but I am rather sure these ports are not touched by ACPI. The last two params are address and length of a port or a memory mapped IO region which can later be accessed in functions: grep "SystemIO" * DSDT.dsl: OperationRegion (PRT0, SystemIO, 0x80, 0x04) DSDT.dsl: OperationRegion (S_IO, SystemIO, 0x06B0, 0x11) DSDT.dsl: OperationRegion (IO_T, SystemIO, 0x0800, 0x10) DSDT.dsl: OperationRegion (PMIO, SystemIO, 0x1000, 0x80) DSDT.dsl: OperationRegion (GPIO, SystemIO, 0x1180, 0x3C) DSDT.dsl: OperationRegion (SECT, SystemIO, 0x5000, 0x10) DSDT.dsl: OperationRegion (SMI0, SystemIO, 0x0000FE00, 0x00000002) DSDT.dsl: OperationRegion (SMBI, SystemIO, 0x18A0, 0x10) trenn@queen:~/dsdt/samsungN140_BIOS_04CU> grep "SystemMemory" * DSDT.dsl: OperationRegion (GNVS, SystemMemory, 0x3F6D1DBC, 0x0100) DSDT.dsl: OperationRegion (RCRB, SystemMemory, 0xFED1C000, 0x4000) DSDT.dsl: OperationRegion (SNVS, SystemMemory, 0x3F6D1CBD, 0x000000FF) DSDT.dsl: OperationRegion (TCG1, SystemMemory, 0x3F6D1CB6, 0x00000007) DSDT.dsl: OperationRegion (SMI1, SystemMemory, 0x3F6D1EBD, 0x00000090) For Sata PCI config space there seem to be some bits and bytes declared as variables: OperationRegion (SACS, PCI_Config, 0x40, 0xC0) Field (SACS, DWordAcc, NoLock, Preserve) { PRIT, 16, ... } But the declared variables are only used in _GTM/_STM/_GTF functions which are not invoked by ACPI code itself.
Hmmm.... strange. Those commands definitely are not going through libata but is being logged by the drive and disabling acpi makes them go away. What could it be? Puzzled... Hans, can you please verify one more time with the ata-monitor-cmds patch and smart log? Thanks.
Tejun, unfortunately I don't have any access to the N140 at the moment. Perhaps Jan can help, or someone else who is watching this? I may be able to do some brief tests in two weeks. Perhaps some more trial and error tests could be used to try and narrow down the location of the problem. Thomas, can you suggest how we could selectively turn off parts of ACPI or individual routines in order to obtain some tighter limits on where the issue might be?
Overriding DSDT via initrd does not work with latest mainline kernels. You have to compile it in: iasl -tc DSDT.dsl Then copy DSDT.hex to the kernel source dir: drivers/acpi and compile with: CONFIG_ACPI_CUSTOM_DSDT_FILE="DSDT.hex" You can cluster code you think which could be affected with: Store("Debug to show up in syslog", debug) or Store(variable_xy, debug) Make sure that acpi.debug_layer= and acpi.debug_level is set to show this. acpi.debug_layer=0xFFFFFFF acpi.debug_level=0xF should work with a lot, but not insane amount of debug code spit out. (Latest kernels define the second debug_level bit to see stores to the debug object): ACPI_LV_INIT 0x00000001 [ ] ACPI_LV_DEBUG_OBJECT 0x00000002 [ ] that changed recently... Do you really want to go through this... You should look at some code that could affect this (I couldn't find anything according to your suggestions, but I don't know anything about disk HW...).
*** Bug 14928 has been marked as a duplicate of this bug. ***
It has been reported that if one applies a patch (by Tejun) which removes spurious IRQs then this problem is no longer seen. Please see my post here: http://marc.info/?l=linux-kernel&m=126217468229090&w=2
See also Bug 14314.