Bug 14314 - Samsung n130: HDD lockup after 5 min after boot
Summary: Samsung n130: HDD lockup after 5 min after boot
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 blocking
Assignee: Jeff Garzik
Reported: 2009-10-03 14:09 UTC by Mikhail Malygin
Modified: 2010-05-18 00:51 UTC

See Also:
Kernel Version: 2.6.31.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
hdparm output for samsung HDD (728 bytes, text/plain)
2009-10-03 14:09 UTC, Mikhail Malygin
lscpi -vv (9.82 KB, text/plain)
2009-10-03 14:11 UTC, Mikhail Malygin
boot log in single user mode (37.01 KB, text/plain)
2009-10-04 15:52 UTC, Mikhail Malygin
boot log with acpi_apic_instance=2 idle=poll (38.29 KB, text/plain)
2009-10-06 15:13 UTC, Mikhail Malygin
piix-spurious-irq.patch (3.52 KB, patch)
2010-01-11 07:54 UTC, Tejun Heo
dmesg ~5min after boot (37.04 KB, text/plain)
2010-01-13 00:00 UTC, Johannes Stezenbach

Description Mikhail Malygin 2009-10-03 14:09:17 UTC
Hi,
I get a system hang (every time) about 5 minutes after boot or resume, running Ubuntu 9.04 on a brand-new Samsung N130. The system is completely locked for 15 seconds. I have already tried all the kernel boot parameters suggested on Google (pci=noacpi, noapic, nolapic, hz=off, combinemode ..., etc.), and I also tried installing various systems: RHEL5, Slackware, Ubuntu 9.10 (kernels from 2.6.18.X up to 2.6.30.X). All installations show the same behavior. Windows runs fine, and the HDD has been checked. Below is the dmesg output:

-------------------------------------------
[  123.129158] padlock: VIA PadLock not detected.
[  223.164814] Monitor-Mwait will be used to enter C-2 state
[  223.170516] Marking TSC unstable due to TSC halts in idle
[  285.804161] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  285.804191] ata1.00: cmd ca/00:08:8a:a8:f1/00:00:00:00:00/ec tag 0 dma 4096 out
[  285.804197]          res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[  285.804210] ata1.00: status: { DRDY }
[  290.844079] ata1: link is slow to respond, please be patient (ready=0)
[  295.828084] ata1: device not ready (errno=-16), forcing hardreset
[  295.828105] ata1: soft resetting link
[  296.010535] ata1.00: configured for UDMA/133
[  296.010584] ata1: EH complete
[  296.012783] sd 0:0:0:0: [sda] 312581808 512-byte hardware sectors: (160 GB/149 GiB)
[  296.012917] sd 0:0:0:0: [sda] Write Protect is off
[  296.012934] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[  296.013173] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
-------------------------------------------

This is a real blocker for using Linux on this laptop :( Any suggestions?
Comment 1 Mikhail Malygin 2009-10-03 14:09:58 UTC
Created attachment 23252
hdparm output for samsung HDD
Comment 2 Mikhail Malygin 2009-10-03 14:11:08 UTC
Created attachment 23253
lscpi -vv
Comment 3 Mikhail Malygin 2009-10-03 22:03:31 UTC
Just compiled and checked with latest kernel 2.6.31.1 - still the same:
-----
[  238.953023] Monitor-Mwait will be used to enter C-2 state
[  238.953058] Marking TSC unstable due to TSC halts in idle
[  303.888141] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  303.888180] ata1.00: cmd 35/00:08:fc:4f:b4/00:00:11:00:00/e0 tag 0 dma 4096 out
[  303.888187]          res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[  303.888202] ata1.00: status: { DRDY }
[  308.932080] ata1: link is slow to respond, please be patient (ready=0)
[  313.916080] ata1: device not ready (errno=-16), forcing hardreset
[  313.916104] ata1: soft resetting link
[  314.098475] ata1.00: configured for UDMA/133
[  314.098494] ata1.00: device reported invalid CHS sector 0
[  314.098522] ata1: EH complete
Comment 4 Tejun Heo 2009-10-04 00:09:05 UTC
Does it happen if you boot into single user mode and generate some IO load there?  Also, please post full boot log.  Thanks.
Comment 5 Mikhail Malygin 2009-10-04 15:52:31 UTC
Created attachment 23257
boot log in single user mode

Yes, it does. It also happens in single-user mode, with and without load.
Comment 6 Mikhail Malygin 2009-10-05 19:23:51 UTC
Just tested with acpi=off - the problem does not happen at all.
Comment 7 Tejun Heo 2009-10-06 05:31:09 UTC
Thomas cc'd.  Thomas, looks like IRQ delivery checks out after a while w/ ACPI enabled on this machine.  Does anything ring a bell?
Comment 8 Thomas Renninger 2009-10-06 10:18:10 UTC
> Does anything ring a bell?
Not really

The dmesg output contains a hint that could be related. Have you already tried it?
[    0.000000] ACPI: BIOS bug: multiple APIC/MADT found, using 0
[    0.000000] ACPI: If "acpi_apic_instance=2" works better, notify linux-acpi@vger.kernel.org

> Monitor-Mwait will be used to enter C-2 state
Hmm, I doubt this should affect irqs..., but you may still want to try idle=poll if you try acpi_apic_instance=2.

That would be:
acpi_apic_instance=2 idle=poll
If it works, then try only one of them at a time and report back.

> RHEL5, Slackware,Ubuntu9.10 (from 2.6.18.X up to 2.6.30.X)
Hmm, I am pretty sure we have the Samsung N130 running; I got "hot-keys not working" bug reports against our latest Moblin release. Those releases run rather new kernels, so trying a 2.6.31 kernel is worth it. You could try an openSUSE 11.2 LiveCD or NetInstallation, which should be an easy way to try a more recent kernel without recompiling or installing.
Comment 9 Thomas Renninger 2009-10-06 10:18:33 UTC
Also look out for a BIOS update...
Comment 10 Mikhail Malygin 2009-10-06 15:13:12 UTC
Have just tested with "acpi_apic_instance=2 idle=poll" as kernel startup parameter - the problem still exists (see attached boot log). Kernel used: 2.6.31.1, compiled with ubuntu config.
As for the BIOS, updating it was the first thing I tried :) BIOS: 04CM, 30.09.2009
Comment 11 Mikhail Malygin 2009-10-06 15:13:57 UTC
Created attachment 23286
boot log with acpi_apic_instance=2 idle=poll
Comment 12 Thomas Renninger 2009-10-06 15:41:14 UTC
Greg has such a machine as he wrote a backlight driver for it...
Maybe he can point you to a relevant patch/workaround or can compare used BIOS/kernels to get it working...
Is there different HW out there under the same name?
Did you use the BIOS defaults or have you possibly played around with the configs?
Comment 13 Mikhail Malygin 2009-10-06 16:05:19 UTC
The BIOS settings are almost at their defaults (there is not much you can change there :)).
HW S/N: NP-N130-KA01DE, BIOS: 04CM, nothing special. I left the Vista recovery partition plus a 100 GB XP partition on it, so it has a grub 1.5-based dual boot.
Comment 14 Greg Kroah-Hartman 2009-10-06 16:28:15 UTC
That is an old bios, please get a new one from Samsung, they have a "special" one for people who want to run Linux.

But that BIOS should not affect the hard drive at all from what I can tell. I have been successfully running Linux for a long time now on this machine with the same BIOS version you have. Perhaps you have a weird kernel configuration?

We have prebuilt kernels that are known to run well on this hardware in the opensuse build system, you can find them at:
  https://build.opensuse.org/package/show?package=kernel-default&project=Moblin%3ABase

Note that getting the function keys to work properly will require a BIOS update on your side, as that is what the BIOS update changes for Linux.

Also note that on Vista, Samsung does not use ACPI, so their ACPI tables are known to be broken.  The new BIOS update fixes this to get ACPI working properly for Windows 7 and Linux, so this might also solve your problem.
Comment 15 Mikhail Malygin 2009-10-06 16:57:30 UTC
>> That is an old bios, please get a new one from Samsung, they have a
>> "special" one for people who want to run Linux.
Please give me a hint where I can find such a "special" BIOS update. On the Samsung site (http://support.samsung.de/support/support_down.aspx?guid=02b6ee15-37c4-49d2-9411-6f6d29a329b7&sh1=&sh2=&sh3=&sh4=&filetype=FM) I could not find anything new...

I will give the prebuilt kernel a try. BTW, can I just use the config file from that custom kernel build, or is it really a "custom kernel", so that I will need a complete openSUSE install?
Comment 16 Greg Kroah-Hartman 2009-10-06 17:09:39 UTC
I do not know where on Samsung's site you can get this BIOS update from, sorry.  An engineer from Samsung gave it to me, I suggest contacting the place you bought the laptop from, as I am not allowed to redistribute the one I have.

And yes, you can just use the .config file from that kernel.  The only thing "custom" in it is a new driver for the function keys that is not yet upstream.  But as that only works with a new BIOS update, it would not help you out here.
Comment 17 Mikhail Malygin 2009-10-08 12:32:59 UTC
Thanks, it worked. It even works on openSUSE 11.1 (with all updates). So it looks like the RH/Ubuntu configs are screwed up...
Comment 18 Hans Werner 2009-10-20 23:43:50 UTC
Mikhail, what's the difference between the kernel config from openSUSE 11.1
and the other ones you tried?

Magic configs and one-off bioses? No pixie dust please ;-).

The key release problem was fixed some time ago for NC10 in atkbd.c. Patching to apply the same quirk works (at least on N140), or from 2.6.32 onwards just do
this in rc.local:
echo 130,131,132,134,136,137,179,247,249 > /sys/devices/platform/i8042/serio0/force_release
Comment 19 Hans Werner 2009-10-22 17:43:51 UTC
Please see the discussion in Bug 13416. I seem to have the same issue
with the Samsung N140 which is almost identical to the N130 afaik.
Comment 20 Hans Werner 2009-10-23 14:34:27 UTC
Hmm, it seems a new BIOS for the N130 (05CM) appeared today on samsung.com
but the N140 BIOS is still on 04CU (you may see a release date change but it
is the same file as the 04CU released on 5 Oct).

As usual, Samsung has released the BIOS without release notes, which is BAD.
Comment 21 Hans Werner 2010-01-02 12:57:22 UTC
It looks like the reason why some people have reported success with OpenSUSE
is that it includes a patch from Tejun which clears spurious IRQs. Please see 
http://marc.info/?l=linux-kernel&m=126217468229090&w=2
and
http://bugzilla.kernel.org/show_bug.cgi?id=13416#c96
Comment 22 Tejun Heo 2010-01-02 23:28:25 UTC
Clearing spurious IRQs fixes the problem? That is weird. IIRC, the said patch has the potential to cause a system lockup if the hardware misbehaves, and it wasn't all that clear how it fixed the problem, so it didn't make it upstream. I don't really understand how that patch makes any difference in this case either.

Hmmm... The only possibility is that, for some unknown reason, the controller ends up with its internal IRQ pending bit set but the external IRQ request line clear. When the next command completes, the already-raised internal IRQ pending bit prevents the controller from raising an interrupt, which leads to a timeout. With the patch applied, if the IRQ line is shared with another IRQ source, the piix interrupt handler checks the internal pending bit on each shared IRQ and clears it, and in this case that gets the controller IRQ line unstuck.

If this is what's actually happening, it's more like a happy accident that the patch resolves the issue and the proper solution would be to enable IRQ polling on the controller.  I'll think about it more.

Thanks.
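
The hypothesis above can be illustrated with a toy model. This is only a sketch of the reasoning, not ata_piix code: the class, its methods, and the "glitch" step are all hypothetical, and it assumes an interrupt reaches the CPU only when the internal pending bit goes from clear to set.

```python
class ToyController:
    """Hypothetical controller: an interrupt is raised only on a 0 -> 1
    transition of the internal pending bit."""

    def __init__(self):
        self.pending = False       # internal IRQ pending bit
        self.irqs_delivered = 0    # interrupts the CPU actually saw

    def complete_command(self):
        # A completion raises an interrupt only if the bit was clear.
        if not self.pending:
            self.pending = True
            self.irqs_delivered += 1

    def ack(self):
        # What an interrupt handler does: clear the pending bit.
        self.pending = False

    def glitch(self):
        # Assumed failure mode: the bit gets set without any interrupt
        # reaching the CPU.
        self.pending = True


def command_completes_with_irq(clear_spurious):
    ctrl = ToyController()
    ctrl.glitch()
    if clear_spurious:
        # A handler that runs for an unrelated reason notices the stale
        # bit and clears it.
        ctrl.ack()
    ctrl.complete_command()
    # True if the driver saw the completion interrupt; False models the
    # command timeout.
    return ctrl.irqs_delivered == 1
```

In this model `command_completes_with_irq(False)` corresponds to the timeout and `command_completes_with_irq(True)` to the case where the stale bit is cleared first.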
Comment 23 Johannes Stezenbach 2010-01-04 00:31:01 UTC
I just tested a current git kernel (v2.6.33-rc2-268-g45d28b0)
with Tejun's patch applied on my N130.  The ATA exception and hang is
indeed gone, just "ata1: clearing spurious IRQ" is logged.

I've seen Tejun's comment in
http://bugzilla.kernel.org/show_bug.cgi?id=14314
and I would like to add that the ATA irq is not shared.

$ cat /proc/interrupts
           CPU0       CPU1       
  0:      50390          0   IO-APIC-edge      timer
  1:        789          0   IO-APIC-edge      i8042
  8:          1          0   IO-APIC-edge      rtc0
  9:        664          0   IO-APIC-fasteoi   acpi
 12:       9960          0   IO-APIC-edge      i8042
 14:      43885          0   IO-APIC-edge      ata_piix
 15:          0          0   IO-APIC-edge      ata_piix
 16:       2370          0   IO-APIC-fasteoi   uhci_hcd:usb5, ath9k, i915@pci:0000:00:02.0
 19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 20:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 23:         34          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
 24:          0          0   PCI-MSI-edge      HDA Intel
 25:        113          0   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
LOC:      84186     107637   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
PND:          0          0   Performance pending work
RES:       9152      13060   Rescheduling interrupts
CAL:         25        401   Function call interrupts
TLB:        629        494   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:          6          6   Machine check polls
ERR:          5
MIS:          0
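
Whether an IRQ is shared can also be read off mechanically. The following is a minimal sketch that assumes the /proc/interrupts layout shown in this comment (per-CPU counts, one chip/type column, then a comma-separated handler list); other kernels print additional columns, so the function name and the column handling are illustrative only.

```python
def parse_interrupts(text):
    """Map numeric IRQ -> list of handler names for text in the
    /proc/interrupts layout shown above (an assumption)."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # Skip summary rows (NMI, LOC, ...) and rows without a handler.
        if not label.strip().isdigit() or len(fields) <= ncpu + 1:
            continue
        # fields: per-CPU counts, chip/type, then the handler list
        handlers = " ".join(fields[ncpu + 1:]).split(", ")
        table[int(label)] = handlers
    return table


SAMPLE = """\
           CPU0       CPU1
 14:      43885          0   IO-APIC-edge      ata_piix
 15:          0          0   IO-APIC-edge      ata_piix
 16:       2370          0   IO-APIC-fasteoi   uhci_hcd:usb5, ath9k, i915@pci:0000:00:02.0
NMI:          0          0   Non-maskable interrupts
"""

irqs = parse_interrupts(SAMPLE)
shared = {irq: names for irq, names in irqs.items() if len(names) > 1}
```

With the sample rows taken from the listing above, IRQ 14 has a single handler (ata_piix) and only IRQ 16 ends up in `shared`.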
Comment 24 Tejun Heo 2010-01-11 07:54:12 UTC
Created attachment 24505
piix-spurious-irq.patch

Does this patch also work?  Please attach dmesg.  Thanks.
Comment 25 Johannes Stezenbach 2010-01-12 23:57:51 UTC
piix-spurious-irq.patch works fine, usually only a spurious irq is
logged, no hang or ATA exception.

However, I did some stress testing with 'dd if=/dev/sda of=/dev/null'
and found that this can trigger the ATA exception after boot
or resume about 30% of the time. Usually the exception is recovered,
but I also got one case (out of maybe 20 suspend/resume cycles)
where dd aborted with EIO:

[ 1680.093073] ata1: clearing spurious IRQ
[ 1680.093246] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 1680.093267] ata1.00: BMDMA stat 0x4
[ 1680.093289] ata1.00: failed command: READ DMA
[ 1680.093331] ata1.00: cmd c8/00:00:58:e3:f0/00:00:00:00:00/e8 tag 0 dma 131072 in
[ 1680.093342]          res 51/10:03:58:e3:f0/00:00:00:00:00/a0 Emask 0x81 (invalid argument)
[ 1680.093365] ata1.00: status: { DRDY ERR }
[ 1680.093382] ata1.00: error: { IDNF }
[ 1680.326492] ata1.00: configured for UDMA/133
[ 1680.326545] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 1680.326565] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
[ 1680.326593] Descriptor sense data with sense descriptors (in hex):
[ 1680.326606]         72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
[ 1680.326674]         00 f0 e3 58 
[ 1680.326701] sd 0:0:0:0: [sda] Add. Sense: Recorded entity not found
[ 1680.326725] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 08 f0 e3 58 00 01 00 00
[ 1680.326783] end_request: I/O error, dev sda, sector 150004568
[ 1680.326802] Buffer I/O error on device sda, logical block 18750571
[ 1680.326832] Buffer I/O error on device sda, logical block 18750572
[ 1680.326849] Buffer I/O error on device sda, logical block 18750573
[ 1680.326865] Buffer I/O error on device sda, logical block 18750574
[ 1680.326881] Buffer I/O error on device sda, logical block 18750575
[ 1680.326897] Buffer I/O error on device sda, logical block 18750576
[ 1680.326913] Buffer I/O error on device sda, logical block 18750577
[ 1680.326929] Buffer I/O error on device sda, logical block 18750578
[ 1680.326945] Buffer I/O error on device sda, logical block 18750579
[ 1680.327057] ata1: EH complete

I think in normal use this is unlikely to happen, but the conclusion is
that the Samsung BIOS still sucks.
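
For reading logs like the one above, the `cmd` line can be decoded into an opcode and a start sector. This is a sketch that assumes the byte order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device; that order is consistent with the failing sector the kernel reports a few lines later. The helper name and the small opcode table are illustrative and not exhaustive.

```python
import re

# Small, non-exhaustive opcode table for the commands seen in this report.
ATA_OPCODES = {
    0xC8: "READ DMA",
    0xCA: "WRITE DMA",
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
}
LBA48_OPCODES = {0x25, 0x35}


def decode_cmd(line):
    """Return (command name, start LBA) from a libata 'cmd' log line."""
    m = re.search(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})", line)
    if m is None:
        raise ValueError("no taskfile found in line")
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     device) = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    lba = lbal | lbam << 8 | lbah << 16
    if command in LBA48_OPCODES:
        # 48-bit addressing: the upper bytes come from the HOB registers.
        lba |= hob_lbal << 24 | hob_lbam << 32 | hob_lbah << 40
    else:
        # 28-bit addressing: bits 24..27 are the low nibble of 'device'.
        lba |= (device & 0x0F) << 24
    return ATA_OPCODES.get(command, hex(command)), lba


name, lba = decode_cmd(
    "ata1.00: cmd c8/00:00:58:e3:f0/00:00:00:00:00/e8 tag 0 dma 131072 in")
```

For the failed READ DMA above this gives sector 150004568, the same number as in the `end_request: I/O error` line.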
Comment 26 Tejun Heo 2010-01-12 23:59:58 UTC
Can you please post the log for the recovered error too?
Comment 27 Johannes Stezenbach 2010-01-13 00:00:06 UTC
Created attachment 24527
dmesg ~5min after boot
Comment 28 Johannes Stezenbach 2010-01-13 00:03:40 UTC
The recovered ATA exception looks like this:

Jan 12 23:57:48 x kernel: [  605.000198] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 12 23:57:48 x kernel: [  605.000222] ata1.00: failed command: READ DMA
Jan 12 23:57:48 x kernel: [  605.000254] ata1.00: cmd c8/00:00:20:64:00/00:00:00:00:00/e4 tag 0 dma 131072 in
Jan 12 23:57:48 x kernel: [  605.000261]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 12 23:57:48 x kernel: [  605.000276] ata1.00: status: { DRDY }
Jan 12 23:57:53 x kernel: [  610.044098] ata1: link is slow to respond, please be patient (ready=0)
Jan 12 23:57:58 x kernel: [  615.032100] ata1: device not ready (errno=-16), forcing hardreset
Jan 12 23:57:58 x kernel: [  615.032122] ata1: soft resetting link
Jan 12 23:57:58 x kernel: [  615.214484] ata1.00: configured for UDMA/133
Jan 12 23:57:58 x kernel: [  615.214508] ata1.00: device reported invalid CHS sector 0
Jan 12 23:57:58 x kernel: [  615.214544] ata1: EH complete
Comment 29 Johannes Stezenbach 2010-01-13 00:05:12 UTC
PS: I tested with current git kernel v2.6.33-rc3-291-g066000d
+ piix-spurious-irq.patch
Comment 30 Tejun Heo 2010-01-13 00:06:13 UTC
So, the spurious clearing removes the timeout which occurs after 5min from boot but the above failures still exist, right?  Can you trigger the above failures without the patch?

Thanks.
Comment 31 Johannes Stezenbach 2010-01-13 00:15:39 UTC
Um, without the patch the recovered exception + timeout occurs
_always_. I have not seen any non-recovered exception
without the patch, but then I didn't do the stress test
using dd.  I had a few fsck runs though (fs is ext3 on lvm on dm_crypt),
which continued after the timeout/exception.

With a normal usage pattern the patch almost fixes the problem;
it is a huge improvement.
Comment 32 Tejun Heo 2010-01-13 00:35:57 UTC
I wanna make sure the problem is not made worse by the patch.  Can you please try dd'ing without the patch?

Thanks.
Comment 33 Johannes Stezenbach 2010-01-14 00:09:06 UTC
I went through a couple reboots and numerous suspend/resume
cycles with the dd running, and could not reproduce the ATA exception
from comment #25, neither with nor without the patch applied.
I only got the recoverable timeout + exception from comment #28.
Curiously today the timeout didn't _always_ happen with the
unpatched kernel (but still most of the time). I also had
one boot where the BIOS didn't enable the C-2 state after 5min at all
(consequently no ATA problems), and this also persisted through a
number of suspend/resume cycles until I rebooted. Probably
just another BIOS bug...

I still think the piix-spurious-irq.patch is good.
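
The timing relation between the C-2 state message and the ATA exception can be checked from a dmesg capture. A minimal sketch follows, using three lines from the original description as sample input; the helper name is hypothetical. Note that the exception timestamp marks when the timeout was reported, not when the command was issued.

```python
import re


def first_timestamp(log, needle):
    """Return the dmesg timestamp (seconds) of the first line containing
    'needle', or None if no line matches."""
    for line in log.splitlines():
        m = re.match(r"\[\s*(\d+\.\d+)\]\s+(.*)", line)
        if m and needle in m.group(2):
            return float(m.group(1))
    return None


LOG = """\
[  223.164814] Monitor-Mwait will be used to enter C-2 state
[  223.170516] Marking TSC unstable due to TSC halts in idle
[  285.804161] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
"""

c2_enabled = first_timestamp(LOG, "enter C-2 state")
ata_exception = first_timestamp(LOG, "ata1.00: exception")
delay = ata_exception - c2_enabled   # seconds between the two events
print(f"ATA exception reported {delay:.1f} s after the C-2 message")
```

Running the same helper over several boot logs would show whether the exception consistently follows the C-2 message, and whether boots without the C-2 message also lack the exception.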
Comment 34 Tejun Heo 2010-01-14 06:00:13 UTC
Spurious IRQ detection and clearing is nice to have with or without this bug report, as there are cases in which a runaway IRQ brings down the disk. I just wanted to make sure that the patch doesn't somehow worsen your case. I think there is a connection between the two given the close timing. The BIOS is putting the controller into a strange state. I'll prepare another patch which makes libata EH also retry on IDNF.
Comment 36 Hans Werner 2010-01-19 00:52:51 UTC
Nice to see those. Are you aiming for 2.6.33 or 2.6.34? Regardless of the kernel process schedule, people who have these netbooks need a patched kernel to be able to use them without freezes. Do you now recommend the three patches from Comment #35 for everyone? For 2.6.33-rc* and 2.6.32.y?
Comment 37 Tejun Heo 2010-01-19 00:56:35 UTC
I think it still has some problems. I'll post an updated version soon and cc you; it will probably end up in .34. Thanks.
Comment 39 Mike Beattie 2010-05-18 00:51:54 UTC
Hi,

I'd just like to add that I've seen this same issue on my Dell Mini 9 (Inspiron 910), where the spurious IRQ patch seems to be successfully suppressing the freezes I was experiencing.

From memory (I don't have it with me at present), the BIOS version installed is A04. I'm wondering whether it uses a codebase similar to the Samsung BIOS, and whether this is the same root cause.

910 BIOS list available here:
http://bit.ly/bc6ZYf

Cheers,
Mike.