Bug 14314
Summary: | Samsung n130: HDD lockup after 5 min after boot | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Mikhail Malygin (mmalygin) |
Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
Status: | RESOLVED PATCH_ALREADY_AVAILABLE | ||
Severity: | blocking | CC: | fr4nk.is.a.geek, greg, hwerner4, js, mike, mmalygin, tj, trenn |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.31.1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | hdparm output for samsung HDD; lscpi -vv; boot log in single user mode; boot log with acpi_apic_instance=2 idle=poll; piix-spurious-irq.patch; dmesg ~5min after boot |
Description
Mikhail Malygin
2009-10-03 14:09:17 UTC
Created attachment 23252 [details]
hdparm output for samsung HDD
Created attachment 23253 [details]
lscpi -vv
Just compiled and checked with latest kernel 2.6.31.1 - still the same:
-----
[ 238.953023] Monitor-Mwait will be used to enter C-2 state
[ 238.953058] Marking TSC unstable due to TSC halts in idle
[ 303.888141] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 303.888180] ata1.00: cmd 35/00:08:fc:4f:b4/00:00:11:00:00/e0 tag 0 dma 4096 out
[ 303.888187] res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[ 303.888202] ata1.00: status: { DRDY }
[ 308.932080] ata1: link is slow to respond, please be patient (ready=0)
[ 313.916080] ata1: device not ready (errno=-16), forcing hardreset
[ 313.916104] ata1: soft resetting link
[ 314.098475] ata1.00: configured for UDMA/133
[ 314.098494] ata1.00: device reported invalid CHS sector 0
[ 314.098522] ata1: EH complete

Does it happen if you boot into single user mode and generate some IO load there? Also, please post full boot log. Thanks.

Created attachment 23257 [details]
boot log in single user mode
Yes, it does. It happens also in single-user-mode, with and without load
Just tested with acpi=off - the problem does not happen at all.

Thomas cc'd. Thomas, looks like IRQ delivery checks out after a while w/ ACPI enabled on this machine. Does anything ring a bell?

> Does anything ring a bell?
Not really.
From dmesg, you get a hint which could be related, already tried that?:
[ 0.000000] ACPI: BIOS bug: multiple APIC/MADT found, using 0
[ 0.000000] ACPI: If "acpi_apic_instance=2" works better, notify linux-acpi@vger.kernel.org
> Monitor-Mwait will be used to enter C-2 state
Hmm, I doubt this should affect irqs..., but you may still want to try idle=poll if you try acpi_apic_instance=2. That would be:
acpi_apic_instance=2 idle=poll
If it works, only try one of them and report back.
> RHEL5, Slackware,Ubuntu9.10 (from 2.6.18.X up to 2.6.30.X)
Hmm, I am pretty sure we have Samsung N130 running, I got "not working hot-keys" bugs with our latest moblin release. These have rather new kernels running, so trying out a 2.6.31 kernel is worth it. Possibly you can try an openSUSE 11.2 LiveCD or NetInstallation. This should be an easy way to try out a more recent kernel without the need of recompiling or installing. Also look out for a BIOS update...

Have just tested with "acpi_apic_instance=2 idle=poll" as kernel startup parameter - the problem still exists (see attached boot log). Kernel used: 2.6.31.1, compiled with ubuntu config. As for BIOS, it was the first thing I tried :) BIOS: 04CM, 30.09.2009

Created attachment 23286 [details]
boot log with acpi_apic_instance=2 idle=poll
Greg has such a machine as he wrote a backlight driver for it... Maybe he can point you to a relevant patch/workaround or can compare used BIOS/kernels to get it working... Is there different HW out there under the same name? Did you use the BIOS defaults or have you possibly played around with the configs?

The BIOS is almost default (there is not much you can change there :)). HW S/N: NP-N130-KA01DE, BIOS: 04CM, nothing special, I left Vista recovery partition + 100gb XP on it, so it has a grub 1.5-based dualboot.

That is an old bios, please get a new one from Samsung, they have a "special" one for people who want to run Linux. But that bios should not affect the hard drive at all from what I can tell, I have been successfully running Linux for a long time now on this machine with the same BIOS version you have. Perhaps you have a weird kernel configuration? We have prebuilt kernels that are known to run well on this hardware in the opensuse build system, you can find them at:
https://build.opensuse.org/package/show?package=kernel-default&project=Moblin%3ABase
Note that the function keys to work properly will require a bios update on your side as that is what the BIOS changes for Linux. Also note that on Vista, Samsung does not use ACPI, so their ACPI tables are known to be broken. The new BIOS update fixes this to get ACPI working properly for Windows 7 and Linux, so this might also solve your problem.

>> That is an old bios, please get a new one from Samsung, they have a
>> "special" one for people who want to run Linux.
Please give me the hint, where I can find such a "special" BIOS update. On Samsung (http://support.samsung.de/support/support_down.aspx?guid=02b6ee15-37c4-49d2-9411-6f6d29a329b7&sh1=&sh2=&sh3=&sh4=&filetype=FM) I could not find anything new... I will give it a try. BTW, can I just use the config file from that custom kernel build or that is a really "custom kernel", so I will need a complete OpenSUSE install?
I do not know where on Samsung's site you can get this BIOS update from, sorry. An engineer from Samsung gave it to me, I suggest contacting the place you bought the laptop from, as I am not allowed to redistribute the one I have. And yes, you can just use the .config file from that kernel. The only thing "custom" in it is a new driver for the function keys that is not yet upstream. But as that only works with a new BIOS update, it would not help you out here.

thanks, it worked. It works even on opensuse 11.1 (with all updates). So it looks like RH/Ubuntu configs are screwed up...

Mikhail what's the difference between the kernel config from opensuse 11.1 and the other ones you tried ? Magic configs and one-off bioses? No pixie dust please ;-). The key release problem was fixed some time ago for NC10 in atkbd.c. Patching to apply the same quirk works (at least on N140), or from 2.6.32 onwards just do this in rc.local:
echo 130,131,132,134,136,137,179,247,249 > /sys/devices/platform/i8042/serio0/force_release
Please see the discussion in Bug 13416.

I seem to have the same issue with the Samsung N140 which is almost identical to the N130 afaik.

Hmm, it seems a new BIOS for the N130 (05CM) appeared today on samsung.com but the N140 BIOS is still on 04CU (you may see a release date change but it is the same file as the 04CU released on 5 Oct). As usual Samsung has released the BIOS without release notes, which is BAD.

It looks like the reason why some people have reported success with OpenSUSE is that it includes a patch from Tejun which clears spurious IRQs. Please see http://marc.info/?l=linux-kernel&m=126217468229090&w=2 and http://bugzilla.kernel.org/show_bug.cgi?id=13416#c96

Clearing spurious IRQs fixes the problem? That is weird.
IIRC, the said patch has a potential to cause system lockup if the hardware misbehaves and it wasn't all that clear how it fixed the problem so it didn't make upstream and I don't really understand how that patch makes any difference for this case either. Hmmm... The only possibility is that for some unknown reason the controller ends up with internal IRQ pending bit set but external IRQ request line clear. When the next command completes, the already raised internal IRQ pending bit prevents the controller from raising an interrupt, which leads to a timeout. With the patch applied, if the IRQ line is shared with another IRQ source, on each shared IRQ the piix interrupt handler will check the internal pending bit and clear it, and in this case that makes the controller IRQ line unstuck. If this is what's actually happening, it's more like a happy accident that the patch resolves the issue and the proper solution would be to enable IRQ polling on the controller. I'll think about it more. Thanks.

I just tested a current git kernel (v2.6.33-rc2-268-g45d28b0) with Tejun's patch applied on my N130. The ATA exception and hang is indeed gone, just "ata1: clearing spurious IRQ" is logged. I've seen Tejun's comment in http://bugzilla.kernel.org/show_bug.cgi?id=14314 and I would like to add that the ATA irq is not shared.
$ cat /proc/interrupts
           CPU0       CPU1
  0:      50390          0   IO-APIC-edge      timer
  1:        789          0   IO-APIC-edge      i8042
  8:          1          0   IO-APIC-edge      rtc0
  9:        664          0   IO-APIC-fasteoi   acpi
 12:       9960          0   IO-APIC-edge      i8042
 14:      43885          0   IO-APIC-edge      ata_piix
 15:          0          0   IO-APIC-edge      ata_piix
 16:       2370          0   IO-APIC-fasteoi   uhci_hcd:usb5, ath9k, i915@pci:0000:00:02.0
 19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 20:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 23:         34          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
 24:          0          0   PCI-MSI-edge      HDA Intel
 25:        113          0   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
LOC:      84186     107637   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
PND:          0          0   Performance pending work
RES:       9152      13060   Rescheduling interrupts
CAL:         25        401   Function call interrupts
TLB:        629        494   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:          6          6   Machine check polls
ERR:          5
MIS:          0

Created attachment 24505 [details]
piix-spurious-irq.patch
Does this patch also work? Please attach dmesg. Thanks.
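
For anyone following the stuck-interrupt hypothesis from the earlier comment, here is a toy model of it in Python. It is only a sketch of the reasoning, not ata_piix or libata code: the class, its method names, and the rule that the request line is raised only when the pending bit goes from clear to set are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ToyController:
    """Hypothetical model of a controller with an internal IRQ pending bit
    and an external IRQ request line (names are illustrative only)."""
    pending: bool = False   # internal IRQ pending bit
    line: bool = False      # external IRQ request line

    def glitch(self):
        # Assumed bad state: pending bit set, request line clear.
        self.pending = True
        self.line = False

    def complete_command(self):
        # Assumption: the line is raised only on a clear -> set transition.
        if not self.pending:
            self.pending = True
            self.line = True

    def ack(self):
        # What an interrupt handler does after servicing the controller.
        self.pending = False
        self.line = False


def issue_and_wait(ctrl):
    """Issue one command; report whether the host ever sees an interrupt."""
    ctrl.complete_command()
    if ctrl.line:
        ctrl.ack()
        return "completed"
    return "timeout"


def clear_spurious(ctrl):
    """Model of the spurious-IRQ check: clear a pending bit found while idle."""
    if ctrl.pending:
        ctrl.ack()
        return True
    return False


if __name__ == "__main__":
    stuck = ToyController()
    stuck.glitch()
    print("without clearing:", issue_and_wait(stuck))

    cleared = ToyController()
    cleared.glitch()
    clear_spurious(cleared)
    print("with clearing:", issue_and_wait(cleared))
```

In this model a command issued after the glitch times out, while one issued after the pending bit has been cleared completes. How the handler gets to run on an unshared IRQ line is deliberately not modeled.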
piix-spurious-irq.patch works fine, usually only a spurious irq is logged, no hang or ATA exception. However, I did some stress testing with 'dd if=/dev/sda of=/dev/null' and found that this can trigger the ATA exception after boot or resume about 30% of time. Usually the exception is recovered but I also got one case (out of maybe 20 suspend/resume cycles) where dd aborted with EIO:
[ 1680.093073] ata1: clearing spurious IRQ
[ 1680.093246] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 1680.093267] ata1.00: BMDMA stat 0x4
[ 1680.093289] ata1.00: failed command: READ DMA
[ 1680.093331] ata1.00: cmd c8/00:00:58:e3:f0/00:00:00:00:00/e8 tag 0 dma 131072 in
[ 1680.093342] res 51/10:03:58:e3:f0/00:00:00:00:00/a0 Emask 0x81 (invalid argument)
[ 1680.093365] ata1.00: status: { DRDY ERR }
[ 1680.093382] ata1.00: error: { IDNF }
[ 1680.326492] ata1.00: configured for UDMA/133
[ 1680.326545] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 1680.326565] sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
[ 1680.326593] Descriptor sense data with sense descriptors (in hex):
[ 1680.326606] 72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 1680.326674] 00 f0 e3 58
[ 1680.326701] sd 0:0:0:0: [sda] Add. Sense: Recorded entity not found
[ 1680.326725] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 08 f0 e3 58 00 01 00 00
[ 1680.326783] end_request: I/O error, dev sda, sector 150004568
[ 1680.326802] Buffer I/O error on device sda, logical block 18750571
[ 1680.326832] Buffer I/O error on device sda, logical block 18750572
[ 1680.326849] Buffer I/O error on device sda, logical block 18750573
[ 1680.326865] Buffer I/O error on device sda, logical block 18750574
[ 1680.326881] Buffer I/O error on device sda, logical block 18750575
[ 1680.326897] Buffer I/O error on device sda, logical block 18750576
[ 1680.326913] Buffer I/O error on device sda, logical block 18750577
[ 1680.326929] Buffer I/O error on device sda, logical block 18750578
[ 1680.326945] Buffer I/O error on device sda, logical block 18750579
[ 1680.327057] ata1: EH complete
I think in normal use this is unlikely to happen, but the conclusion is that the Samsung BIOS still sucks.

Can you please post log for recovered error too?

Created attachment 24527 [details]
dmesg ~5min after boot
recovered ATA exception looks like this:
Jan 12 23:57:48 x kernel: [ 605.000198] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jan 12 23:57:48 x kernel: [ 605.000222] ata1.00: failed command: READ DMA
Jan 12 23:57:48 x kernel: [ 605.000254] ata1.00: cmd c8/00:00:20:64:00/00:00:00:00:00/e4 tag 0 dma 131072 in
Jan 12 23:57:48 x kernel: [ 605.000261] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 12 23:57:48 x kernel: [ 605.000276] ata1.00: status: { DRDY }
Jan 12 23:57:53 x kernel: [ 610.044098] ata1: link is slow to respond, please be patient (ready=0)
Jan 12 23:57:58 x kernel: [ 615.032100] ata1: device not ready (errno=-16), forcing hardreset
Jan 12 23:57:58 x kernel: [ 615.032122] ata1: soft resetting link
Jan 12 23:57:58 x kernel: [ 615.214484] ata1.00: configured for UDMA/133
Jan 12 23:57:58 x kernel: [ 615.214508] ata1.00: device reported invalid CHS sector 0
Jan 12 23:57:58 x kernel: [ 615.214544] ata1: EH complete
PS: I tested with current git kernel v2.6.33-rc3-291-g066000d + piix-spurious-irq.patch

So, the spurious clearing removes the timeout which occurs after 5min from boot but the above failures still exist, right? Can you trigger the above failures without the patch? Thanks.

Um, without the patch the recovered exception + timeout occurs _always_. I have not seen any non-recovered exception without the patch, but then I didn't do the stress test using dd. I had a few fsck runs though (fs is ext3 on lvm on dm_crypt), which continued after the timeout/exception. With normal usage pattern the patch almost fixes the problem, it is a huge improvement.

I wanna make sure the problem is not made worse by the patch. Can you please try dd'ing without the patch? Thanks.

I went through a couple reboots and numerous suspend/resume cycles with the dd running, and could not reproduce the ATA exception from comment #25, neither with nor without the patch applied. I only got the recoverable timeout + exception from comment #28.
Curiously today the timeout didn't _always_ happen with the unpatched kernel (but still most of the time). I also had one boot where the BIOS didn't enable the C-2 state after 5min at all (consequently no ATA problems), and this also persisted through a number of suspend/resume cycles until I rebooted. Probably just another BIOS bug... I still think the piix-spurious-irq.patch is good.

Spurious IRQ detection and clearing is something nice to have with or without this bug report as there are cases in which a runaway IRQ brings down the disk. I just wanted to make sure that the patch doesn't somehow worsen your case. I think there is a connection between the two given the close timing. The BIOS is putting the controller into a strange state. I'll prepare another patch which makes libata EH also retry on IDNF.

Patches posted upstream.
http://article.gmane.org/gmane.linux.ide/44381
http://thread.gmane.org/gmane.linux.ide/44382

Nice to see those. Are you aiming for 2.6.33 or 2.6.34 ? Regardless of the kernel process schedule, people who have these netbooks need a patched kernel to be able to use them without freezes. Do you now recommend the three patches from Comment #35 for everyone ? For 2.6.33-rc* and 2.6.32.y ?

I think it still has some problems. I'll post updated version soon and cc you and it will probably end up in .34. Thanks.

Thanks. I see the first one has reached Linus' tree:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=patch;h=534ead709235b967b659947c55d9130873a432c4
and the other two have been applied to libata-dev.git #upstream:
http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=patch;h=b86b2e86d5740da336fabb091d92db30c37feeb0
http://git.kernel.org/?p=linux/kernel/git/jgarzik/libata-dev.git;a=patch;h=d79ae28a0b16e1c81d58356401d2f343f478c729

Hi, I'd just like to add that I've seen this same issue on my Dell Mini 9 (Inspiron 910), where the spurious IRQ patch seems to be successfully suppressing the freezes I was experiencing.
From memory (Don't have it with me at present), the BIOS version installed is A04. I'm wondering if it's using a similar codebase to the samsung BIOS.. If it's the same root cause. 910 BIOS list available here: http://bit.ly/bc6ZYf
Cheers, Mike.
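
The `cmd` lines in the ATA exception logs in this report can be decoded by hand to see which sectors were involved. Below is a minimal sketch in Python, assuming the usual libata field order (command/feature:nsect:lbal:lbam:lbah/hob fields in the same order/device) and covering only the four DMA read/write opcodes. The function name and the returned dictionary are made up for illustration.

```python
import re

# command/feature:nsect:lbal:lbam:lbah/hob_feature:...:hob_lbah/device
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)

LBA28 = {0xC8: "READ DMA", 0xCA: "WRITE DMA"}
LBA48 = {0x25: "READ DMA EXT", 0x35: "WRITE DMA EXT"}


def decode_cmd(line):
    """Decode the taskfile printed in a libata 'cmd' log line (sketch only)."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no cmd taskfile found in line")
    command = int(m.group(1), 16)
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    _hfeat, hnsect, hlbal, hlbam, hlbah = (int(x, 16) for x in m.group(3).split(":"))
    device = int(m.group(4), 16)

    if command in LBA28:
        # 28-bit LBA: top four bits live in the low nibble of the device register.
        lba = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
        sectors = nsect or 256          # a count of 0 means 256 sectors
        name = LBA28[command]
    elif command in LBA48:
        # 48-bit LBA: the "hob" (high order byte) fields carry bits 24..47.
        lba = ((hlbah << 40) | (hlbam << 32) | (hlbal << 24)
               | (lbah << 16) | (lbam << 8) | lbal)
        sectors = ((hnsect << 8) | nsect) or 65536
        name = LBA48[command]
    else:
        raise ValueError("opcode 0x%02x not handled by this sketch" % command)
    return {"name": name, "lba": lba, "sectors": sectors}


if __name__ == "__main__":
    line = "ata1.00: cmd c8/00:00:58:e3:f0/00:00:00:00:00/e8 tag 0 dma 131072 in"
    print(decode_cmd(line))
```

For the failed READ DMA above this gives LBA 150004568, the same sector the kernel reports in its `end_request: I/O error` line, and 256 sectors of 512 bytes, which matches `dma 131072`.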