Bug 15708 - ATA_PIIX often freezes laptop during boot
Summary: ATA_PIIX often freezes laptop during boot
Status: RESOLVED INSUFFICIENT_DATA
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 high
Assignee: Tejun Heo
URL:
Keywords:
Duplicates: 14244
Depends on:
Blocks:
 
Reported: 2010-04-07 08:56 UTC by Stefan Krause
Modified: 2010-12-09 09:27 UTC

See Also:
Kernel Version: 2.6.33.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg for kernel 2.6.33 with ata_piix disabled (booting correctly) (89.65 KB, text/plain)
2010-04-07 08:56 UTC, Stefan Krause
lshw for kernel 2.6.33.2 without ata_piix (24.11 KB, text/plain)
2010-04-07 08:57 UTC, Stefan Krause
lsmod for kernel 2.6.33 with ata_piix disabled (2.32 KB, text/plain)
2010-04-07 09:14 UTC, Stefan Krause
lspci for kernel 2.6.33 with ata_piix disabled (18.09 KB, text/plain)
2010-04-07 09:15 UTC, Stefan Krause
Screenshot: System freezes with ata_piix (376.89 KB, image/jpeg)
2010-04-07 09:15 UTC, Stefan Krause
dmesg for 2.6.33 with ata_piix after a successful boot (54.59 KB, text/plain)
2010-04-07 09:18 UTC, Stefan Krause
lshw for 2.6.33 with ata_piix after a successful boot (24.37 KB, text/plain)
2010-04-07 09:19 UTC, Stefan Krause
lsmod for 2.6.33 with ata_piix after a successful boot (2.32 KB, text/plain)
2010-04-07 09:20 UTC, Stefan Krause
lspci for 2.6.33 with ata_piix after a successful boot (18.20 KB, text/plain)
2010-04-07 09:20 UTC, Stefan Krause
config used for 2.6.33.2 without ata_piix (91.11 KB, text/plain)
2010-04-07 09:22 UTC, Stefan Krause
config used for 2.6.33.2 with ata_piix (89.57 KB, text/plain)
2010-04-07 09:24 UTC, Stefan Krause
ata_piix-init-dbg.patch (3.22 KB, patch)
2010-04-08 01:47 UTC, Tejun Heo
Screenshot after applying the patch (388.69 KB, image/jpeg)
2010-04-08 20:32 UTC, Stefan Krause
screenshot with initcall_debug and core_initcall(async_init) commented out (386.99 KB, image/jpeg)
2010-04-14 20:18 UTC, Stefan Krause
Screenshot patch + initcall_debug + no core_initcall(async_init) (385.99 KB, image/jpeg)
2010-04-15 19:11 UTC, Stefan Krause
activate-debug.patch (3.12 KB, patch)
2010-04-18 22:48 UTC, Tejun Heo
crash with libata-sff patched (425.30 KB, image/jpeg)
2010-04-25 15:36 UTC, Stefan Krause
dmidecode (7.55 KB, text/plain)
2010-04-25 18:16 UTC, Stefan Krause
ahci-no-intx.patch (544 bytes, patch)
2010-05-17 06:33 UTC, Tejun Heo

Description Stefan Krause 2010-04-07 08:56:15 UTC
Created attachment 25891 [details]
dmesg for kernel 2.6.33 with ata_piix disabled (booting correctly)

Ubuntu Lucid 10.04 started to freeze on about half of all boots when I switched from 2.6.32-16 to 2.6.32-17.
Fedora 13 has the same behaviour, whilst OpenSuse 11.3 M3 boots correctly.

To investigate further I downloaded kernel 2.6.33.2 and built my own kernel on Lucid.
I found that as soon as I enable "Intel ESB, ICH, PIIX3, PIIX4 PATA/SATA support (ATA_PIIX)" the system starts to freeze during boot about half the time.

I'll attach a (blurry) photo showing the screen when the system hangs. To me it appears to be some random problem (since (partially) disabling acpi seems to make it go away).

When I enable the other ATA options, but disable the above mentioned ata_piix option the kernel boots, but I lose access to my dvd-drive.

I managed to compile a kernel with some options from "ATA/ATAPI/MFM/RLL support (DEPRECATED)" that gave me partial access to my dvd-drive, but caused lots of error messages in dmesg.

Please let me know if I can do anything to help investigate this bug. I'm also interested in workarounds to get my notebook working.


uname -a
Linux hape 2.6.33.2-noata6 #2 SMP PREEMPT Wed Apr 7 09:02:59 CEST 2010 i686 GNU/Linux
Comment 1 Stefan Krause 2010-04-07 08:57:24 UTC
Created attachment 25892 [details]
lshw for kernel 2.6.33.2 without ata_piix
Comment 2 Stefan Krause 2010-04-07 09:14:31 UTC
Created attachment 25894 [details]
lsmod for kernel 2.6.33 with ata_piix disabled
Comment 3 Stefan Krause 2010-04-07 09:15:22 UTC
Created attachment 25895 [details]
lspci for kernel 2.6.33 with ata_piix disabled
Comment 4 Stefan Krause 2010-04-07 09:15:57 UTC
Created attachment 25896 [details]
Screenshot: System freezes with ata_piix
Comment 5 Stefan Krause 2010-04-07 09:18:04 UTC
Created attachment 25897 [details]
dmesg for 2.6.33 with ata_piix after a successful boot

It took three attempts to boot successfully into 2.6.33 when ata_piix is enabled. I'll attach dmesg, lspci, lshw for that configuration.
Comment 6 Stefan Krause 2010-04-07 09:19:29 UTC
Created attachment 25898 [details]
lshw for 2.6.33 with ata_piix after a successful boot
Comment 7 Stefan Krause 2010-04-07 09:20:26 UTC
Created attachment 25899 [details]
lsmod for 2.6.33 with ata_piix after a successful boot
Comment 8 Stefan Krause 2010-04-07 09:20:54 UTC
Created attachment 25900 [details]
lspci for 2.6.33 with ata_piix after a successful boot
Comment 9 Stefan Krause 2010-04-07 09:22:54 UTC
Created attachment 25901 [details]
config used for 2.6.33.2 without ata_piix

This is the kernel configuration that will boot fine (but dvd-drive is working strangely)
Comment 10 Stefan Krause 2010-04-07 09:24:23 UTC
Created attachment 25902 [details]
config used for 2.6.33.2 with ata_piix

This is the kernel configuration that freezes about 50% of the time.
Comment 11 Tejun Heo 2010-04-08 01:47:53 UTC
Created attachment 25908 [details]
ata_piix-init-dbg.patch

Thanks for the detailed reporting Stefan.  That's a strange place to hang.  Let's see where it's hanging.  Can you please apply the attached patch, cause the hang and post the screenshot?
Comment 12 Stefan Krause 2010-04-08 20:32:12 UTC
Created attachment 25921 [details]
Screenshot after applying the patch

Hi Tejun,

thanks a lot. The screenshot is attached. Please note that I added one more debug message right after computing the return value.
1520d1519
< 	int rv;
1628,1630c1627
< 	rv = ata_pci_sff_activate_host(host, ata_sff_interrupt, &piix_sht);
< 	dev_printk(KERN_INFO, &pdev->dev, "XXX after ata_pci_sff_activate_host\n");
< 	return rv;
---
>       return ata_pci_sff_activate_host(host, ata_sff_interrupt, &piix_sht);


The bad news (for me) is that the kernel returns correctly from piix_init_one().
I booted three times and my notebook froze twice. Both times the last message was now "... ehci_hcd: debug port 1".

Any idea?
Comment 13 Stefan Krause 2010-04-12 17:58:50 UTC
Two updates:
1. OpenSuse 11.3 also has the freeze issue after updating to 2.6.34-rc3-2-desktop.
2. Since the very last word from the last screenshot I posted came from ehci I tried a custom kernel with ata_piix and ehci (usb) disabled, but the system froze just as well.
Comment 14 Tejun Heo 2010-04-13 22:07:25 UTC
Can you please try the following?

1. Comment out "core_initcall(async_init)" in kernel/async.c and see whether the hang is reproducible.

2. If so, specify initcall_debug and see where it hangs.

Thanks.
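
A minimal sketch of how initcall_debug output can be scanned for the initcall that was entered but never returned. It assumes the log is available as text (on a hang it would have to be transcribed from a screenshot or captured over a serial console) and that the lines have the same shape as the "calling ..." and "initcall ... returned ..." lines in the dmesg excerpts in this report. The helper name unfinished_initcalls is hypothetical.

```python
import re

# Line shapes taken from the initcall_debug output quoted in this report:
#   calling  piix_init+0x0/0x24 @ 1
#   initcall ahci_init+0x0/0x16 returned 0 after 1130224 usecs
CALLING = re.compile(r"calling\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+")
RETURNED = re.compile(r"initcall\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+\s+returned")


def unfinished_initcalls(log_text):
    """Return initcalls that were started but never logged a return, in call order."""
    pending = []
    for line in log_text.splitlines():
        returned = RETURNED.search(line)
        if returned:
            name = returned.group(1)
            # Ignore returns whose "calling" line is not part of the excerpt.
            if name in pending:
                pending.remove(name)
            continue
        called = CALLING.search(line)
        if called:
            pending.append(called.group(1))
    return pending


sample = """\
[    2.125948] initcall ahci_init+0x0/0x16 returned 0 after 1130224 usecs
[    2.126032] calling  piix_init+0x0/0x24 @ 1
[    2.126121] ata_piix 0000:00:1f.1: version 2.13
"""
print(unfinished_initcalls(sample))  # ['piix_init']
```

If the log ends during a hang, the last name in the returned list is the initcall that was running at that point.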
Comment 15 Stefan Krause 2010-04-14 20:18:54 UTC
Created attachment 26006 [details]
screenshot with initcall_debug and core_initcall(async_init) commented out
Comment 16 Stefan Krause 2010-04-14 20:20:26 UTC
Hi Tejun, thanks for your reply.

It hung with the last line, core_initcall(...), in async.c commented out.
A screenshot of a hang with the line commented out and initcall_debug passed as kernel option is attached.
Comment 17 Tejun Heo 2010-04-14 23:19:25 UTC
Hmm... it dies inside piix_init_one().  Can you please apply the first debug patch and see where it dies?
Comment 18 Stefan Krause 2010-04-15 19:10:21 UTC
Yes that's strange. It seems to crash now in the "ata_pci_sff_activate_host" call - I'll attach the screenshot (It does not print the "XXX after ata_pci_sff_activate_host" message I added to your patch.)
This is really odd because before I removed the core_initcall call it froze after printing the message!?

It took me 7 attempts to crash. Here's the dmesg for a successful boot:

[    2.028021] async_waiting @ 1
[    2.028100] async_continuing @ 1 after 1 usec
[    2.028241] scsi 2:0:0:0: Direct-Access     ATA      TOSHIBA MK3252GS LV01 PQ: 0 ANSI: 5
[    2.028408] sd 2:0:0:0: [sdb] 625142448 512-byte logical blocks: (320 GB/298 GiB)
[    2.028554] sd 2:0:0:0: [sdb] Write Protect is off
[    2.028634] sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    2.028656] sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.028870]  sdb: sdb1 sdb2 < sdb5 sdb6 sdb7 > sdb3
[    2.125730] sd 2:0:0:0: [sdb] Attached SCSI disk
[    2.125844] sd 2:0:0:0: Attached scsi generic sg1 type 0
[    2.125948] initcall ahci_init+0x0/0x16 returned 0 after 1130224 usecs
[    2.126032] calling  piix_init+0x0/0x24 @ 1
[    2.126121] ata_piix 0000:00:1f.1: version 2.13
[    2.126124] ata_piix 0000:00:1f.1: XXX enabling
[    2.126207] ata_piix 0000:00:1f.1: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    2.126289] ata_piix 0000:00:1f.1: XXX enabled
[    2.126367] ata_piix 0000:00:1f.1: XXX caching iocfg
[    2.126447] ata_piix 0000:00:1f.1: XXX cached iocfg
[    2.126527] ata_piix 0000:00:1f.1: XXX prepping host
[    2.126627] ata_piix 0000:00:1f.1: XXX preped host
[    2.126707] ata_piix 0000:00:1f.1: XXX applying bit18 quirk
[    2.126788] ata_piix 0000:00:1f.1: XXX applied bit18 quirk
[    2.126869] ata_piix 0000:00:1f.1: XXX intx'ing
[    2.126949] ata_piix 0000:00:1f.1: XXX intx'ed
[    2.127030] ata_piix 0000:00:1f.1: XXX setting master
[    2.127113] ata_piix 0000:00:1f.1: setting latency timer to 64
[    2.127117] ata_piix 0000:00:1f.1: XXX master set, activating
[    2.127274] scsi3 : ata_piix
[    2.127399] scsi4 : ata_piix
[    2.127504] ata4: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x70a0 irq 14
[    2.127587] ata5: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x70a8 irq 15
[    2.288377] ata4.00: ATAPI: Optiarc BD ROM BC-5500A, 1.86, max MWDMA2
[    2.304275] ata4.00: configured for MWDMA2
[    2.304575] async_waiting @ 1
[    2.304654] async_continuing @ 1 after 1 usec
[    2.305681] scsi 3:0:0:0: CD-ROM            Optiarc  BD ROM BC-5500A  1.86 PQ: 0 ANSI: 5
[    2.308675] sr0: scsi3-mmc drive: 16x/16x writer dvd-ram cd/rw xa/form2 cdda tray
[    2.308772] Uniform CD-ROM driver Revision: 3.20
[    2.308926] sr 3:0:0:0: Attached scsi CD-ROM sr0
[    2.308960] sr 3:0:0:0: Attached scsi generic sg2 type 5
[    2.475165] async_waiting @ 1
[    2.475244] async_continuing @ 1 after 1 usec
Comment 19 Stefan Krause 2010-04-15 19:11:24 UTC
Created attachment 26021 [details]
Screenshot patch + initcall_debug + no core_initcall(async_init)
Comment 20 Tejun Heo 2010-04-18 22:48:16 UTC
Created attachment 26044 [details]
activate-debug.patch

Can you apply this on top and see where it dies?  Also, does it always die at the same place?
Comment 21 Stefan Krause 2010-04-19 19:42:04 UTC
Hi Tejun,

I'm afraid I can't reproduce the issue with the patch applied. I noticed this earlier - adding debug info to the kernel appeared to lower the chance of a crash. Appears to be some kind of timing issue / race condition.

I booted 15 times successfully with the patches applied and 5 times with the patches applied, but without initcall_debug.

I'll run another test tomorrow.
Comment 22 Tejun Heo 2010-04-19 22:16:34 UTC
You can remove most of printks and just move a couple of printks around to locate where it fails.  Hmm... I'm afraid this could be pretty tricky to track down.  Can you please also try "libata.noacpi=1"?
Comment 23 Stefan Krause 2010-04-25 15:35:33 UTC
Hi Tejun,

libata.noacpi=1 makes no difference (though completely disabling acpi seems to help)
When I added some printks to libata-sff it crashed, but the output contains no useful information for me (I'll attach a screenshot)
With libata-core patched it doesn't crash anymore.

I'm afraid to say that I don't feel I should invest much more time compiling kernels and booting my laptop - there are more useful things to do (even  under windows ;-).
Comment 24 Stefan Krause 2010-04-25 15:36:57 UTC
Created attachment 26133 [details]
crash with libata-sff patched
Comment 25 Tejun Heo 2010-04-25 15:45:17 UTC
Oh well, nobody can force you to do anything.  Hmmmm, this definitely looks like something at a lower level than libata is causing the problem.  Can you please attach the output of dmidecode?  I'll ping HP and see whether they can shed some light.

Thanks.
Comment 26 Stefan Krause 2010-04-25 18:16:45 UTC
Created attachment 26136 [details]
dmidecode

Thanks Tejun - would be really nice if you could contact HP.
(And please don't get me wrong: I absolutely appreciate your help and I'll keep an eye on that bug even though I can't invest too much time...)
Comment 27 Tejun Heo 2010-04-28 15:06:48 UTC
Heh... the dmi data doesn't contain the model name.  What's printed on the machine?
Comment 28 Stefan Krause 2010-04-28 18:24:29 UTC
;-) Possible that they couldn't decide. It's an HP Pavilion HDX. According to the HP support site the exact model name is 9480eg, but the HP stamp on the bottom says just "hdx 9300".
Comment 29 Tejun Heo 2010-04-28 20:31:20 UTC
Alright, I pinged HP.  Let's hope it gets their attention.  Thanks.  :-)
Comment 30 Tejun Heo 2010-05-05 15:24:20 UTC
Stefan, one more thing.  You reported that ubuntu 2.6.32-16 didn't have this problem, right?  How confident are you about that?  ie. Is there any possibility that the sporadic nature of the problem just made such impression while the problem was still there?  Also, is it possible to test vanilla 2.6.32?

Thanks.
Comment 31 Stefan Krause 2010-05-05 18:43:34 UTC
Hi Tejun, I'm pretty sure. I certainly booted more than 100 times and I can't remember any crash. Starting with 2.6.32-17 the probability of a crash was about 50%, so the difference is significant to me. But as I said before, this is only true for the 32 bit version. I had issues with the 64 bit version for much longer (Ubuntu 9.10 64 bit), so the problem might indeed have been lurking there for longer (I guess some timing issue or race condition). I'll check vanilla 2.6.32 32 bit in the next few days.
Comment 32 Stefan Krause 2010-05-13 17:47:46 UTC
Hi Tejun, I already knew it was strange, but this afternoon was really bizarre.

Here are the results of other vanilla kernel versions (all used the attached config "config used for 2.6.33.2 with ata_piix" and were built with make oldconfig + fakeroot make-kpkg):
2.6.32.13: Crash
2.6.32.1: Crash
2.6.31.13: Crash
2.6.31.7: Crash
2.6.31.4: Crash
2.6.31.2: Crash
2.6.31.1: No Crash ( booted 14 times successfully)

I double checked that 2.6.31.2 really freezes (it did so 3 times). At the same time I confirmed that Ubuntu kernel 2.6.23-16 did not freeze (booted today 16 times without problems).

All in all I'm really puzzled now.
Comment 33 Stefan Krause 2010-05-13 17:49:26 UTC
2.6.23-16 should read 2.6.32-16, sorry for that.
BTW, the chance of a crash was again about 50%.
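
A rough way to judge whether a run of clean boots could be plain luck. This is a sketch under the assumption that each boot freezes independently with the same probability (the roughly 50% reported above); the independence is an assumption, not something established in this report, and prob_all_clean is a hypothetical helper.

```python
def prob_all_clean(boots, p_freeze):
    """Probability that `boots` independent boots all succeed."""
    return (1.0 - p_freeze) ** boots


# 14 clean boots of 2.6.31.1: about 6.1e-05, i.e. 1 in 16384
print(prob_all_clean(14, 0.5))
# 16 clean boots of Ubuntu 2.6.32-16: about 1.5e-05, i.e. 1 in 65536
print(prob_all_clean(16, 0.5))
```

Under that assumption, runs of this length on an affected kernel would be very unlikely, which supports treating those two kernels as unaffected.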
Comment 34 Tejun Heo 2010-05-17 06:33:22 UTC
Created attachment 26405 [details]
ahci-no-intx.patch

Thanks a lot for testing.  So, it's something introduced between 2.6.31.1 and 2.  Going through the commits, okay, there are two libata commits.

 pata_amd: do not filter out valid modes in nv_mode_filter
 ahci: restore pci_intx() handling

and I can't find any suspicious looking x86 or pci changes, so the ahci commit seems most likely.  Strange.  That commit was to revert a change which caused problems on rare configurations.  Anyways, can you please try the attached patch on top of 2.6.32.13 and see whether it changes anything?

Thank you.
Comment 35 Stefan Krause 2010-05-17 19:43:41 UTC
Thank you!
I applied the patch to 2.6.32.13, double checked it's there in the source code, built a new kernel package, but I'm afraid to say it crashed.
Comment 36 Tejun Heo 2010-05-18 08:18:57 UTC
Hmmm... so it crashed the same way?  Strange, that was the only seemingly relevant change between 2.6.31.1 and 2.  Can you please bisect between 31.1 and 2?  There aren't too many between the two, so bisection won't take too long.

Thanks.
Comment 37 Stefan Krause 2010-05-24 17:21:31 UTC
The result of the bisection was:
[c9aac6645fcf788ba660ba456ec0cf9fd2f6f0f8] kbuild: fix cc1 options check to ensure we do not use -fPIC when compiling

I don't think this can be the reason. With that commit removed from 2.6.31.2 it still freezes.
I'll try another bisect when I find some spare time (since the hang only occurs with some probability, I may have booted one of the bisect steps too few times before marking it good).
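
A sketch of how many clean boots a bisect step would need before it is marked good, assuming boots are independent and an affected kernel freezes with a fixed probability (about 50% in this report). Both the independence and the helper name boots_needed are assumptions made for illustration.

```python
def boots_needed(p_freeze, max_false_good):
    """Smallest number of clean boots n such that an affected kernel would
    survive all n boots with probability at most `max_false_good`."""
    if not 0.0 < p_freeze < 1.0:
        raise ValueError("p_freeze must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    boots = 1
    while (1.0 - p_freeze) ** boots > max_false_good:
        boots += 1
    return boots


print(boots_needed(0.5, 0.01))   # 7
print(boots_needed(0.5, 0.001))  # 10
```

A single freeze is enough to mark a step bad, so only the "good" verdicts need this many boots. With fewer boots per step, one wrong "good" can send the bisection to an unrelated commit.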

During bisection I also saw it freeze after some EHCI messages and not just after ATA_PIIX (something like ehci_hcd 0000:00:1d.7: PCI INT A -> GSI 20 (level, low) -> IRQ 20, but I didn't take a screenshot).
Comment 38 Tejun Heo 2010-05-25 08:02:32 UTC
Yeah, the thing is that I couldn't spot anything which looks too suspicious in the commits for the stable release.  I wish I could.  :-(

So, please give one more shot on bisection.  Thank you very much.
Comment 39 ykzhao 2010-06-12 06:42:50 UTC
*** Bug 14244 has been marked as a duplicate of this bug. ***
Comment 40 Stefan Krause 2010-12-08 21:07:48 UTC
Tejun, just to let you know: I bought a new laptop so I've given up on this bug. Anyways thanks for your support in this issue.
Comment 41 Tejun Heo 2010-12-09 09:27:08 UTC
Eh, it's too bad that the problem couldn't be root caused.  I'm closing the bug.  Please feel free to reopen if necessary.

Thank you.
