Bug 10860
Summary: | total system freeze at boot with 2.6.26-rc | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Christian Casteyde (casteyde.christian) |
Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
Status: | CLOSED CODE_FIX | ||
Severity: | blocking | CC: | bunk |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.26-rc2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 10492 | ||
Attachments: |
dmesg for working 2.6.25.4
lspci for working 2.6.25.4
lsusb for working 2.6.25.4
sata_uli-no-hrst.patch
dmesg log for 2.6.26-rc6 + reset patch for uli |
Description
Christian Casteyde
2008-06-05 12:38:02 UTC
Created attachment 16405 [details]
dmesg for working 2.6.25.4
Created attachment 16406 [details]
lspci for working 2.6.25.4
Created attachment 16407 [details]
lsusb for working 2.6.25.4
Reply-To: akpm@linux-foundation.org

(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Thu, 5 Jun 2008 12:38:02 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10860
>
> Summary: total system freeze at boot with 2.6.26-rc
> Product: Other
> Version: 2.5
> KernelVersion: 2.6.26-rc4
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: blocking
> Priority: P1
> Component: Other
> AssignedTo: other_other@kernel-bugs.osdl.org
> ReportedBy: casteyde.christian@free.fr
>
>
> Latest working kernel version:2.6.25.4
> Earliest failing kernel version:2.6.24-rc4 (should have been 2.6.26-rc4)
> also fails with -rc5, which I was waiting for; previous rcs not tested
> Distribution: Bluewhite64 (64 bits slackware)
> Hardware Environment: Athlon64 X2 / Ali 1689 north 1563 south
> + bt848 v4l + PATA and SATA disk (SATA on sata_uli)
> Seems to be related to sata disk detection (see below).
>
> Software Environment:
> none
>
> Problem Description:
> the computer freezes totally at boot after SATA disk detection.
> I append 2.6.25.4 dmesg, lspci and lsusb.
> The CPU is not at 100% (otherwise I could hear the fan going crazy), the keyboard
> is dead (unable to scroll the console up or down). Nothing happens.
>
> The console shows something similar to the 2.6.25.4 logs, but hangs after
> those lines:
> hda: cache flushes not supported
> hda: hda1 hda2
> hdd: ATAPI 40X DVD-ROM DVD-R-RAM CD-R/RW drive, 2048kB Cache
> Uniform CD-ROM driver Revision: 3.20
> Driver 'sd' needs updating - please use bus_type methods
> sata_uli 0000:00:0e.1: version 1.3
> ACPI: PCI Interrupt 0000:00:0e.1[A] -> GSI 19 (level, low) -> IRQ 19
> scsi0 : sata_uli
> scsi1 : sata_uli
> ata1: SATA max UDMA/133 cmd 0xf80 ctl 0xf00 bmdma 0xd880 irq 19
> ata2: SATA max UDMA/133 cmd 0xe80 ctl 0xe00 bmdma 0xd888 irq 19
> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>
> <-- here, 2.6.24 hangs -->
>
> normal way for 2.6.25 :
>
> ata1.00: ATA-7: ST3200826AS, 3.06, max UDMA/133
> ata1.00: 390721968 sectors, multi 16: LBA48 NCQ (depth 0/32)
> ata1.00: configured for UDMA/133
> ata2: SATA link down (SStatus 0 SControl 300)
> scsi 0:0:0:0: Direct-Access ATA ST3200826AS 3.06 PQ: 0 ANSI: 5
> sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
> sd 0:0:0:0: [sda] Write Protect is off
> sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
>
> Steps to reproduce:
> Boot my PC :-)
> I guess any sata_uli device may hang the same way?

yup, I'd say that the ata code (or something very nearby) killed your box. This is a post-2.6.25 regression.

This entry is being used for tracking a regression from 2.6.25. Please don't close it until the problem is fixed in the mainline.

Of course, I got: HARDWARE ERROR This is not a software etc., which is plainly false. mcelog --ascii gives nothing. I didn't get more information... (for information, I ran memtest several times and never got an error. I will check with another SATA cable tomorrow, but I do not believe this error at all. Maybe some bad registers somewhere that cause the chipset to misbehave?).

For info, -rc6 still fails. I've tried a small bisect, and I got a failure as early as -rc2. -rc1 simply doesn't boot (it fails just after LILO, and I don't get any message).
I suspect the problem was introduced in -rc1. When I was bisecting, the NMI didn't trigger every time; sometimes the computer was silently blocked.

You mean NMI watchdog? So, in some cases, NMI watchdog is triggered? What does it spit out?

Sorry, I've just noticed that due to comment #4, some info is missing from this bug track. Somebody told me to try nmi_watchdog to get more info once the computer is blocked. So I added this option, and the result is that now I get "Hardware error". This error never occurs if I don't add the "nmi_watchdog" option. So whatever the watchdog should do, it's apparently unable to do it due to another severe error.

Hmmm... between 2.6.25.4 and now, sata_uli hasn't really changed. The hardware handling should be almost identical although the core layer has seen some changes. I have no idea what could have caused this difference. Any chance you can try bisecting it?

Well, I retried booting -rc1, and it did indeed also block at the same point. I bisected down to 2.6.25-git1, and it also fails. However, I managed to see some stack dumps very early at boot. They scroll by very fast and I haven't managed to read them. I'm not sure whether that is fixed in later -rcs, so I'll redo the test, but it may explain the system freeze. I don't know how to bisect sub-git patches, since I get all of them from kernel.org. So the problem was indeed introduced very early. I'll also try delaying each printk, but I'll have to disable multicore then, which may also hide the problem.

well, single core -> same result

-rc6 doesn't have the stack dumps, so it was another bug in -rc1 that was fixed later. The freeze resides in 25-git1. I do not know what to do to go further, there are so many files modified - and today it took me the whole evening, I'm tired.

Created attachment 16556 [details]
sata_uli-no-hrst.patch
I misread the bug hang trace (you have 2.6.24 and 25 switched there, right?). If it hangs while resetting, this patch might fix the problem. Can you please try it?
No, 2.6.25.* works. It is pre-2.6.26 that hangs, as early as 2.6.25-git1 (26-rc1 was issued after 25-git20). I'll check the patch tonight, but as -git1 didn't change much on uli, I think there is a core problem somewhere. Maybe there is a reset now that uli could not handle and that was not present before?

So, you mean that 2.6.26-rcX hangs after the first link up message while 2.6.25 works fine and continues to print out the scsi messages, right? Please test the patch. It should help.

OK, it boots with your patch. Thanks a lot :-) I append the dmesg log for 2.6.26, just in case you want to see any log that could confirm that was the only problem with sata_uli... Great job!

Created attachment 16562 [details]
dmesg log for 2.6.26-rc6 + reset patch for uli
hmm btw, comparing both dmesg, I noticed lapic is also broken on 2.6.26. It breaks high resolution mode, not critical in fact, but should I report another bug?

I don't have much idea about lapic. You'll need to file a separate bug report. I'll forward the sata_uli patch upstream. Thanks.

Handled-By : Tejun Heo <htejun@gmail.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=16556

fixed by commit 70a3143af87c6ca188107cbd49ab5eec2c86c456
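
For anyone comparing logs the same way as in this report (last line on the frozen console versus the dmesg of a working kernel), here is a minimal Python sketch. It is not part of the original thread: the helper names `strip_timestamp` and `lines_after` are hypothetical, and the timestamps in the sample are made up for illustration. Only the log text itself is copied from the report. It shows what the working kernel printed right after the point where the failing one stopped.

```python
import re

# Matches the optional "[   12.345678] " timestamp prefix that printk can add.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamp(line: str) -> str:
    """Return a dmesg line without its leading timestamp or trailing whitespace."""
    return _TIMESTAMP.sub("", line).rstrip()


def lines_after(working_log: list[str], last_seen: str, count: int = 3) -> list[str]:
    """Return up to `count` lines that follow `last_seen` in the working log.

    `last_seen` is the last line visible on the frozen console. An empty
    list means that line was not found in the working log.
    """
    cleaned = [strip_timestamp(line) for line in working_log]
    target = strip_timestamp(last_seen)
    for index, line in enumerate(cleaned):
        if line == target:
            return cleaned[index + 1 : index + 1 + count]
    return []


if __name__ == "__main__":
    # Log text copied from the report; the timestamps are hypothetical.
    working = [
        "[    2.100000] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)",
        "[    2.300000] ata1.00: ATA-7: ST3200826AS, 3.06, max UDMA/133",
        "[    2.300100] ata1.00: 390721968 sectors, multi 16: LBA48 NCQ (depth 0/32)",
        "[    2.310000] ata1.00: configured for UDMA/133",
    ]
    frozen_at = "ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)"
    for line in lines_after(working, frozen_at):
        print(line)
```

With a real log, `working` could be read from the attached dmesg file with `open(path).read().splitlines()`. The first line returned is the step the failing kernel never reached, which in this report is the device identification that follows the link-up message.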