Bug 10860

Summary: total system freeze at boot with 2.6.26-rc
Product: IO/Storage Reporter: Christian Casteyde (casteyde.christian)
Component: Serial ATA Assignee: Jeff Garzik (jgarzik)
Status: CLOSED CODE_FIX    
Severity: blocking CC: bunk
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.26-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 10492    
Attachments: dmesg for working 2.6.25.4
lspci for working 2.6.25.4
lsusb for working 2.6.25.4
sata_uli-no-hrst.patch
dmesg log for 2.6.26-rc6 + reset patch for uli

Description Christian Casteyde 2008-06-05 12:38:02 UTC
Latest working kernel version:2.6.25.4
Earliest failing kernel version:2.6.24-rc4
It also fails with -rc5, which I was waiting for; earlier rcs were not tested.
Distribution: Bluewhite64 (64 bits slackware)
Hardware Environment: Athlon64 X2 / Ali 1689 north 1563 south
+ bt848 v4l + PATA and SATA disk (SATA on sata_uli)
Seems to be related to sata disk detection (see below).

Software Environment:
none

Problem Description:
The computer freezes completely at boot after SATA disk detection.
I am attaching the 2.6.25.4 dmesg, lspci and lsusb output.
The CPU is not at 100% (otherwise I could hear the fan going crazy), and the keyboard is dead (unable to scroll the console up or down). Nothing happens.

The console shows something similar to the 2.6.25.4 logs, but hangs after these lines:
hda: cache flushes not supported
 hda: hda1 hda2
hdd: ATAPI 40X DVD-ROM DVD-R-RAM CD-R/RW drive, 2048kB Cache
Uniform CD-ROM driver Revision: 3.20
Driver 'sd' needs updating - please use bus_type methods
sata_uli 0000:00:0e.1: version 1.3
ACPI: PCI Interrupt 0000:00:0e.1[A] -> GSI 19 (level, low) -> IRQ 19
scsi0 : sata_uli
scsi1 : sata_uli
ata1: SATA max UDMA/133 cmd 0xf80 ctl 0xf00 bmdma 0xd880 irq 19
ata2: SATA max UDMA/133 cmd 0xe80 ctl 0xe00 bmdma 0xd888 irq 19
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

<-- here, 2.6.24 hangs -->

normal way for 2.6.25 :

ata1.00: ATA-7: ST3200826AS, 3.06, max UDMA/133
ata1.00: 390721968 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata1.00: configured for UDMA/133
ata2: SATA link down (SStatus 0 SControl 300)
scsi 0:0:0:0: Direct-Access     ATA      ST3200826AS      3.06 PQ: 0 ANSI: 5
sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00

Steps to reproduce:
Boot my PC :-)
I guess any sata_uli device may hang the same way?
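
A note for readers decoding the link-status line in the log above: the SStatus and SControl values printed by libata are hexadecimal dumps of the SATA status and control registers. The following is a minimal sketch, assuming the standard SATA register layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8). The function name `decode_scr` is made up for illustration and is not part of any kernel tool.

```python
def decode_scr(value_hex):
    """Split a SATA SStatus/SControl value (hex string) into its fields."""
    value = int(value_hex, 16)
    return {
        "DET": value & 0xF,         # device detection (bits 3:0)
        "SPD": (value >> 4) & 0xF,  # negotiated / allowed speed (bits 7:4)
        "IPM": (value >> 8) & 0xF,  # interface power management (bits 11:8)
    }


# Values taken from the "ata1: SATA link up" line in this report.
print("SStatus ", decode_scr("113"))
print("SControl", decode_scr("300"))
```

For SStatus 113 this yields DET=3 (device detected and PHY communication established), SPD=1 (1.5 Gbps) and IPM=1 (interface active), which agrees with the "link up 1.5 Gbps" message. In other words, the link itself came up and the freeze happened afterwards.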
Comment 1 Christian Casteyde 2008-06-05 12:38:54 UTC
Created attachment 16405 [details]
dmesg for working 2.6.25.4
Comment 2 Christian Casteyde 2008-06-05 12:39:18 UTC
Created attachment 16406 [details]
lspci for working 2.6.25.4
Comment 3 Christian Casteyde 2008-06-05 12:39:37 UTC
Created attachment 16407 [details]
lsusb for working 2.6.25.4
Comment 4 Anonymous Emailer 2008-06-05 12:50:05 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu,  5 Jun 2008 12:38:02 -0700 (PDT)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10860
> 
>            Summary: total system freeze at boot with 2.6.26-rc
>            Product: Other
>            Version: 2.5
>      KernelVersion: 2.6.26-rc4
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: blocking
>           Priority: P1
>          Component: Other
>         AssignedTo: other_other@kernel-bugs.osdl.org
>         ReportedBy: casteyde.christian@free.fr
> 
> 
> Latest working kernel version:2.6.25.4
> Earliest failing kernel version:2.6.24-rc4

(should have been 2.6.26-rc4)

> Steps to reproduce:
> Boot my PC :-)
> I guess any sata_uli device may hang the same way?

yup, I'd say that the ata code (or something very nearby) killed your
box.

This is a post-2.6.25 regression.
Comment 5 Rafael J. Wysocki 2008-06-05 15:17:34 UTC
This entry is being used for tracking a regression from 2.6.25.  Please don't
close it until the problem is fixed in the mainline.
Comment 6 Christian Casteyde 2008-06-08 15:27:04 UTC
Of course, I got:
HARDWARE ERROR
This is not a software etc..

which is plainly false.
mcelog --ascii gives nothing.
I didn't get any more information...

(For information, I ran memtest several times and never got an error. I will check with another SATA cable tomorrow, but I do not believe this error at all. Maybe some bad register values somewhere cause the chipset to misbehave?)
Comment 7 Christian Casteyde 2008-06-14 23:28:29 UTC
For info, -rc6 still fails. I tried a small bisect, and I get the failure as early as -rc2. -rc1 simply doesn't boot (it fails just after LILO, without any message). I suspect the problem was introduced in -rc1.
While I was bisecting, the NMI didn't trigger every time; sometimes the computer just hung silently.
Comment 8 Tejun Heo 2008-06-15 19:01:02 UTC
You mean NMI watchdog?  So, in some cases, NMI watchdog is triggered?  What does it spit out?
Comment 9 Christian Casteyde 2008-06-16 11:45:49 UTC
Sorry, I've just noticed that, due to comment #4, some information is missing from this bug report.
Somebody told me to try nmi_watchdog to get more info once the computer is blocked. So I added this option, and the result is that I now get "Hardware error". This error never occurs if I don't add the "nmi_watchdog" option.
So whatever the watchdog is supposed to do, it apparently cannot do it because of another severe error.
Comment 10 Tejun Heo 2008-06-16 21:45:51 UTC
Hmmm... between 2.6.25.4 and now, sata_uli hasn't really changed.  The hardware handling should be almost identical, although the core layer has seen some changes.  I have no idea what could have caused this difference.  Any chance you can try bisecting it?
Comment 11 Christian Casteyde 2008-06-19 12:17:27 UTC
Well, I tried booting -rc1 again, and it does indeed also hang at the same point.
I bisected down to 2.6.25-git1, and it also fails.
However, I managed to see some stack dumps very early at boot. They scroll by very fast and I haven't managed to read them. I'm also not sure whether this is already fixed in a later -rc, so I'll redo the test, but it may explain the system freeze.
I don't know how to bisect sub-git patches, since I get all of them from kernel.org.
So the problem was indeed introduced very early. I'll also try delaying each printk, but I'll have to disable multicore for that, which may also hide the problem.
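
For readers unfamiliar with bisection: the idea is a binary search over an ordered commit history, testing the midpoint each time and keeping the half that still contains the good-to-bad transition. With a git clone of the kernel tree, `git bisect` automates this search. The sketch below only illustrates the search itself; the commit names and the `hangs_at_boot` test are hypothetical stand-ins for "build this kernel, boot it, and see whether it freezes".

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the newest one is bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid       # transition is at mid or earlier
        else:
            lo = mid + 1   # transition is after mid
    return commits[lo]


# Hypothetical history of 1000 commits where "c137" introduced the hang.
commits = [f"c{i}" for i in range(1, 1001)]
tested = []


def hangs_at_boot(commit):
    tested.append(commit)  # each call stands for one build-and-boot cycle
    return int(commit[1:]) >= 137


print(first_bad(commits, hangs_at_boot), "found after", len(tested), "boots")
```

Each test halves the remaining range, so about ten build-and-boot cycles are enough for a thousand commits, far fewer than testing them one by one.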
Comment 12 Christian Casteyde 2008-06-19 13:53:57 UTC
Well, single core -> same result.
-rc6 doesn't have the stack dumps, so that was another bug in -rc1 that was fixed later. The freeze is already present in 25-git1. I don't know what to do to go further: there are so many modified files, and today it took me the whole evening. I'm tired.
Comment 13 Tejun Heo 2008-06-19 18:36:04 UTC
Created attachment 16556 [details]
sata_uli-no-hrst.patch

I misread the bug hang trace (you have 2.6.24 and 25 switched there, right?).  If it hangs while resetting, this patch might fix the problem.  Can you please try it?
Comment 14 Christian Casteyde 2008-06-20 09:58:30 UTC
No, 2.6.25.* works.
It is the pre-2.6.26 kernels that hang, as early as 2.6.25-git1 (26-rc1 was released after 25-git20).
I'll check the patch tonight, but as -git1 didn't change much in uli, I think there is a core problem somewhere. Maybe there is now a reset that uli cannot handle and that was not present before?
Comment 15 Tejun Heo 2008-06-20 10:12:08 UTC
So, you mean that 2.6.26-rcX hangs after the first link up message while 2.6.25 works fine and continues to print out the scsi messages, right?  Please test the patch.  It should help.
Comment 16 Christian Casteyde 2008-06-20 10:27:58 UTC
OK, it boots with your patch. Thanks a lot :-)
I am attaching the dmesg log for 2.6.26, just in case you want to check the log to confirm that this was the only problem with sata_uli...
Great job!
Comment 17 Christian Casteyde 2008-06-20 10:28:35 UTC
Created attachment 16562 [details]
dmesg log for 2.6.26-rc6 + reset patch for uli
Comment 18 Christian Casteyde 2008-06-20 10:50:29 UTC
Hmm, btw, comparing both dmesg logs, I noticed that lapic is also broken on 2.6.26.
It breaks high resolution mode. Not critical in fact, but should I report another bug?
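
Comparing two dmesg logs by eye is error-prone, mostly because timestamps differ on every line. A minimal sketch of one way to do it, assuming each line may start with a `[   12.345678]`-style timestamp (the logs quoted in this report have none, in which case the stripping step is a no-op). The function names and the sample lines are hypothetical.

```python
import difflib
import re

# Leading "[   12.345678] " timestamp as printed when printk timestamps are on.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(lines):
    """Drop trailing newlines and leading timestamps."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def dmesg_diff(old_lines, new_lines):
    """Return only the lines that were removed (-) or added (+)."""
    diff = difflib.unified_diff(
        normalize(old_lines), normalize(new_lines), lineterm="", n=0
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Made-up sample lines, not taken from the attached logs.
old = ["[    1.000000] same line", "[    2.000000] only in old"]
new = ["[    1.500000] same line", "[    2.500000] only in new"]
for line in dmesg_diff(old, new):
    print(line)
```

On real logs the input would come from the two saved dmesg files, read with `open(path).readlines()`.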
Comment 19 Tejun Heo 2008-06-20 16:56:34 UTC
I don't know much about lapic.  You'll need to file a separate bug report.  I'll forward the sata_uli patch upstream.  Thanks.
Comment 20 Adrian Bunk 2008-06-22 02:32:36 UTC
Handled-By      : Tejun Heo <htejun@gmail.com>
Patch           : http://bugzilla.kernel.org/attachment.cgi?id=16556
Comment 21 Adrian Bunk 2008-07-06 08:55:51 UTC
fixed by commit 70a3143af87c6ca188107cbd49ab5eec2c86c456