Most recent kernel where this bug did *NOT* occur:
Distribution: FC6
Hardware Environment: x86_64, Intel Core 2 Duo, i945 PM, Silicon Image SiI 3132 on PCI-E
Software Environment:
Problem Description:

Hi,

recently I got my hands on an ASUS A8Js notebook (Core 2 Duo T7200, Intel 945 PM PCI-E chipset; for details see the attached log). After booting the latest 2.6.20-rc5-git3 kernel (the same behaviour also occurs with 2.6.19.2; I didn't try any other), some strange behaviour can be observed. At first the following messages appear in the log:

...
[ 40.303154] PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
[ 40.303157] PCI: Not using MMCONFIG.
(not sure whether this is really a problem)
...
[ 40.400899] PCI: Using ACPI for IRQ routing
[ 40.400902] PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
[ 40.400911] PCI: Cannot allocate resource region 7 of bridge 0000:00:1c.3
[ 40.400914] PCI: Cannot allocate resource region 8 of bridge 0000:00:1c.3
[ 40.400917] PCI: Cannot allocate resource region 9 of bridge 0000:00:1c.3

and if I have the Kouwell 5652E SATA-II ExpressCard (with a Silicon Image SiI 3132 controller) connected to the notebook's ExpressCard slot, the following messages also appear after the ones above:

[ 40.400999] PCI: Cannot allocate resource region 0 of device 0000:04:00.0
[ 40.401003] PCI: Cannot allocate resource region 2 of device 0000:04:00.0
[ 40.401006] PCI: Cannot allocate resource region 4 of device 0000:04:00.0
...

where

00:1c.3 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 4 (rev 02)
04:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)

Both SATA controllers (the on-board one using ata_piix and the ExpressCard one using sata_sil24) are then detected:

...
[ 40.639495] ata_piix 0000:00:1f.2: version 2.00ac7
[ 40.639500] ata_piix 0000:00:1f.2: MAP [ P0 P2 IDE IDE ]
[ 40.640063] ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19
[ 40.640763] PCI: Setting latency timer of device 0000:00:1f.2 to 64
[ 40.640814] ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0xFFA0 irq 14
[ 40.641467] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xFFA8 irq 15
[ 40.642114] scsi0 : ata_piix
[ 40.853499] ata1.00: ATA-7, max UDMA/100, 234441648 sectors: LBA48 NCQ (depth 0/32)
[ 40.854188] ata1.00: ata1: dev 0 multi count 16
[ 40.860490] ata1.00: configured for UDMA/100
[ 40.860881] scsi1 : ata_piix
[ 41.025049] ATA: abnormal status 0x7F on port 0x177
[ 41.045585] ATA: abnormal status 0x7F on port 0x177
[ 41.231326] ata2.01: ATAPI, max UDMA/33
[ 41.417213] ata2.01: configured for UDMA/33
[ 41.426449] scsi 0:0:0:0: Direct-Access ATA FUJITSU MHV2120B 0000 PQ: 0 ANSI: 5
[ 41.436554] SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB)
[ 41.447026] sda: Write Protect is off
[ 41.457669] sda: Mode Sense: 00 3a 00 00
[ 41.457693] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 41.469438] SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB)
[ 41.481481] sda: Write Protect is off
[ 41.493694] sda: Mode Sense: 00 3a 00 00
[ 41.493708] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 41.507017] sda: sda1 sda2 sda3 sda4
[ 41.604667] sd 0:0:0:0: Attached scsi disk sda
[ 41.622803] scsi 1:0:1:0: CD-ROM TSSTcorp CD/DVDW TS-L632D AS05 PQ: 0 ANSI: 5
...
[ 48.817403] sata_sil24 0000:04:00.0: version 0.3
[ 48.817432] ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 19 (level, low) -> IRQ 19
[ 48.817475] PCI: Setting latency timer of device 0000:04:00.0 to 64
[ 48.817531] ata3: SATA max UDMA/100 cmd 0xFFFFC20000040000 ctl 0x0 bmdma 0x0 irq 19
[ 48.817585] ata4: SATA max UDMA/100 cmd 0xFFFFC20000042000 ctl 0x0 bmdma 0x0 irq 19
[ 48.817593] scsi2 : sata_sil24
[ 49.068344] Bluetooth: HCI USB driver ver 2.9
[ 49.142279] usbcore: registered new interface driver hci_usb
[ 49.222098] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 49.222610] ata3.00: ATA-7, max UDMA/133, 156312576 sectors: LBA
[ 49.222612] ata3.00: ata3: dev 0 multi count 0
[ 49.223246] ata3.00: configured for UDMA/100
[ 49.223260] scsi3 : sata_sil24
[ 49.525900] ata4: SATA link down (SStatus 0 SControl 300)
[ 49.526898] scsi 2:0:0:0: Direct-Access ATA Maxtor 6Y080M0 YAR5 PQ: 0 ANSI: 5
[ 49.527093] SCSI device sdb: 156312576 512-byte hdwr sectors (80032 MB)
[ 49.527103] sdb: Write Protect is off
[ 49.527105] sdb: Mode Sense: 00 3a 00 00
[ 49.527269] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 49.527309] SCSI device sdb: 156312576 512-byte hdwr sectors (80032 MB)
[ 49.527317] sdb: Write Protect is off
[ 49.527319] sdb: Mode Sense: 00 3a 00 00
[ 49.527337] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 49.527341] sdb:<5>sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 49.536071] scsi 1:0:1:0: Attached scsi generic sg1 type 5
[ 49.552648] sdb1
[ 49.552700] sd 2:0:0:0: Attached scsi disk sdb
[ 49.552733] sd 2:0:0:0: Attached scsi generic sg2 type 0
...

However, strange messages similar to the following appear periodically for both controllers (possibly as a result of the system accessing the disk, though not on every access):

...
[ 193.907771] ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[ 193.927526] ata2.01: cmd a0/01:00:00:00:00/00:00:00:00:00/b0 tag 0 cdb 0x25 data 8 in
[ 193.927529] res 40/00:03:00:00:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
[ 201.007291] ata2: port is slow to respond, please be patient (Status 0xd0)
[ 223.974196] ata2: port failed to respond (30 secs, Status 0xd0)
[ 224.000456] ata2: soft resetting port
[ 224.163274] ATA: abnormal status 0x7F on port 0x177
[ 224.568225] ata2.01: configured for UDMA/33
[ 224.568237] ata2: EH complete
[ 231.828794] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
[ 231.849673] ata3.00: (irq_stat 0x00020002, device error via D2H FIS)
[ 231.871059] ata3.00: cmd c8/00:00:08:00:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 131072 in
[ 231.871061] res 51/84:00:08:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
[ 232.217493] ata3: soft resetting port
[ 232.318609] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 232.319782] ata3.00: configured for UDMA/100
[ 232.319792] ata3: EH complete
[ 232.340315] SCSI device sdb: 156312576 512-byte hdwr sectors (80032 MB)
[ 232.363118] sdb: Write Protect is off
[ 232.385795] sdb: Mode Sense: 00 3a 00 00
[ 232.387452] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 232.470053] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
[ 232.492477] ata3.00: (irq_stat 0x00020002, device error via D2H FIS)
[ 232.515254] ata3.00: cmd c8/00:00:f8:16:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 131072 in
[ 232.515256] res 51/84:00:f8:16:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
[ 232.864801] ata3: soft resetting port
[ 232.965875] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 232.967043] ata3.00: configured for UDMA/100
[ 232.967055] ata3: EH complete
[ 232.988447] SCSI device sdb: 156312576 512-byte hdwr sectors (80032 MB)
[ 233.012277] sdb: Write Protect is off
[ 233.035602] sdb: Mode Sense: 00 3a 00 00
[ 233.037536] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
...

This continues until, after some seconds of accessing the disk on the external controller, the whole disk is shut down:

...
[ 269.026999] ata3.00: limiting speed to UDMA/33
[ 269.058614] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 269.090354] ata3.00: (irq_stat 0x00020002, failed to transmit command FIS)
[ 269.122048] ata3.00: cmd c8/00:08:4f:00:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
[ 269.122051] res c8/00:08:4f:00:00/00:00:00:00:00/e0 Emask 0x12 (ATA bus error)
[ 269.184814] ata3: hard resetting port
[ 271.414531] ata3: softreset failed (port not ready)
[ 271.445584] ata3: follow-up softreset failed, retrying in 5 secs
[ 276.474431] ata3: hard resetting port
[ 278.805006] ata3: softreset failed (port not ready)
[ 278.835656] ata3: follow-up softreset failed, retrying in 5 secs
[ 283.864906] ata3: hard resetting port
[ 286.094543] ata3: softreset failed (port not ready)
[ 286.124723] ata3: reset failed, giving up
[ 286.154334] ata3.00: disabled
[ 286.183738] ata3: EH pending after completion, repeating EH (cnt=4)
[ 286.294424] ata3: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0x6 frozen
[ 286.324117] ata3: (irq_stat 0x00060002, failed to transmit command FIS)
[ 286.874068] ata3: waiting for device to spin up (8 secs)
[ 294.289524] ata3: hard resetting port
[ 296.519160] ata3: softreset failed (port not ready)
[ 296.519741] ata3: follow-up softreset failed, retrying in 5 secs
[ 301.518100] ata3: hard resetting port
[ 303.748733] ata3: softreset failed (port not ready)
[ 303.748738] ata3: follow-up softreset failed, retrying in 5 secs
[ 308.746700] ata3: hard resetting port
[ 311.077274] ata3: softreset failed (port not ready)
[ 311.077277] ata3: reset failed, giving up
[ 311.077284] sd 2:0:0:0: SCSI error: return code = 0x08000002
[ 311.077286] sdb: Current [descriptor]: sense key: Aborted Command
[ 311.077289] Additional sense: Scsi parity error
[ 311.077294] Descriptor sense data with sense descriptors (in hex):
[ 311.077296] 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 311.077302] 00 00 00 4f
[ 311.077304] end_request: I/O error, dev sdb, sector 79
[ 311.077314] ata3: EH complete
[ 311.077320] ata3.00: detaching (SCSI 2:0:0:0)
[ 311.077323] EXT3-fs: can't read group descriptor 1
[ 311.174529] Synchronizing SCSI cache for disk sdb:
[ 311.196553] FAILED
[ 311.196555] status = 0, message = 00, host = 4, driver = 00
[ 311.196556] <3>ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
...

I didn't manage to shut down the disk on the primary ata_piix controller that easily, but I guess that if I tried long and hard enough, it might suffer the same fate (or maybe not?).

I do not know whether this behaviour is somehow related to the ACPI detection problems mentioned at the beginning of this report, or whether these are two different bugs. But it seems only SATA is affected (so far I haven't noticed anything else).

Steps to reproduce: All of this is clearly reproducible, see above.
Created attachment 10174 [details] Kernel log (dmesg)
Created attachment 10175 [details] Kernel configuration for 2.6.20-rc5-git3 on the ASUS A8Js
Created attachment 10176 [details] lspci output
Created attachment 10177 [details] lspci -n output
Created attachment 10178 [details] lspci -vvv output
Please attach a copy of /proc/interrupts.
Does the system behave any better if booted with "pci=noacpi"?
Created attachment 10187 [details] dmesg when using pci=noacpi
Created attachment 10188 [details] /proc/interrupts when using pci=noacpi
Nope, when using pci=noacpi, everything seems to be pretty much the same. See dmesg and /proc/interrupts submitted to the bugreport.
Created attachment 10190 [details] (.tar.bz2) Results of Linux-ready Firmware Developer Kit r1 automatic tests

I don't know if it helps at all, but here are the results of the Linux-ready Firmware Developer Kit R1 automatic tests. There should also be a listing of the ACPI tables as the kit sees them.
ASUS refused to help with fixing the BIOS, saying that ASUS notebooks do not support Linux. This means that either some special workaround has to be found for this, or the notebook will really be unusable with Linux. :(
Created attachment 10197 [details] dmesg with acpi=off

I tried booting with acpi=off. The messages about the MMCONFIG problem and about PCI being unable to allocate resource regions for the PCI-E controller and the SiI 3132 SATA controller disappeared, but the problem with SATA still remains the same. This makes me wonder whether it may actually be a problem in libata or some such. On the other hand, the following strange messages appeared (see attached dmesg):
--------------------
[ 39.917334] Uhhuh. NMI received for unknown reason 2d.
[ 39.917344] Do you have a strange power saving mode enabled?
[ 39.917349] Dazed and confused, but trying to continue
[ 39.918217] Uhhuh. NMI received for unknown reason 00.
[ 39.918229] Do you have a strange power saving mode enabled?
[ 39.918234] Dazed and confused, but trying to continue
--------------------
Created attachment 10206 [details] Just a testing patch to find out what's wrong with the resources.

I tested this little patch just to find out more precisely what's wrong with the PCI resources of the ICH7's Express Port 4 (bridge) and the SiI 3132 ExpressCard attached to it, as mentioned in the initial bug report.
Created attachment 10207 [details] Results of the testing patch

This is the result of the testing patch:
---------
...
[ 22.179122] PCI: Device 0000:00:1c.3, Resource 7: 0x0-0xfff, flags 0x100, "PCI Bus #04", parent PCI IO: 0x0-0xffff
[ 22.179127] PCI: Cannot allocate resource region 7 of bridge 0000:00:1c.3
[ 22.179131] PCI: Device 0000:00:1c.3, Resource 8: 0x0-0xfffff, flags 0x200, "PCI Bus #04", parent PCI mem: 0x0-0xffffffffffffffff
[ 22.179136] PCI: Cannot allocate resource region 8 of bridge 0000:00:1c.3
[ 22.179140] PCI: Device 0000:00:1c.3, Resource 9: 0x0-0xfffff, flags 0x1201, "PCI Bus #04", parent PCI mem: 0x0-0xffffffffffffffff
[ 22.179145] PCI: Cannot allocate resource region 9 of bridge 0000:00:1c.3
...
[ 22.179341] PCI: Device 0000:04:00.0, Resource 0: 0xfe9ffc00-0xfe9ffc7f, flags 0x204, "0000:04:00.0", parent NOT FOUND: 0x0-0x0
[ 22.179346] PCI: Cannot allocate resource region 0 of device 0000:04:00.0
[ 22.179350] PCI: Device 0000:04:00.0, Resource 2: 0xfe9f8000-0xfe9fbfff, flags 0x204, "0000:04:00.0", parent NOT FOUND: 0x0-0x0
[ 22.179355] PCI: Cannot allocate resource region 2 of device 0000:04:00.0
[ 22.179359] PCI: Device 0000:04:00.0, Resource 4: 0xdc00-0xdc7f, flags 0x101, "0000:04:00.0", parent NOT FOUND: 0x0-0x0
[ 22.179364] PCI: Cannot allocate resource region 4 of device 0000:04:00.0
...
---------

So it turns out that (as expected) the PCI-E bridge does not get any IO/MEM regions assigned: its three windows all start at 0x0. The ExpressCard does get its regions, but the kernel is unable to locate the parent area from which they should have been allocated (surprisingly ;). However, as you can see later in the dmesg, both of them are subsequently (re)assigned resources by the kernel:
---------------
...
[ 22.182944] got res [50100000:5017ffff] bus [50100000:5017ffff] flags 7200 for BAR 6 of 0000:04:00.0
[ 22.182948] got res [50000000:50003fff] bus [50000000:50003fff] flags 204 for BAR 2 of 0000:04:00.0
[ 22.182959] PCI: moved device 0000:04:00.0 resource 2 (204) to 0
[ 22.182962] got res [50004000:5000407f] bus [50004000:5000407f] flags 204 for BAR 0 of 0000:04:00.0
[ 22.182974] PCI: moved device 0000:04:00.0 resource 0 (204) to 0
[ 22.182977] got res [1000:107f] bus [1000:107f] flags 101 for BAR 4 of 0000:04:00.0
[ 22.182984] PCI: moved device 0000:04:00.0 resource 4 (101) to 1000
[ 22.182986] PCI: Bridge: 0000:00:1c.3
[ 22.182990] IO window: 1000-1fff
[ 22.182997] MEM window: 50000000-500fffff
[ 22.183002] PREFETCH window: 50100000-501fffff
...
---------------
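The BARs that the BIOS left on 0000:04:00.0 hint at where the bridge windows were probably meant to be. The following is only a sketch of that reasoning, not the attached patch: it assumes the usual PCI-to-PCI bridge granularity (1 MiB for memory windows, 4 KiB for I/O windows) and simply rounds the BAR ranges printed by the testing patch outwards. The function names are made up for illustration, and the prefetchable window cannot be derived this way because no prefetchable BAR was reported.

```python
# Illustrative sketch: smallest aligned bridge windows that would enclose the
# BIOS-assigned BARs of 0000:04:00.0 (values copied from the testing-patch log).
MEM_ALIGN = 1 << 20  # bridge memory windows have 1 MiB granularity
IO_ALIGN = 1 << 12   # bridge I/O windows have 4 KiB granularity

# (start, end, kind) for Resource 0, 2 and 4 of 0000:04:00.0
BARS = [
    (0xFE9FFC00, 0xFE9FFC7F, "mem"),
    (0xFE9F8000, 0xFE9FBFFF, "mem"),
    (0x0000DC00, 0x0000DC7F, "io"),
]


def enclosing_window(ranges, align):
    """Round the span covered by `ranges` outwards to `align` boundaries."""
    lowest = min(start for start, _ in ranges)
    highest = max(end for _, end in ranges)
    return lowest & ~(align - 1), highest | (align - 1)


def bridge_windows(bars):
    """Group BARs by kind and compute one candidate window per kind."""
    windows = {}
    for kind, align in (("mem", MEM_ALIGN), ("io", IO_ALIGN)):
        ranges = [(start, end) for start, end, k in bars if k == kind]
        if ranges:
            windows[kind] = enclosing_window(ranges, align)
    return windows


if __name__ == "__main__":
    for kind, (start, end) in bridge_windows(BARS).items():
        print(f"{kind}: {start:#x}-{end:#x}")
```

Under these assumptions the candidates come out as a memory window of 0xfe900000-0xfe9fffff and an I/O window of 0xd000-0xdfff; whether the BIOS really intended exactly these windows is a guess.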
Created attachment 10208 [details] See, that BIOS apparently allocated all the regions, just somehow forgot to tell it to the poor PCI-E Port 4 bridge.

If you sort the regions the BIOS assigned to the PCI devices, the apparent gaps in the allocations clearly reveal 3 regions (2 MEM, 1 I/O) that were supposed to be assigned to the PCI-E bridge in question. The regions assigned to the device behind the bridge confirm this, as you can see from the table in the attachment.
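The gap hunting described here can be sketched in a few lines. This is a minimal illustration, not the method used to build the attached table: `find_gaps` is a hypothetical helper, ranges are inclusive, and the sample regions in the usage part are made-up neighbours, not values from the attachment.

```python
def find_gaps(regions, lo, hi):
    """Return the unassigned inclusive (start, end) holes inside [lo, hi].

    `regions` is an iterable of inclusive (start, end) ranges that are
    already assigned; they may be unsorted and may overlap.
    """
    gaps = []
    cursor = lo  # first address not yet known to be assigned
    for start, end in sorted(regions):
        if start > cursor:
            gaps.append((cursor, min(start - 1, hi)))
        cursor = max(cursor, end + 1)
        if cursor > hi:
            break
    if cursor <= hi:
        gaps.append((cursor, hi))
    return gaps


if __name__ == "__main__":
    # Made-up example: two assigned 1 MiB regions with a 1 MiB hole between them.
    assigned = [(0xFEA00000, 0xFEAFFFFF), (0xFE800000, 0xFE8FFFFF)]
    for start, end in find_gaps(assigned, 0xFE800000, 0xFEAFFFFF):
        print(f"gap: {start:#x}-{end:#x}")
```

A hole found this way is only a candidate for a forgotten bridge window; it still has to be checked against the BARs of the devices behind the bridge, as the attachment does.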
Created attachment 10209 [details] Quick and dirty, just to confirm my hypothesis. This really is just to confirm my hypothesis about the missing gaps, it's really not meant to be a solution to the problem.
Created attachment 10212 [details] Results of the quick and dirty patch

OK, the quick and dirty patch apparently does exactly what the BIOS had in mind (in this one particular case; it's still not a general solution, of course). All PCI resources are now assigned as they should have been. However, the problem with SATA still remains, which (I think) we may clearly consider a bug independent of the BIOS bug of unassigned resources! So there really seems to be a problem somewhere in libata or the respective drivers (ata_piix and sata_sil24). But I still don't know what those strange SATA messages mean. Can anyone help me with this one, please?
Your drive is either timing out or reporting an ICRC error (ATA bus error), which indicates data transfer problems. This is more likely a hardware problem. Please apply common hardware debugging methods.

* Rewire power and SATA connectors one by one and see which of them the errors follow.
* Use a different power supply to power the disks.
* Use a different controller or computer to verify that the disks work properly.
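For readers trying to follow how the "res" lines lead to that reading, here is a small decoding sketch. It assumes that the first two fields of a libata "res" line are the ATA status and error registers, and it uses the register bit names as I recall them from the ATA specification (DSC, CORR and IDX are legacy names); `decode_res` is a hypothetical helper, not part of any kernel tool.

```python
import re

# ATA status register bits (legacy names included).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
# ATA error register bits; only meaningful when ERR is set in the status.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF"}

# Assumed layout of the line: "res <status>/<error>:..." in hex.
RES_RE = re.compile(r"res ([0-9a-f]{2})/([0-9a-f]{2}):")


def decode(value, names):
    """List the names of all bits set in `value`, highest bit first."""
    return [name for bit, name in sorted(names.items(), reverse=True)
            if value & bit]


def decode_res(line):
    """Decode status and error registers from one libata 'res' log line."""
    match = RES_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    return decode(status, STATUS_BITS), decode(error, ERROR_BITS)


if __name__ == "__main__":
    line = ("[ 231.871061] res 51/84:00:08:00:00/00:00:00:00:00/e0 "
            "Emask 0x10 (ATA bus error)")
    print(decode_res(line))
```

For the `res 51/84` lines from ata3.00 this gives ERR in the status register and ICRC plus ABRT in the error register, which matches the "ATA bus error" classification. The `res 40/00` line from ata2.01 decodes to DRDY with no error bits, consistent with a plain timeout.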
That would be a little problematic, since it is a notebook, but I'll see what I can do. However, the reason why I think it is not a HW problem is that under The Other System (TM) both controllers (the internal ICH7 and the ExpressCard SiI 3132) work perfectly with the same HW setup. That's why I think it has to be a SW problem, at least with the SiI 3132: on Linux it detects the disk, but when I try mounting it, for instance, it fails (mostly ending in a complete shutdown of that HDD/channel), while on The Other System (TM) the disk is accessible without any problem. I admit that The Other System (TM) may suppress any error messages that may possibly be generated, but that does not change the fact that it works.
I also have the same problem with Debian Etch running the 2.6.18 kernel. My lspci output is:

00:00.0 Host bridge: Intel Corporation 975X Express Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation 975X Express PCI Express Root Port
00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller (rev 01)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GH (ICH7DH) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA Storage Controller IDE (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 VGA compatible controller: ATI Technologies Inc RV370 5B60 [Radeon X300 (PCIE)]
01:00.1 Display controller: ATI Technologies Inc RV370 [Radeon X300SE]
04:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
05:00.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
05:02.0 Communication controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface
05:04.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)
05:05.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
I did some more tests here, and disabling NCQ didn't help me. I also tried changing the controller to AHCI. It didn't help.

ata2.00: speed down requested but no transfer mode left
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (irq_stat 0x40000001)
ata2.00: tag 0 cmd 0xc4 Emask 0x1 stat 0x51 err 0x4 (device error)
ata2: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata2.00: speed down requested but no transfer mode left
... (repeated)
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (irq_stat 0x40000001)
ata2.00: tag 0 cmd 0xc4 Emask 0x1 stat 0x51 err 0x4 (device error)
sd 1:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Aborted Command
Additional sense: No additional sense information
end_request: I/O error, dev sda, sector 17325460
ata2: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata2: exception Emask 0x50 SAct 0x0 SErr 0x90800 action 0x2 frozen
ata2: (irq_stat 0x00400000, PHY RDY changed)
ata2: waiting for device to spin up (8 secs)
ata2: soft resetting port
ata2: softreset failed (1st FIS failed)
ata2: softreset failed, retrying in 5 secs
ata2: hard resetting port
ata2: port is slow to respond, please be patient
ata2: port failed to respond (30 secs)
ata2: COMRESET failed (device not ready)
ata2: hardreset failed, retrying in 5 secs
ata2: hard resetting port
ata2: port is slow to respond, please be patient
ata2: port failed to respond (30 secs)
ata2: COMRESET failed (device not ready)
ata2: reset failed, giving up
ata2.00: disabled
ata2: EH complete
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 17198756
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 242310092
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 17198756
...

After the end_request errors the machine becomes unusable. It doesn't look like it's my hard drive, since I moved it to another machine with a different chipset and it works well there. The machine has a 900W power supply, so that shouldn't be the cause either.
Is it possible that it has something to do with ata_piix driving both the SATA and the PATA part of the ICH, and that this somehow also influences other SATA controllers in the system? I ask because I also tried Knoppix 5.1 (with a 2.6.19 kernel), which is configured so that only the SATA part of the ICH is controlled by ata_piix, while the PATA part seems to be controlled by a generic IDE (or something?) driver (I'm not sure how exactly that is supposed to be achieved). There the ICH SATA does not seem to generate these messages and it seems to run OK (at least for the while that I tried it, during which some messages would otherwise definitely have appeared).

Unfortunately, because of the BIOS bug mentioned earlier, I wasn't able to test the SiI 3132 controller there: the kernel compiled for Knoppix does not regenerate the resources that the BIOS did not assign to the PCI-E bridge, so the controller is not visible to the system.
Oh, and I very much doubt that there is any problem with the connection of the disk to the ICH controller (at least in my case), since there is no cable. It's a notebook solution, where there is a sort of riser card coming directly from the notebook's baseboard, carrying the SATA connector into which the disk is directly plugged. I've also tested the disk in a different (desktop) computer with a different SATA controller, and there the disk runs without problems, so there is no problem with the disk itself either. In Knoppix (configured as described in my previous comment) it also seems to run OK. So I still think this IS a software bug somewhere in the kernel. Even though the nature of the messages suggests a HW problem, I'm still not convinced.
Martin, I see. The reason why I was suggesting a hardware problem is that the ahci and sata_sil24 drivers are very different and almost completely separate. It's very difficult for me to imagine a bug in the common libata core code which can trigger the errors you reported on both drivers. In addition, ahci and sata_sil24 are the most popular libata drivers. If the type of bug you described were in there, we should be seeing a *LOT* of bug reports about it, but so far there hasn't been any similar report. Plus, this is the first time I've seen a bug report where sil24 fails with "failed to transmit command FIS". I dunno, there is something very weird with your setup. It might be that the BIOS didn't set up the PCI Express bridge properly or somesuch. Can you post the result of 'lspci -tv' and 'lspci -nnvvv'?

Otavio, your problem doesn't seem to be similar. The SError register in your controller is reporting link problems, which are often observed when there is

1. a power problem (it's always worth considering even when you have a monster PSU)
2. signal interference

Grab a short SATA cable, connect the HDD to another power connector or, even better, another PSU and see what happens. If it still fails, please open a separate bug report and attach the result of 'dmesg' including the detection messages, preferably with timestamps.

Thanks.
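To see which link-level conditions an SErr value stands for, the bits can be decoded by hand. The sketch below uses the SError bit layout as I recall it from the Serial ATA specification (the same assignments appear as SERR_* constants in libata); the short descriptions are my own wording and `decode_serror` is a hypothetical helper.

```python
# SError register bits (bit number -> short description).
SERR_BITS = {
    0: "recovered data integrity error",
    1: "recovered communication error",
    8: "transient data integrity error",
    9: "persistent communication error",
    10: "protocol error",
    11: "internal error",
    16: "PHYRDY changed",
    17: "PHY internal error",
    18: "COMWAKE detected",
    19: "10b-to-8b decode error",
    20: "disparity error",
    21: "CRC error",
    22: "handshake error",
    23: "link sequence error",
    24: "transport state transition error",
    25: "unrecognized FIS type",
    26: "device exchanged",
}


def decode_serror(value):
    """Return descriptions of all known bits set in an SError value."""
    return [name for bit, name in SERR_BITS.items() if value & (1 << bit)]


if __name__ == "__main__":
    for serr in (0x90800, 0x4050000):
        print(f"SErr {serr:#x}: {', '.join(decode_serror(serr))}")
```

With this table, SErr 0x90800 from the ata2 log decodes to an internal error, a PHYRDY change and a 10b-to-8b decode error, while SErr 0x4050000 from the ata3 log decodes to a PHYRDY change, COMWAKE and a device exchange. Both are link-level events rather than errors reported by the drive itself.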
Tejun, I've found a bad sector on the disk, and the SMART long test also fails on at least one sector. I sent the HD in for replacement and I'm waiting for a new one. My power supply has 900W, so I doubt it's lack of power, since this is the only disk in the machine. What is the best way to arrange the cables? All together? All near the power cables?
If you have only one hard disk attached to a power rail, capacity-wise it should be way more than enough. I'm just suggesting standard hardware debugging so that we don't waste time barking up the wrong tree. It happens quite often with SATA, probably because the high-frequency serial connection is very susceptible to interference. If you perform an electromagnetic interference test on a machine, SATA is the first one to squeak, and the same goes for poor power quality (insufficient power and/or noise in the power output). The checks below are not too difficult to do and will save a lot of both your time and mine.

1. Does using a different SATA port change anything?
2. Does using a different power connector from a different cable make any difference?
3. Does changing the cable make any difference?

Thanks.
Created attachment 10314 [details] lspci -nnvvv
Created attachment 10315 [details] lspci -tv
Tejun, about the BIOS problem: this is no longer an issue here. I've found the bug and made a temporary workaround for the specific BIOS/HW combination that I have (as seen in my previous messages). I've also contacted ASUS with what I've found, in the hope that now that they no longer need to spend hours searching for the bug, but only minutes fixing it, they will do so. So this need no longer bother us with respect to the SATA.

As for the ICH7 SATA, it's the 0x8086:0x27c4 (00:1c.2), and according to page 23 of the Intel I/O Controller Hub 7 (ICH7) Family Specification Update from Dec. 2006, this is a "Mobile Non-AHCI and Non-RAID Mode" of the SATA controller. There is no BIOS feature that would be able to switch it into AHCI mode (if that is even possible with this specific model/variant of the ICH7 chip). So we're not talking about the AHCI driver here, but rather about ata_piix (which is no doubt also used by a lot of people).

Now I see that I read the logs a little wrong: I thought it was the ICH7 channel with the internal SATA HDD that reported those strange errors, but it was actually the PATA channel with the integrated DVD-RW on it. So it seems my suspicion was right that this particular part of my problem is indeed a problem of ata_piix driving both the SATA and the PATA part of the ICH7. To prove that, I recompiled the kernel with

CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

so that the generic IDE driver got control over the ICH7's PATA channel and ata_piix only drives the SATA part, and all of a sudden the messages generated by the DVD-RW are gone. However, I'm not sure whether that has any impact on the performance of either of them. Let's hope it doesn't, but anyway, there is definitely something wrong in ata_piix.

The sata_sil24 problem, then, is apparently an entirely different matter. The problems there seem to remain, which again means that those problems are not connected; they were just two problems happening at the same time and giving similar messages - that's why I thought they were connected. I'll try some more experiments with the HW there.
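Attributing exceptions to the wrong device is easy with flattened logs, so a quick cross-check can help. The sketch below is only an illustration: it pairs each libata "exception" line with the device type announced at detection time, relying on the message wording seen in the logs in this report (other kernel versions may phrase these lines differently). `summarize` is a hypothetical helper.

```python
import re
from collections import Counter

# Detection lines such as "ata2.01: ATAPI, max UDMA/33" or "ata1.00: ATA-7, ..."
IDENT_RE = re.compile(r"(ata\d+\.\d+): (ATAPI|ATA-\d+),")
# Error-handler lines such as "ata3.00: exception Emask 0x0 ..."
EXC_RE = re.compile(r"(ata\d+(?:\.\d+)?): exception Emask")


def summarize(dmesg_text):
    """Map each device with exceptions to (device type, exception count)."""
    kinds = {}
    errors = Counter()
    for line in dmesg_text.splitlines():
        ident = IDENT_RE.search(line)
        if ident:
            kinds[ident.group(1)] = ident.group(2)
        exc = EXC_RE.search(line)
        if exc:
            errors[exc.group(1)] += 1
    return {dev: (kinds.get(dev, "unknown"), count)
            for dev, count in errors.items()}


if __name__ == "__main__":
    sample = """\
[ 40.853499] ata1.00: ATA-7, max UDMA/100, 234441648 sectors: LBA48 NCQ (depth 0/32)
[ 41.231326] ata2.01: ATAPI, max UDMA/33
[ 49.222610] ata3.00: ATA-7, max UDMA/133, 156312576 sectors: LBA
[ 193.907771] ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[ 231.828794] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
[ 232.470053] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
"""
    for dev, (kind, count) in summarize(sample).items():
        print(dev, kind, count)
```

On the sample lines taken from the original report, this shows the ata_piix-side exceptions coming from ata2.01, the ATAPI device (the DVD-RW), and none from ata1.00, the internal SATA disk.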
Created attachment 10319 [details] The quick and dirty patch fixing the BIOS bug, this one is for the BIOS version 211 (the previous was for BIOS version 210)
OK, so after some testing it seems that, after resolving the problem with the BIOS and replacing ata_piix alone with ata_piix plus the generic IDE driver, the problem of sata_sil24 shutting down really is a bad connection. The disk is placed in an external SATA box connected over a SATA->eSATA cable to the ExpressCard SiI 3132 controller plugged into the notebook. Unfortunately it seems that the problem is somewhere within the external SATA box, because when I take the disk out of the box and plug the SATA-eSATA cable from the controller directly into the disk, it seems to work OK: no more disk shutdowns and none of those strange messages.

So, as a summary, it seems that the only real kernel bug that comes out of this whole bug report of mine boils down to ata_piix being unable to properly handle PATA together with SATA, especially when there is a CD/DVD drive attached to the PATA.
Is the BIOS hack still needed, or has the BIOS been fixed? If it is still needed, we ought to kick that upstream anyway, but keyed to the DMI data.
It is sad to say so, but the hack is still needed. :( ASUS have promised me that they will fix this in the next BIOS release, which I presume would be version 212 (see the BIOSes here: <http://support.asus.com/download/download.aspx?SLanguage=en-us&model=A8Js>). Unfortunately nothing else has happened since, and it's been quite a while now (about 4 months). :( I wonder if they really plan to release another BIOS anytime, or whether they just told me this to get rid of me bugging them about it, knowing that there would actually be no other BIOS released.

Anyway, so far these patches are needed. If you'd like to push some variant of them upstream, I think it makes sense for now. However, those two quick and dirty patches that I've proposed here are really ugly. They are based on the fact that each version of the BIOS always does the allocation in the same way (always to the same addresses). So I just manually (using the testing patch attached to the bug report) discovered the unassigned space prepared for the PCI-E Port 4 bridge resources and hardwired these areas into the hardware during bootup, before they are needed for discovering all the devices that hide beneath the port. However, this solution depends on the version of the BIOS that is currently installed on the particular computer, and I don't know if it is possible to find out the version automatically during the boot process in order to decide which ranges to hardwire. So I guess we would have to be a bit smarter if we want to include this workaround in mainline.

If the patches are not applied, the kernel does eventually allocate some resources for the PCI-E Port 4 bridge, but this happens far too late for all the devices beneath the bridge to be discovered and configured properly and thus be usable.
I think that the perfect solution to this (and perhaps also to similar problems with other BIOSes on other computers) would be to add one more stage of checking, and possibly artificial allocation/assignment, of the PCI resources of the top-level PCI devices before they are actually touched by the routines that discover all the PCI devices. Or perhaps to do this on every level (with respect to the various PCI bridges) before the devices beyond each bridge are discovered.

I was looking for some place in the kernel PCI discovery routines that either allows for actions like these or where they could easily be added, but haven't found anything like that. To tell you the truth, I got a bit lost in all that, and I didn't feel I understood the routines well enough to make such a rather big intrusion there. So I gave up and just did those two quick and dirty patches instead.

If there is anyone who understands these routines better than I do and would like to implement the mechanism that I briefly sketched above (or something similar), that would be really great. I can help test it on this particular hardware.

Martin
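On the open question of finding out the BIOS version automatically: from userspace, kernels that expose DMI data under /sys/class/dmi/id provide a bios_version file, and inside the kernel the DMI matching helpers would presumably be the place for such a check. The sketch below only illustrates the "table keyed by BIOS version" idea. The window values are placeholders, not the ranges from the attached patches, and the function names are hypothetical.

```python
from pathlib import Path

# HYPOTHETICAL table: BIOS version -> windows to hardwire for bridge 00:1c.3.
# The numbers are placeholders, NOT the ranges used in the attached patches.
QUIRK_WINDOWS = {
    "210": {"io": (0x1000, 0x1FFF), "mem": (0x10000000, 0x100FFFFF)},
    "211": {"io": (0x2000, 0x2FFF), "mem": (0x20000000, 0x200FFFFF)},
}


def read_bios_version(dmi_dir="/sys/class/dmi/id"):
    """Return the BIOS version string from sysfs, or None if unavailable."""
    try:
        return (Path(dmi_dir) / "bios_version").read_text().strip()
    except OSError:
        return None


def select_windows(version, table=QUIRK_WINDOWS):
    """Pick the window set for a known BIOS version; None means no quirk."""
    return table.get(version)


if __name__ == "__main__":
    version = read_bios_version()
    windows = select_windows(version)
    if windows is None:
        print(f"BIOS version {version!r}: no known quirk, leaving bridge alone")
    else:
        print(f"BIOS version {version!r}: would hardwire {windows}")
```

Returning no quirk for unknown versions is a deliberate choice here: a future BIOS that fixes the allocation, or lays it out differently, should not have stale ranges forced onto the bridge.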
OK, open a new bug, assign it to me and attach the output of dmidecode, lspci -vvxxx and any other info you think is useful, and I'll take a look (or, if it's hard, assign it to GregKH, the PCI man ;))
OK, will do so tomorrow (since I'm going to have to get the notebook from a colleague that is currently using it ;), no problem. Thanks a lot. Martin
Okay, closing this one.
Where is the new bug that this discussion has moved to? I am still having problems with kernel 2.6.29-3 and BIOS 213 (which I would imagine might contain this fix). Googling the "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6" line reveals a lot of these problems with other drives and controllers. Any chance it could be a deeper kernel problem with the bus communication?