Bug 7883

Summary: Intel PCI-E bridge ACPI resources and possibly related SATA instability problems on ASUS A8Js
Product: IO/Storage Reporter: Martin Drab (martin.drab)
Component: Serial ATA Assignee: Tejun Heo (htejun)
Status: REJECTED INVALID    
Severity: high CC: alan, htejun, mizzao, otavio
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.20-rc5-git3 Subsystem:
Regression: --- Bisected commit-id:
Bug Depends on: 8595    
Bug Blocks:    
Attachments: Kernel log (dmesg)
Kernel configuration for 2.6.20-rc5-git3 on the ASUS A8Js
lspci output
lspci -n output
lspci -vvv output
dmesg when using pci=noacpi
/proc/interrupts when using pci=noacpi
(.tar.bz2) Results of Linux-ready Firmware Developer Kit r1 automatic tests
dmesg with acpi=off
Just a testing patch to find out what's wrong with the resources.
Results of the testing patch
See, that BIOS apparently allocated all the regions, just somehow forgot to tell it to the poor PCI-E Port 4 bridge.
Quick and dirty, just to confirm my hypothesis.
Results of the quick and dirty patch
lspci -nnvvv
lspci -tv
The quick and dirty patch fixing the BIOS bug, this one is for the BIOS version 211 (the previous was for BIOS version 210)

Description Martin Drab 2007-01-24 19:09:06 UTC
Most recent kernel where this bug did *NOT* occur:
Distribution: FC6
Hardware Environment: x86_64, Intel Core 2 Duo, i945 PM, Silicon Image SiI 3132
on PCI-E
Software Environment:
Problem Description:

Hi,

recently I got my hands on an ASUS A8Js notebook (Core 2 Duo T7200,
Intel 945 PM PCI-E Chipset, for details see attached log). After booting
the latest 2.6.20-rc5-git3 kernel (but the same behaviour occurs also with
the 2.6.19.2, didn't try any other), some strange behaviour can be
observed.

At first the following messages appear in the log:

...
[   40.303154] PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
[   40.303157] PCI: Not using MMCONFIG.

(not sure whether this is really a problem)

...
[   40.400899] PCI: Using ACPI for IRQ routing
[   40.400902] PCI: If a device doesn't work, try "pci=routeirq".  If it helps,
post a report
[   40.400911] PCI: Cannot allocate resource region 7 of bridge 0000:00:1c.3
[   40.400914] PCI: Cannot allocate resource region 8 of bridge 0000:00:1c.3
[   40.400917] PCI: Cannot allocate resource region 9 of bridge 0000:00:1c.3

and if I have the Kouwell 5652E SATA-II ExpressCard (with Silicon Image
SiL 3132 controller) connected to the notebook's ExpressCard slot, also
the following messages appear after the above ones:

[   40.400999] PCI: Cannot allocate resource region 0 of device 0000:04:00.0
[   40.401003] PCI: Cannot allocate resource region 2 of device 0000:04:00.0
[   40.401006] PCI: Cannot allocate resource region 4 of device 0000:04:00.0
...

where

00:1c.3 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 4
(rev 02)
04:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II
Controller (rev 01)

Both SATA controllers (the on-board one using ata_piix and the ExpressCard one
using sata_sil24) are then detected:

...
[   40.639495] ata_piix 0000:00:1f.2: version 2.00ac7
[   40.639500] ata_piix 0000:00:1f.2: MAP [ P0 P2 IDE IDE ]
[   40.640063] ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 19
[   40.640763] PCI: Setting latency timer of device 0000:00:1f.2 to 64
[   40.640814] ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0xFFA0 irq 14
[   40.641467] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xFFA8 irq 15
[   40.642114] scsi0 : ata_piix
[   40.853499] ata1.00: ATA-7, max UDMA/100, 234441648 sectors: LBA48 NCQ (depth
0/32)
[   40.854188] ata1.00: ata1: dev 0 multi count 16
[   40.860490] ata1.00: configured for UDMA/100
[   40.860881] scsi1 : ata_piix
[   41.025049] ATA: abnormal status 0x7F on port 0x177
[   41.045585] ATA: abnormal status 0x7F on port 0x177
[   41.231326] ata2.01: ATAPI, max UDMA/33
[   41.417213] ata2.01: configured for UDMA/33
[   41.426449] scsi 0:0:0:0: Direct-Access     ATA      FUJITSU MHV2120B 0000
PQ: 0 ANSI: 5
[   41.436554] SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB)
[   41.447026] sda: Write Protect is off
[   41.457669] sda: Mode Sense: 00 3a 00 00
[   41.457693] SCSI device sda: write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
[   41.469438] SCSI device sda: 234441648 512-byte hdwr sectors (120034 MB)
[   41.481481] sda: Write Protect is off
[   41.493694] sda: Mode Sense: 00 3a 00 00
[   41.493708] SCSI device sda: write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
[   41.507017]  sda: sda1 sda2 sda3 sda4
[   41.604667] sd 0:0:0:0: Attached scsi disk sda
[   41.622803] scsi 1:0:1:0: CD-ROM            TSSTcorp CD/DVDW TS-L632D AS05
PQ: 0 ANSI: 5
...
[   48.817403] sata_sil24 0000:04:00.0: version 0.3
[   48.817432] ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 19 (level, low) -> IRQ 19
[   48.817475] PCI: Setting latency timer of device 0000:04:00.0 to 64
[   48.817531] ata3: SATA max UDMA/100 cmd 0xFFFFC20000040000 ctl 0x0 bmdma 0x0
irq 19
[   48.817585] ata4: SATA max UDMA/100 cmd 0xFFFFC20000042000 ctl 0x0 bmdma 0x0
irq 19
[   48.817593] scsi2 : sata_sil24
[   49.068344] Bluetooth: HCI USB driver ver 2.9
[   49.142279] usbcore: registered new interface driver hci_usb
[   49.222098] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   49.222610] ata3.00: ATA-7, max UDMA/133, 156312576 sectors: LBA
[   49.222612] ata3.00: ata3: dev 0 multi count 0
[   49.223246] ata3.00: configured for UDMA/100
[   49.223260] scsi3 : sata_sil24
[   49.525900] ata4: SATA link down (SStatus 0 SControl 300)
[   49.526898] scsi 2:0:0:0: Direct-Access     ATA      Maxtor 6Y080M0   YAR5
PQ: 0 ANSI: 5
[   49.527093] SCSI device sdb: 156312576 512-byte hdwr sectors (80032 MB)
[   49.527103] sdb: Write Protect is off
[   49.527105] sdb: Mode Sense: 00 3a 00 00
[   49.527269] SCSI device sdb: write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
[   49.527309] SCSI device sdb: 156312576 512-byte hdwr sectors (80032 MB)
[   49.527317] sdb: Write Protect is off
[   49.527319] sdb: Mode Sense: 00 3a 00 00
[   49.527337] SCSI device sdb: write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
[   49.527341]  sdb:<5>sd 0:0:0:0: Attached scsi generic sg0 type 0
[   49.536071] scsi 1:0:1:0: Attached scsi generic sg1 type 5
[   49.552648]  sdb1
[   49.552700] sd 2:0:0:0: Attached scsi disk sdb
[   49.552733] sd 2:0:0:0: Attached scsi generic sg2 type 0
...

However, strange messages similar to the following appear periodically for
both controllers (possibly as a result of the system accessing the disk, though
not on every access):

...
[  193.907771] ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[  193.927526] ata2.01: cmd a0/01:00:00:00:00/00:00:00:00:00/b0 tag 0 cdb 0x25
data 8 in
[  193.927529]          res 40/00:03:00:00:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
[  201.007291] ata2: port is slow to respond, please be patient (Status 0xd0)
[  223.974196] ata2: port failed to respond (30 secs, Status 0xd0)
[  224.000456] ata2: soft resetting port
[  224.163274] ATA: abnormal status 0x7F on port 0x177
[  224.568225] ata2.01: configured for UDMA/33
[  224.568237] ata2: EH complete
[  231.828794] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
[  231.849673] ata3.00: (irq_stat 0x00020002, device error via D2H FIS)
[  231.871059] ata3.00: cmd c8/00:00:08:00:00/00:00:00:00:00/e0 tag 0 cdb 0x0
data 131072 in
[  231.871061]          res 51/84:00:08:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA
bus error)
[  232.217493] ata3: soft resetting port
[  232.318609] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  232.319782] ata3.00: configured for UDMA/100
[  232.319792] ata3: EH complete
[  232.340315] SCSI device sdb: 156312576 512-byte hdwr sectors (80032 MB)
[  232.363118] sdb: Write Protect is off
[  232.385795] sdb: Mode Sense: 00 3a 00 00
[  232.387452] SCSI device sdb: write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
[  232.470053] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
[  232.492477] ata3.00: (irq_stat 0x00020002, device error via D2H FIS)
[  232.515254] ata3.00: cmd c8/00:00:f8:16:00/00:00:00:00:00/e0 tag 0 cdb 0x0
data 131072 in
[  232.515256]          res 51/84:00:f8:16:00/00:00:00:00:00/e0 Emask 0x10 (ATA
bus error)
[  232.864801] ata3: soft resetting port
[  232.965875] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  232.967043] ata3.00: configured for UDMA/100
[  232.967055] ata3: EH complete
[  232.988447] SCSI device sdb: 156312576 512-byte hdwr sectors (80032 MB)
[  233.012277] sdb: Write Protect is off
[  233.035602] sdb: Mode Sense: 00 3a 00 00
[  233.037536] SCSI device sdb: write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
...

Until after some seconds of accessing the disk on the external controller,
the whole disk is shut down:

...
[  269.026999] ata3.00: limiting speed to UDMA/33
[  269.058614] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[  269.090354] ata3.00: (irq_stat 0x00020002, failed to transmit command FIS)
[  269.122048] ata3.00: cmd c8/00:08:4f:00:00/00:00:00:00:00/e0 tag 0 cdb 0x0
data 4096 in
[  269.122051]          res c8/00:08:4f:00:00/00:00:00:00:00/e0 Emask 0x12 (ATA
bus error)
[  269.184814] ata3: hard resetting port
[  271.414531] ata3: softreset failed (port not ready)
[  271.445584] ata3: follow-up softreset failed, retrying in 5 secs
[  276.474431] ata3: hard resetting port
[  278.805006] ata3: softreset failed (port not ready)
[  278.835656] ata3: follow-up softreset failed, retrying in 5 secs
[  283.864906] ata3: hard resetting port
[  286.094543] ata3: softreset failed (port not ready)
[  286.124723] ata3: reset failed, giving up
[  286.154334] ata3.00: disabled
[  286.183738] ata3: EH pending after completion, repeating EH (cnt=4)
[  286.294424] ata3: exception Emask 0x10 SAct 0x0 SErr 0x4050000 action 0x6 frozen
[  286.324117] ata3: (irq_stat 0x00060002, failed to transmit command FIS)
[  286.874068] ata3: waiting for device to spin up (8 secs)
[  294.289524] ata3: hard resetting port
[  296.519160] ata3: softreset failed (port not ready)
[  296.519741] ata3: follow-up softreset failed, retrying in 5 secs
[  301.518100] ata3: hard resetting port
[  303.748733] ata3: softreset failed (port not ready)
[  303.748738] ata3: follow-up softreset failed, retrying in 5 secs
[  308.746700] ata3: hard resetting port
[  311.077274] ata3: softreset failed (port not ready)
[  311.077277] ata3: reset failed, giving up
[  311.077284] sd 2:0:0:0: SCSI error: return code = 0x08000002
[  311.077286] sdb: Current [descriptor]: sense key: Aborted Command
[  311.077289]     Additional sense: Scsi parity error
[  311.077294] Descriptor sense data with sense descriptors (in hex):
[  311.077296]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
[  311.077302]         00 00 00 4f
[  311.077304] end_request: I/O error, dev sdb, sector 79
[  311.077314] ata3: EH complete
[  311.077320] ata3.00: detaching (SCSI 2:0:0:0)
[  311.077323] EXT3-fs: can't read group descriptor 1
[  311.174529] Synchronizing SCSI cache for disk sdb:
[  311.196553] FAILED
[  311.196555]   status = 0, message = 00, host = 4, driver = 00
[  311.196556]   <3>ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
...

I didn't manage to shut down the disk on the primary ata_piix controller
that easily, but I guess when I'd try long and hard enough, it may suffer
the same fate (or maybe not?).

I do not know whether this behaviour is somehow related to the ACPI detection
problems stated at the beginning of the mail, or whether these are two different
bugs. But it seems only the SATA is affected (so far I haven't noticed anything
else).

Steps to reproduce:

All of this is clearly reproducible, see above.
Comment 1 Martin Drab 2007-01-24 19:11:23 UTC
Created attachment 10174 [details]
Kernel log (dmesg)
Comment 2 Martin Drab 2007-01-24 19:13:41 UTC
Created attachment 10175 [details]
Kernel configuration for 2.6.20-rc5-git3 on the ASUS A8Js
Comment 3 Martin Drab 2007-01-24 19:14:24 UTC
Created attachment 10176 [details]
lspci output
Comment 4 Martin Drab 2007-01-24 19:14:54 UTC
Created attachment 10177 [details]
lspci -n output
Comment 5 Martin Drab 2007-01-24 19:15:33 UTC
Created attachment 10178 [details]
lspci -vvv output
Comment 6 Len Brown 2007-01-24 20:34:43 UTC
please attach a copy of /proc/interrupts

Does the system behave any better if booted with "pci=noacpi"?
Comment 7 Martin Drab 2007-01-25 04:02:17 UTC
Created attachment 10187 [details]
dmesg when using pci=noacpi
Comment 8 Martin Drab 2007-01-25 04:03:07 UTC
Created attachment 10188 [details]
/proc/interrupts when using pci=noacpi
Comment 9 Martin Drab 2007-01-25 04:06:44 UTC
Nope, when using pci=noacpi, everything seems to be pretty much the same. See
dmesg and /proc/interrupts submitted to the bugreport.
Comment 10 Martin Drab 2007-01-25 08:39:46 UTC
Created attachment 10190 [details]
(.tar.bz2) Results of Linux-ready Firmware Developer Kit r1 automatic tests

I don't know if it helps at all, but here are the results of the Linux-ready
Firmware Developer Kit R1 automatic tests. There should also be a listing of
ACPI as I see it.
Comment 11 Martin Drab 2007-01-26 04:11:44 UTC
ASUS refused to help fix the BIOS, saying that ASUS notebooks do not
support Linux. This means that either some special workaround has to be
found for this, or the notebook really will be unusable with Linux. :(
Comment 12 Martin Drab 2007-01-26 10:01:40 UTC
Created attachment 10197 [details]
dmesg with acpi=off

I tried booting with acpi=off. The messages about the problem with MMCONFIG and
about PCI being unable to allocate resource regions for the PCI-E controller
and the SiI 3132 SATA controller disappeared, but the problem with SATA still
remains the same. This makes me wonder whether it may actually be a
problem in libata or some such?

On the other hand the following strange message appeared (see attached dmesg):

--------------------
[   39.917334] Uhhuh. NMI received for unknown reason 2d.
[   39.917344] Do you have a strange power saving mode enabled?
[   39.917349] Dazed and confused, but trying to continue
[   39.918217] Uhhuh. NMI received for unknown reason 00.
[   39.918229] Do you have a strange power saving mode enabled?
[   39.918234] Dazed and confused, but trying to continue
--------------------
Comment 13 Martin Drab 2007-01-28 06:00:05 UTC
Created attachment 10206 [details]
Just a testing patch to find out what's wrong with the resources.

I tested this little patch just to find out more precisely what's wrong with
the PCI resources of the ICH7's Express Port 4 (bridge) and its attached
SiI 3132 ExpressCard, as mentioned in the initial bugreport.
Comment 14 Martin Drab 2007-01-28 06:14:05 UTC
Created attachment 10207 [details]
Results of the testing patch

This is the result of the testing patch:

---------
...
[   22.179122] PCI: Device 0000:00:1c.3, Resource 7: 0x0-0xfff, flags 0x100,
"PCI Bus #04", parent PCI IO: 0x0-0xffff
[   22.179127] PCI: Cannot allocate resource region 7 of bridge 0000:00:1c.3
[   22.179131] PCI: Device 0000:00:1c.3, Resource 8: 0x0-0xfffff, flags 0x200,
"PCI Bus #04", parent PCI mem: 0x0-0xffffffffffffffff
[   22.179136] PCI: Cannot allocate resource region 8 of bridge 0000:00:1c.3
[   22.179140] PCI: Device 0000:00:1c.3, Resource 9: 0x0-0xfffff, flags 0x1201,
"PCI Bus #04", parent PCI mem: 0x0-0xffffffffffffffff
[   22.179145] PCI: Cannot allocate resource region 9 of bridge 0000:00:1c.3
...
[   22.179341] PCI: Device 0000:04:00.0, Resource 0: 0xfe9ffc00-0xfe9ffc7f,
flags 0x204, "0000:04:00.0", parent NOT FOUND: 0x0-0x0
[   22.179346] PCI: Cannot allocate resource region 0 of device 0000:04:00.0
[   22.179350] PCI: Device 0000:04:00.0, Resource 2: 0xfe9f8000-0xfe9fbfff,
flags 0x204, "0000:04:00.0", parent NOT FOUND: 0x0-0x0
[   22.179355] PCI: Cannot allocate resource region 2 of device 0000:04:00.0
[   22.179359] PCI: Device 0000:04:00.0, Resource 4: 0xdc00-0xdc7f, flags
0x101, "0000:04:00.0", parent NOT FOUND: 0x0-0x0
[   22.179364] PCI: Cannot allocate resource region 4 of device 0000:04:00.0
...
---------

So it turns out, that (as expected) the PCI-E bridge does not get the IO/MEM
regions assigned. The ExpressCard does get them, but is unable to locate the
parent area from which it should have been allocated (surprisingly;).
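
To make the "parent NOT FOUND" result concrete, here is a minimal Python sketch
of the containment rule as I understand it: a device BAR can only be claimed
from a bridge window of the same type that fully contains it. This is not the
kernel's code. The ranges and flags are copied from the log above; the
Resource class and the find_parent helper are hypothetical names, and the flag
values 0x100 (I/O) and 0x200 (memory) are assumed to correspond to the kernel's
IORESOURCE_IO and IORESOURCE_MEM.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

IO, MEM = 0x100, 0x200          # assumed to mirror IORESOURCE_IO / IORESOURCE_MEM
TYPE_MASK = IO | MEM


@dataclass(frozen=True)
class Resource:
    name: str
    start: int
    end: int                    # inclusive, as printed in the log
    flags: int


def find_parent(child: Resource, windows: Sequence[Resource]) -> Optional[Resource]:
    """Return the first window of the same type that fully contains child."""
    for window in windows:
        same_type = (window.flags & TYPE_MASK) == (child.flags & TYPE_MASK)
        if same_type and window.start <= child.start and child.end <= window.end:
            return window
    return None


# Windows of bridge 00:1c.3 as the BIOS left them (values from the log).
bridge_windows = [
    Resource("bridge IO", 0x0, 0xfff, 0x100),
    Resource("bridge MEM", 0x0, 0xfffff, 0x200),
    Resource("bridge PREFETCH", 0x0, 0xfffff, 0x1201),
]

# BARs of the SiI 3132 at 04:00.0 as the BIOS assigned them (values from the log).
card_bars = [
    Resource("BAR 0", 0xfe9ffc00, 0xfe9ffc7f, 0x204),
    Resource("BAR 2", 0xfe9f8000, 0xfe9fbfff, 0x204),
    Resource("BAR 4", 0xdc00, 0xdc7f, 0x101),
]

for bar in card_bars:
    parent = find_parent(bar, bridge_windows)
    print(bar.name, "->", parent.name if parent else "parent NOT FOUND")
```

None of the three BARs fits into the zero-based windows the bridge was left
with, which matches what the testing patch printed.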

However as you may see later in the dmesg, both of them are later (re)assigned
these resources by kernel:

---------------
...
[   22.182944]	 got res [50100000:5017ffff] bus [50100000:5017ffff] flags 7200
for BAR 6 of 0000:04:00.0
[   22.182948]	 got res [50000000:50003fff] bus [50000000:50003fff] flags 204
for BAR 2 of 0000:04:00.0
[   22.182959] PCI: moved device 0000:04:00.0 resource 2 (204) to 0
[   22.182962]	 got res [50004000:5000407f] bus [50004000:5000407f] flags 204
for BAR 0 of 0000:04:00.0
[   22.182974] PCI: moved device 0000:04:00.0 resource 0 (204) to 0
[   22.182977]	 got res [1000:107f] bus [1000:107f] flags 101 for BAR 4 of
0000:04:00.0
[   22.182984] PCI: moved device 0000:04:00.0 resource 4 (101) to 1000
[   22.182986] PCI: Bridge: 0000:00:1c.3
[   22.182990]	 IO window: 1000-1fff
[   22.182997]	 MEM window: 50000000-500fffff
[   22.183002]	 PREFETCH window: 50100000-501fffff
...
---------------
Comment 15 Martin Drab 2007-01-28 06:57:28 UTC
Created attachment 10208 [details]
See, that BIOS apparently allocated all the regions, just somehow forgot to tell it to the poor PCI-E Port 4 bridge.

If you sort the BIOS regions assigned to the PCI devices, the apparent gaps in
the allocations clearly reveal 3 regions (2 MEM, 1 I/O) which were supposed to
be assigned to the PCI-E bridge in question. Even the regions assigned to the
subdevice of the bridge confirm that, as you can see from the table in the
attachment.
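
The sorting-and-gap-spotting idea can be sketched in a few lines of Python.
This is only an illustration of the method, not the way the attachment was
produced. The find_gaps helper is a hypothetical name, and the neighbouring
addresses in the example are made up; they are chosen only so that the
resulting hole covers the card's memory BARs from the earlier log
(0xfe9f8000 and 0xfe9ffc00). The real table is in the attachment.

```python
from typing import List, Tuple

Region = Tuple[int, int]        # (start, end), end inclusive


def find_gaps(regions: List[Region]) -> List[Region]:
    """Sort the regions by start address and return the holes between them."""
    if not regions:
        return []
    ordered = sorted(regions)
    gaps = []
    cursor = ordered[0][1]      # highest end address seen so far
    for start, end in ordered[1:]:
        if start > cursor + 1:
            gaps.append((cursor + 1, start - 1))
        cursor = max(cursor, end)
    return gaps


# Hypothetical neighbouring allocations, NOT taken from the attachment.
assigned = [
    (0xfea00000, 0xfeafffff),
    (0xfe800000, 0xfe8fffff),
    (0xfeb00000, 0xfeb03fff),
]

for start, end in find_gaps(assigned):
    print(f"unassigned: {start:#x}-{end:#x}")
```

A hole of this kind, sitting right where the child device's BARs point, is what
suggests the BIOS reserved the space but never programmed the bridge windows.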
Comment 16 Martin Drab 2007-01-28 07:11:19 UTC
Created attachment 10209 [details]
Quick and dirty, just to confirm my hypothesis.

This really is just to confirm my hypothesis about the missing gaps, it's
really not meant to be a solution to the problem.
Comment 17 Martin Drab 2007-01-28 07:27:02 UTC
Created attachment 10212 [details]
Results of the quick and dirty patch

OK, the quick and dirty patch apparently does exactly what the BIOS had in mind
(in this one particular case; it's still not a general solution of course). All
PCI resources are now assigned as they should have been.

However, the problem with the SATA still remains. Which (I think) we may
clearly consider a bug independent of the BIOS bug of unassigned resources! So
there really seems to be a problem somewhere in libata or the respective
drivers (ata_piix and sata_sil24). But I still don't know what those strange
SATA messages mean. Can anyone help me on this one, please?
Comment 18 Tejun Heo 2007-01-29 17:47:17 UTC
Your drive is either timing out or reporting ICRC error (ATA bus error)
indicating data transfer problems.  This is more likely a hardware problem. 
Please apply common hardware debugging methods.

* Rewire power and SATA connectors one-by-one and see to which the errors are
attached.

* Use different power supply to power the disks.

* Use different controller or computer to verify the disks work properly.
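
For readers who want to see how the "res" lines map to the two cases named
above (timeout vs. ICRC/ATA bus error), here is a rough Python sketch of a
decoder. It assumes the first two bytes of "res" are the ATA status and error
registers and that the Emask bits follow libata's AC_ERR_* values; the
decode_res helper and the tables are hypothetical names for illustration, and
bits not listed in the tables are simply ignored.

```python
import re

# Subsets of the ATA status/error register bits; unlisted bits are ignored.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
# Assumed to mirror libata's AC_ERR_* error mask bits.
EMASK_BITS = {
    0x01: "device error",
    0x02: "HSM violation",
    0x04: "timeout",
    0x08: "media error",
    0x10: "ATA bus error",
    0x20: "host bus error",
}

RES_RE = re.compile(r"res ([0-9a-f]{2})/([0-9a-f]{2}):.*?Emask (0x[0-9a-f]+)")


def names(value, table):
    """List the names of all bits from table that are set in value."""
    return [name for bit, name in table.items() if value & bit]


def decode_res(line):
    """Decode status, error and Emask from one (unwrapped) libata 'res' line."""
    match = RES_RE.search(line)
    if match is None:
        raise ValueError("no 'res' field found")
    status, error, emask = (int(group, 16) for group in match.groups())
    return {
        "status": names(status, STATUS_BITS),
        "error": names(error, ERROR_BITS),
        "emask": names(emask, EMASK_BITS),
    }


# The two kinds of result seen in the report, with the log line wraps removed.
print(decode_res("res 51/84:00:08:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)"))
print(decode_res("res 40/00:03:00:00:00/00:00:00:00:00/b0 Emask 0x4 (timeout)"))
```

The ata3 result decodes to ERR in the status register with ICRC and ABRT in
the error register, while the ata2 result carries no error bits at all and is
flagged purely as a timeout.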
Comment 19 Martin Drab 2007-01-30 06:48:00 UTC
That would be a little problematic, since it is a notebook, but I'll see what I
can do.

However, the reason why I think it is not a HW problem is that under The Other
System (TM) both controllers (the internal ICH7 and ExpressCard SiI 3132) work
perfectly with the same HW setup. That's why I think it has to be a SW problem.
At least with the SiI 3132, because on Linux it detects the disk and when, for
instance, I try mounting it, it fails (mostly ending in a complete shutdown of
that HDD/channel), while on The Other System (TM) the disk is accessible without
any problem. I admit that The Other System (TM) may suppress any error messages
that may possibly be generated, but that does not change the fact that it works.
Comment 20 Otavio Salvador 2007-02-01 16:07:46 UTC
I also have the same problem with Debian Etch running 2.6.18 kernel.

My lspci output is:

00:00.0 Host bridge: Intel Corporation 975X Express Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation 975X Express PCI Express Root Port
00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition
Audio Controller (rev 01)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1
(rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express
Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express
Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI
Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GH (ICH7DH) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller
(rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA
Storage Controller IDE (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 VGA compatible controller: ATI Technologies Inc RV370 5B60 [Radeon X300
(PCIE)]
01:00.1 Display controller: ATI Technologies Inc RV370 [Radeon X300SE]
04:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
05:00.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
05:02.0 Communication controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN
interface
05:04.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000
Controller (PHY/Link)
05:05.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid]
Serial ATA Controller (rev 02)
Comment 21 Otavio Salvador 2007-02-02 17:12:01 UTC
I did some more tests here and disabling NCQ didn't help.

I tried to change the controller to AHCI. It didn't help.

ata2.00: speed down requested but no transfer mode left
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (irq_stat 0x40000001)
ata2.00: tag 0 cmd 0xc4 Emask 0x1 stat 0x51 err 0x4 (device error)
ata2: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata2.00: speed down requested but no transfer mode left
... (repeated)

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (irq_stat 0x40000001)
ata2.00: tag 0 cmd 0xc4 Emask 0x1 stat 0x51 err 0x4 (device error)
sd 1:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Aborted Command
    Additional sense: No additional sense information
end_request: I/O error, dev sda, sector 17325460
ata2: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata2: exception Emask 0x50 SAct 0x0 SErr 0x90800 action 0x2 frozen
ata2: (irq_stat 0x00400000, PHY RDY changed)
ata2: waiting for device to spin up (8 secs)
ata2: soft resetting port
ata2: softreset failed (1st FIS failed)
ata2: softreset failed, retrying in 5 secs
ata2: hard resetting port
ata2: port is slow to respond, please be patient
ata2: port failed to respond (30 secs)
ata2: COMRESET failed (device not ready)
ata2: hardreset failed, retrying in 5 secs
ata2: hard resetting port          
ata2: port is slow to respond, please be patient
ata2: port failed to respond (30 secs)
ata2: COMRESET failed (device not ready)
ata2: reset failed, giving up
ata2.00: disabled
ata2: EH complete
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 17198756
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 242310092
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 17198756
...

After the end_request errors the machine becomes unusable. It doesn't look to be
my hard drive, since I moved it to another machine with a different chipset and
it works well there. The machine has a 900W power supply, so that shouldn't be
the cause either.
Comment 22 Martin Drab 2007-02-04 09:23:44 UTC
Is it possible that it may have something to do with ata_piix driving both the
SATA and PATA parts of the ICH, and that it somehow also influences other SATA
controllers in the system? I ask because I also tried Knoppix 5.1 (with a 2.6.19
kernel), which is configured in such a way that only the SATA part of the ICH is
controlled by ata_piix and the PATA part seems to be controlled by a general
IDE (or something?) driver (I'm not sure how exactly that is supposed to be
achieved). There the ICH SATA does not seem to be generating these messages
and it seems to run OK (at least for the while that I tried it, during which
some messages would otherwise definitely have appeared).

Unfortunately, because of the earlier mentioned BIOS bug, I wasn't able to test
the SiI 3132 controller there, because the kernel compiled in Knoppix does not
regenerate the resources that the BIOS did not assign to the PCI-E bridge, and
so the controller is not visible to the system.
Comment 23 Martin Drab 2007-02-04 09:32:37 UTC
Oh, and I very much doubt that there would be any problem in the connection of
the disk to the ICH controller (at least in my case), since there is no cable.
It's a notebook solution, where there is a sort of riser card, coming directly
from the notebook's baseboard, carrying the SATA connector into which the disk
is directly plugged.

And I've tested the disk on a different (desktop) computer with a different SATA
controller and there the disk runs without problems, so there is no problem with
the disk itself either.

Also in Knoppix (which is configured as described in my previous mail) it seems
to be running OK. So I still think this IS a software bug somewhere in the
kernel. Even though the nature of the messages may suggest a HW problem, I'm
still not that convinced of it.
Comment 24 Tejun Heo 2007-02-06 01:41:36 UTC
Martin, I see.  The reason why I was suggesting hardware problem is that the
ahci and sata_sil24 drivers are very different and almost completely separate. 
It's very difficult for me to imagine a bug in common libata core code which can
trigger errors you reported on both drivers.

In addition, ahci and sata_sil24 are the most popular libata drivers.  If there
is the type of bug you described in there, we should be seeing a *LOT* of bug
reports about it but so far there hasn't been any similar report.  Plus, this is
the first time I see a bug report where sil24 fails with "failed to transmit
command FIS".

I dunno, there is something very weird with your setup.  It might be that BIOS
didn't set up PCI express bridge properly or somesuch.  Can you post the result
of 'lspci -tv' and 'lspci -nnvvv'?

Otavio, your problem doesn't seem to be similar.  The SError register in your
controller is reporting link problems which are often observed when there is

1. power problem (it's always worth considering even when you have monster PSU)
2. signal interference

Grab a short SATA cable, connect the HDD to another power connector or even
better, another PSU and see what happens.  If it still fails, please open a
separate bug report and attach the result of 'dmesg' including the detection
messages and preferably with timestamp.
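
To show what "reporting link problems" refers to, here is a small Python
sketch that splits an SErr value from the logs into named bits. The bit names
follow my reading of the SATA SError register layout and should be treated as
assumptions rather than an authoritative table; decode_serror is a
hypothetical helper name.

```python
# SError bit positions (assumed from the SATA register layout).
SERR_BITS = {
    0: "recovered data error",
    1: "recovered communication error",
    8: "data integrity error",
    9: "persistent communication error",
    10: "protocol error",
    11: "internal error",
    16: "PHY RDY changed",
    17: "PHY internal error",
    18: "COMWAKE detected",
    19: "10b-to-8b decode error",
    20: "disparity error",
    21: "CRC error",
    22: "handshake error",
    23: "link sequence error",
    24: "transport state transition error",
    25: "unrecognized FIS",
    26: "device exchanged",
}


def decode_serror(value):
    """Return the names of all known SError bits set in value."""
    return [name for bit, name in SERR_BITS.items() if value & (1 << bit)]


# SErr values that appear in the logs posted to this report.
for serr in (0x90800, 0x4050000):
    print(f"SErr {serr:#x}: {', '.join(decode_serror(serr))}")
```

Bits 16 and above describe the link itself, which is why values like these
point towards cabling, power or interference rather than the drive's media.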

Thanks.
Comment 25 Otavio Salvador 2007-02-06 02:07:01 UTC
Tejun,

I've found a bad sector on the disk and also I got the SMART long tests failing
on at least one sector. I sent the HD for replacement and I'm waiting for a new one.

My power supply is rated at 900W, so I doubt it's a lack of power, since this is
the only disk in the machine. What is the best way to arrange the cables? All
together? All close to the power cables?
Comment 26 Tejun Heo 2007-02-06 06:30:44 UTC
If you have only one harddisk attached to a power rail, capacity-wise it should
be way more than enough.  I'm just suggesting standard hardware debugging so
that we don't waste time barking up the wrong tree.  It happens quite often with
SATA probably because the high-frequency serial connection is very susceptible
to interference.

If you perform an electromagnetic interference test on a machine, SATA is the
first one to squeak and the same goes for poor power quality (insufficient power
and/or noise in the power output).

The checks below are not too difficult to do and will save a lot of both your time and mine.

1. Does using different SATA port change anything?

2. Does using different power connector from different cable make any difference?

3. Does changing cable make any difference?

Thanks.
Comment 27 Martin Drab 2007-02-06 08:14:00 UTC
Created attachment 10314 [details]
lspci -nnvvv
Comment 28 Martin Drab 2007-02-06 08:14:53 UTC
Created attachment 10315 [details]
lspci -tv
Comment 29 Martin Drab 2007-02-06 09:34:39 UTC
Tejun, about the BIOS problem, this is no longer an issue here. I've found
the bug and I did a temporary workaround for this specific BIOS/HW
combination that I have (as seen in my previous messages). I've also contacted
ASUS with what I've found, in the hope that now, when they no longer need to
spend hours searching for the bug but only minutes fixing it, they will
do so. So this need no longer bother us with respect to the SATA.

As for the ICH7 SATA, it's the 0x8086:0x27c4 (00:1f.2) and according to page
23 of the Intel I/O Controller Hub 7 (ICH7) Family Specification Update from
Dec. 2006, this is a "Mobile Non-AHCI and Non-RAID Mode" of the SATA controller.
And there is no BIOS feature that would be able to switch it into AHCI mode (if
that is even possible with this specific model/variant of the ICH7 chip). So
we're not talking about the AHCI driver here, but rather about ata_piix (which
no doubt is also used by a lot of people).

Now I see that I've read the logs a little wrong, as I thought it was the
channel with the internal SATA HDD of the ICH7 that reported those strange
errors, but now I see it was rather the PATA channel with an integrated DVD-RW
on it. So it seems my suspicion was right that this particular part of my
problem is indeed a problem of ata_piix driving both the SATA and PATA parts of
the ICH7. To prove that, I've recompiled the kernel with

CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

so that the generic IDE driver got control over the ICH7's PATA channel and
ata_piix only drives the SATA part, and all of a sudden those messages generated
by the DVD-RW are gone. However, I'm not sure whether that has any impact on the
performance of either of them. Let's hope it doesn't, but anyway, there is
definitely something wrong in ata_piix.

The sata_sil24 then apparently is an entirely different matter. The problems
there seem to remain, which again means that those problems are not connected;
they were just two problems happening at the same time and giving similar
messages, which is why I thought they were connected. I'll try some more
experiments with the HW there.
Comment 30 Martin Drab 2007-02-06 09:38:29 UTC
Created attachment 10319 [details]
The quick and dirty patch fixing the BIOS bug, this one is for the BIOS version 211 (the previous was for BIOS version 210)
Comment 31 Martin Drab 2007-02-06 13:05:11 UTC
OK, so after some testing, it seems that after resolving the problem with the
BIOS and replacing ata_piix alone with ata_piix plus the generic IDE driver, the
problem of the sata_sil24 shutting down really is a bad connection.

The disk is placed in an external SATA box connected over a SATA->eSATA cable to
the ExpressCard SiI 3132 controller plugged into the notebook. Unfortunately, it
seems that the problem is somewhere within the external SATA box, because when I
take the disk out of the box and plug the SATA->eSATA cable from the controller
directly into the disk, it seems to work OK: no more disk shutdowns and no more
of those strange messages.


So as a summary, it seems that the only real kernel bug left from this whole
bug report of mine is ata_piix being unable to properly handle PATA together
with SATA, especially when there is a CD/DVD drive attached to the PATA channel.
Comment 32 Alan 2007-06-05 08:25:57 UTC
Is the BIOS hack still needed, or has the BIOS been fixed? If it is still
needed, we ought to kick that upstream anyway, but keyed to the DMI data.
Comment 33 Martin Drab 2007-06-05 09:28:40 UTC
Sad to say, the hack is still needed. :( ASUS promised me that they would fix
this in the next BIOS release, which I presume would be version 212 (see the
BIOSes here:
<http://support.asus.com/download/download.aspx?SLanguage=en-us&model=A8Js>).
Unfortunately, nothing has happened since, and it's been quite a while now
(about 4 months). :( I wonder whether they really plan to release another BIOS
at some point, or whether they just told me this to stop me bugging them about
it, knowing that there would actually be no other BIOS released.

Anyway, so far, these patches are needed. If you'd like to push some variant of
them upstream, I think it makes sense for now. However, the two quick and dirty
patches that I've proposed here are really ugly. They rely on the fact that each
version of the BIOS always does the allocation in the same way (always to the
same addresses). So I've just manually (using the testing patch attached to the
bug report) found the unassigned space prepared for the PCI-E Port 4 bridge
resources and hardwired these areas into the hardware during bootup, before
they are needed for discovering all the devices that hide beneath the port.

However this solution is dependent on the version of the BIOS that is currently
installed on the particular computer and I don't know if it is possible to get
to know the version automatically during the boot process in order to decide
which ranges to hardwire. So I guess we would have to be a bit smarter if we
want to include this workaround into the mainline.
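
As a rough illustration of keying the workaround to the DMI data: on kernels
that expose the DMI strings in sysfs, the product name and BIOS version can be
read from /sys/class/dmi/id/, and inside the kernel the equivalent check would
go through the DMI matching helpers such as dmi_check_system(). The sketch
below only shows the selection logic from user space. The "A8Js" match string,
the bare "210"/"211" version strings and the table contents are assumptions
made for illustration; the real DMI strings of this machine would have to be
taken from the dmidecode output.

```python
from pathlib import Path

# Hypothetical table: BIOS version string -> which hand-made patch applies.
# Only versions 210 and 211 are known from this report; the exact strings the
# BIOS reports through DMI are an assumption here.
KNOWN_WORKAROUNDS = {
    "210": "quick and dirty patch for BIOS 210",
    "211": "quick and dirty patch for BIOS 211",
}

DMI_DIR = Path("/sys/class/dmi/id")


def read_dmi(field, dmi_dir=DMI_DIR):
    """Return one DMI string from sysfs, or None if it cannot be read."""
    try:
        return (Path(dmi_dir) / field).read_text().strip()
    except OSError:
        return None


def pick_workaround(product_name, bios_version):
    """Select a workaround only for the affected model and a known BIOS."""
    if product_name is None or bios_version is None:
        return None
    if "A8Js" not in product_name:  # assumed product-name match
        return None
    # Unknown BIOS versions get no hardwired ranges at all.
    return KNOWN_WORKAROUNDS.get(bios_version.strip())


if __name__ == "__main__":
    choice = pick_workaround(read_dmi("product_name"), read_dmi("bios_version"))
    print(choice if choice else "no workaround for this machine/BIOS")
```

The point of the table is that a BIOS version that has not been examined by
hand falls through to "no workaround" instead of getting ranges that were
measured on a different version.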

If the patches are not applied, the kernel eventually allocates some resources
for the PCI-E Port 4 bridge, but it does so far too late for the devices
beneath the bridge to be discovered and configured properly, so they are not
usable. I think the ideal solution to this (and perhaps also to similar problems
with other BIOSes on other computers) would be to add one more stage that checks
and, if necessary, artificially allocates/assigns the PCI resources of the
top-level PCI devices before they are touched by the routines that discover
all the PCI devices. Or perhaps to do this at every level (with respect to the
various PCI bridges) before the devices behind each bridge are discovered.
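
To make the proposed ordering concrete, here is a toy model of it, not kernel
code: each bridge is checked for a window before anything behind it is scanned,
and a missing window is carved out of a pool of free ranges first. The Bridge
class, the single shared free pool, the fixed window size and the example
addresses are all hypothetical simplifications; real bridge windows have
alignment rules and must nest inside the parent's window, which this sketch
ignores.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Bridge:
    name: str
    # (start, end) inclusive, or None if the BIOS left the window unassigned
    window: Optional[Tuple[int, int]] = None
    children: List["Bridge"] = field(default_factory=list)


def fixup_then_scan(bridge, free_ranges, size, log):
    """Assign a missing window first, then scan what is behind the bridge."""
    if bridge.window is None:
        for i, (start, end) in enumerate(free_ranges):
            if end - start + 1 >= size:
                bridge.window = (start, start + size - 1)
                free_ranges[i] = (start + size, end)  # shrink the free range
                log.append(f"assigned {bridge.name}")
                break
    if bridge.window is None:
        # Nothing free: the devices behind this bridge stay unreachable.
        log.append(f"skipped {bridge.name}")
        return
    log.append(f"scanned {bridge.name}")
    for child in bridge.children:
        fixup_then_scan(child, free_ranges, size, log)


# Example with made-up addresses: "port4" has no window from the BIOS.
root = Bridge("root", (0x1000, 0x1FFF), [Bridge("port4")])
events = []
fixup_then_scan(root, [(0x2000, 0x2FFF)], 0x100, events)
print(events)
```

The only thing the model is meant to show is the order of operations: the
check and assignment happen on the way down, so the devices behind the bridge
are never scanned through an unassigned window.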

I looked for some place in the kernel PCI discovery routines that either allows
for actions like these or where they could easily be added, but I haven't found
anything like that. To tell you the truth, I got a bit lost in all of it, and I
didn't feel I understood the routines well enough to make such a rather big
intrusion there. So I gave up and just did those two quick and dirty patches
instead. If there is anyone who understands these routines better than I do and
would like to implement the mechanism that I briefly sketched in the previous
paragraph (or something similar), that would be really great. I can help with
testing it on this particular hardware.

Martin
Comment 34 Alan 2007-06-05 09:37:36 UTC
OK, open a new bug, assign it to me, and attach the output of dmidecode, lspci
-vvxxx, and any other info you think is useful, and I'll take a look (or if it's
hard, assign it to GregKH the PCI man ;))
Comment 35 Martin Drab 2007-06-05 09:52:11 UTC
OK, will do so tomorrow (since I'm going to have to get the notebook from a
colleague that is currently using it ;), no problem. Thanks a lot. Martin
Comment 36 Tejun Heo 2007-06-06 23:56:58 UTC
Okay, closing this one.
Comment 37 Andrew Mao 2009-07-19 20:44:13 UTC
Where is the new bug that this discussion has moved to? I am still having problems with kernel 2.6.29-3 and BIOS 213 (which I would imagine might contain this fix)

Googling the "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6" line reveals a lot of these problems with other drives and controllers. Any chance it could be a deeper kernel problem with the bus communication?