Bug 7883
Description
Martin Drab
2007-01-24 19:09:06 UTC
Created attachment 10174 [details]
Kernel log (dmesg)
Created attachment 10175 [details]
Kernel configuration for 2.6.20-rc5-git3 on the ASUS A8Js
Created attachment 10176 [details]
lspci output
Created attachment 10177 [details]
lspci -n output
Created attachment 10178 [details]
lspci -vvv output
Please attach a copy of /proc/interrupts.
Does the system behave any better if booted with "pci=noacpi"?
Created attachment 10187 [details]
dmesg when using pci=noacpi
Created attachment 10188 [details]
/proc/interrupts when using pci=noacpi
Nope, when using pci=noacpi, everything seems to be pretty much the same. See the dmesg and /proc/interrupts attached to the bug report.
Created attachment 10190 [details]
(.tar.bz2) Results of Linux-ready Firmware Developer Kit r1 automatic tests
I don't know if it helps at all, but here are the results of the Linux-ready
Firmware Developer Kit R1 automatic tests. They should also contain the ACPI
listing as I see it.
ASUS refused to help fix the BIOS, saying that ASUS notebooks do not support Linux. That means that either some special workaround has to be found for this, or the machine will really be unusable with Linux. :(
Created attachment 10197 [details]
dmesg with acpi=off
I tried booting with acpi=off. The messages about the problem with MMCONFIG and
about PCI being unable to allocate resource regions for the PCI-E controller
and the SiI 3132 SATA controller disappeared, but the problem with SATA still
remains the same. That makes me wonder whether this may actually be a
problem in libata or some such.
On the other hand the following strange message appeared (see attached dmesg):
--------------------
[ 39.917334] Uhhuh. NMI received for unknown reason 2d.
[ 39.917344] Do you have a strange power saving mode enabled?
[ 39.917349] Dazed and confused, but trying to continue
[ 39.918217] Uhhuh. NMI received for unknown reason 00.
[ 39.918229] Do you have a strange power saving mode enabled?
[ 39.918234] Dazed and confused, but trying to continue
--------------------
Created attachment 10206 [details]
Just a testing patch to find out what's wrong with the resources.
I tested this little patch just to find out more precisely what's wrong with
the PCI resources of the ICH7's Express Port 4 (bridge) and its attached
SiI 3132 ExpressCard, as mentioned in the initial bug report.
Created attachment 10207 [details]
Results of the testing patch
This is the result of the testing patch:
---------
...
[ 22.179122] PCI: Device 0000:00:1c.3, Resource 7: 0x0-0xfff, flags 0x100,
"PCI Bus #04", parent PCI IO: 0x0-0xffff
[ 22.179127] PCI: Cannot allocate resource region 7 of bridge 0000:00:1c.3
[ 22.179131] PCI: Device 0000:00:1c.3, Resource 8: 0x0-0xfffff, flags 0x200,
"PCI Bus #04", parent PCI mem: 0x0-0xffffffffffffffff
[ 22.179136] PCI: Cannot allocate resource region 8 of bridge 0000:00:1c.3
[ 22.179140] PCI: Device 0000:00:1c.3, Resource 9: 0x0-0xfffff, flags 0x1201,
"PCI Bus #04", parent PCI mem: 0x0-0xffffffffffffffff
[ 22.179145] PCI: Cannot allocate resource region 9 of bridge 0000:00:1c.3
...
[ 22.179341] PCI: Device 0000:04:00.0, Resource 0: 0xfe9ffc00-0xfe9ffc7f,
flags 0x204, "0000:04:00.0", parent NOT FOUND: 0x0-0x0
[ 22.179346] PCI: Cannot allocate resource region 0 of device 0000:04:00.0
[ 22.179350] PCI: Device 0000:04:00.0, Resource 2: 0xfe9f8000-0xfe9fbfff,
flags 0x204, "0000:04:00.0", parent NOT FOUND: 0x0-0x0
[ 22.179355] PCI: Cannot allocate resource region 2 of device 0000:04:00.0
[ 22.179359] PCI: Device 0000:04:00.0, Resource 4: 0xdc00-0xdc7f, flags
0x101, "0000:04:00.0", parent NOT FOUND: 0x0-0x0
[ 22.179364] PCI: Cannot allocate resource region 4 of device 0000:04:00.0
...
---------
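The `flags` values in the excerpt above can be read with a small decoder. This is only a sketch: the bit values are assumed to match the resource flag definitions in `include/linux/ioport.h` of 2.6.20-era kernels (later kernels renumber some of them), and the helper name `decode_flags` is made up for illustration.

```python
# Resource flag bits, ASSUMED to match include/linux/ioport.h in
# 2.6.20-era kernels; newer kernels renumber some of these bits.
IORESOURCE_BITS = 0x000000FF  # bus-specific low bits (raw PCI BAR bits)
FLAG_NAMES = [
    (0x00000100, "IO"),
    (0x00000200, "MEM"),
    (0x00001000, "PREFETCH"),
    (0x00002000, "READONLY"),
    (0x00004000, "CACHEABLE"),
]


def decode_flags(flags):
    """Return the names of the known type bits set in a resource flags word."""
    names = [name for bit, name in FLAG_NAMES if flags & bit]
    low = flags & IORESOURCE_BITS
    if low:
        # Keep the bus-specific bits visible without interpreting them.
        names.append("bus-bits=0x%02x" % low)
    return names


if __name__ == "__main__":
    # Flag values taken from the log excerpt above.
    for value in (0x100, 0x200, 0x1201, 0x204, 0x101):
        print(hex(value), decode_flags(value))
```

Read this way, resources 7, 8 and 9 of the bridge are its I/O, memory and prefetchable memory windows, and BARs 0, 2 and 4 of 0000:04:00.0 are two memory BARs and one I/O BAR.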
So it turns out that (as expected) the PCI-E bridge does not get the IO/MEM
regions assigned. The ExpressCard does get them, but is unable to locate the
parent area from which they should have been allocated (surprisingly ;).
However, as you can see later in the dmesg, both of them are later (re)assigned
these resources by the kernel:
---------------
...
[ 22.182944] got res [50100000:5017ffff] bus [50100000:5017ffff] flags 7200
for BAR 6 of 0000:04:00.0
[ 22.182948] got res [50000000:50003fff] bus [50000000:50003fff] flags 204
for BAR 2 of 0000:04:00.0
[ 22.182959] PCI: moved device 0000:04:00.0 resource 2 (204) to 0
[ 22.182962] got res [50004000:5000407f] bus [50004000:5000407f] flags 204
for BAR 0 of 0000:04:00.0
[ 22.182974] PCI: moved device 0000:04:00.0 resource 0 (204) to 0
[ 22.182977] got res [1000:107f] bus [1000:107f] flags 101 for BAR 4 of
0000:04:00.0
[ 22.182984] PCI: moved device 0000:04:00.0 resource 4 (101) to 1000
[ 22.182986] PCI: Bridge: 0000:00:1c.3
[ 22.182990] IO window: 1000-1fff
[ 22.182997] MEM window: 50000000-500fffff
[ 22.183002] PREFETCH window: 50100000-501fffff
...
---------------
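Why the card's BARs had no parent before the reassignment and do have one afterwards can be illustrated with a simple containment check. This is a simplified model, not the kernel's actual lookup: a BAR finds a parent only if a bridge window of the same type fully contains it. The function names are hypothetical; the address ranges are the ones from the two log excerpts.

```python
def contains(window, res):
    """True if `res` has the same kind as `window` and lies fully inside it.

    Both arguments are (kind, start, end) tuples with inclusive ends.
    """
    return window[0] == res[0] and window[1] <= res[1] and res[2] <= window[2]


def find_parent(windows, res):
    """Return the first bridge window containing `res`, or None."""
    for window in windows:
        if contains(window, res):
            return window
    return None


# Windows of bridge 0000:00:1c.3 as the BIOS left them (first excerpt).
bios_windows = [("io", 0x0, 0xFFF), ("mem", 0x0, 0xFFFFF), ("mem", 0x0, 0xFFFFF)]
# Windows after the kernel reassigned them (second excerpt).
kernel_windows = [
    ("io", 0x1000, 0x1FFF),
    ("mem", 0x50000000, 0x500FFFFF),
    ("mem", 0x50100000, 0x501FFFFF),
]

if __name__ == "__main__":
    # BIOS-assigned BAR 0 of 0000:04:00.0: no window contains it.
    print(find_parent(bios_windows, ("mem", 0xFE9FFC00, 0xFE9FFC7F)))
    # Kernel-assigned BAR 0: it falls inside the MEM window.
    print(find_parent(kernel_windows, ("mem", 0x50004000, 0x5000407F)))
```

With the BIOS values the bridge windows sit at address 0 while the card's BARs sit near 0xfe9f0000, so no parent is found; after the reassignment every BAR lies inside one of the bridge's windows.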
Created attachment 10208 [details]
See, that BIOS apparently allocated all the regions, just somehow forgot to tell it to the poor PCI-E Port 4 bridge.
If you sort the regions the BIOS assigned to the PCI devices, the apparent gaps
in the allocations clearly reveal 3 regions (2 MEM, 1 I/O) which were supposed
to be assigned to the PCI-E bridge in question. Even the regions assigned to
the subdevice of the bridge confirm that, as you can see from the table in the
attachment.
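The gap hunting described above can be sketched as follows. The neighbouring regions in the example are hypothetical placeholders (the real table is in the attachment); only the two BAR ranges of 0000:04:00.0 come from the dmesg shown earlier, and `find_gaps` is an illustrative helper name.

```python
def find_gaps(regions):
    """Return the (start, end) holes between assigned regions.

    `regions` is an iterable of (start, end) tuples with inclusive ends;
    only holes between the lowest and highest region are reported.
    """
    ordered = sorted(regions)
    if not ordered:
        return []
    gaps = []
    cursor = ordered[0][1] + 1  # first address after the lowest region
    for start, end in ordered[1:]:
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    return gaps


# HYPOTHETICAL BIOS-assigned memory regions of other devices.
assigned = [
    (0xFE800000, 0xFE8FFFFF),
    (0xFEA00000, 0xFEAFFFFF),
    (0xFEB00000, 0xFEB03FFF),
]
# BARs 0 and 2 of the SiI 3132 (0000:04:00.0) as reported in the dmesg.
card_bars = [(0xFE9FFC00, 0xFE9FFC7F), (0xFE9F8000, 0xFE9FBFFF)]

if __name__ == "__main__":
    for start, end in find_gaps(assigned):
        inside = [bar for bar in card_bars if start <= bar[0] and bar[1] <= end]
        print("gap %#x-%#x holds %d card BAR(s)" % (start, end, len(inside)))
```

A gap that also contains the BARs of the device behind the bridge is a strong hint that the BIOS reserved that range for the bridge window and then never programmed it into the bridge.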
Created attachment 10209 [details]
Quick and dirty, just to confirm my hypothesis.
This really is just to confirm my hypothesis about the gaps; it's
really not meant to be a solution to the problem.
Created attachment 10212 [details]
Results of the quick and dirty patch
OK, the quick and dirty patch apparently does exactly what the BIOS had in mind
(in this one particular case; it's still not a general solution of course). All
PCI resources are now assigned as they should have been.
However, the problem with the SATA still remains, which (I think) we may
clearly consider a bug independent of the BIOS bug of unassigned resources! So
there really seems to be a problem somewhere in libata or the respective
drivers (ata_piix and sata_sil24). But I still don't know what those strange
SATA messages mean. Can anyone help me with this one, please?
Your drive is either timing out or reporting ICRC error (ATA bus error) indicating data transfer problems. This is more likely a hardware problem. Please apply common hardware debugging methods.
* Rewire power and SATA connectors one-by-one and see to which the errors are attached.
* Use different power supply to power the disks.
* Use different controller or computer to verify the disks work properly.

That would be a little problematic, since it is a notebook, but I'll see what I can do. However, the reason why I think it is not a HW problem is that under The Other System (TM) both controllers (the internal ICH7 and the ExpressCard SiI 3132) work perfectly with the same HW setup. That's why I think it has to be a SW problem. At least with the SiI 3132, because on Linux it detects the disk, but when for instance I try mounting it, it fails (mostly with a complete shutdown of that HDD/channel), while on The Other System (TM) the disk is accessible without any problem. I admit that The Other System (TM) may suppress any error messages that may possibly be generated, but that does not change the fact that it works.

I also have the same problem with Debian Etch running 2.6.18 kernel.
My lspci output is:
00:00.0 Host bridge: Intel Corporation 975X Express Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation 975X Express PCI Express Root Port
00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller (rev 01)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GH (ICH7DH) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA Storage Controller IDE (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 VGA compatible controller: ATI Technologies Inc RV370 5B60 [Radeon X300 (PCIE)]
01:00.1 Display controller: ATI Technologies Inc RV370 [Radeon X300SE]
04:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
05:00.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
05:02.0 Communication controller: Tiger Jet Network Inc. Tiger3XX Modem/ISDN interface
05:04.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)
05:05.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)

I did some more tests here and disabling NCQ didn't help me. I tried to change the controller to AHCI. It didn't help.
ata2.00: speed down requested but no transfer mode left
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (irq_stat 0x40000001)
ata2.00: tag 0 cmd 0xc4 Emask 0x1 stat 0x51 err 0x4 (device error)
ata2: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata2.00: speed down requested but no transfer mode left
... (repeated)
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: (irq_stat 0x40000001)
ata2.00: tag 0 cmd 0xc4 Emask 0x1 stat 0x51 err 0x4 (device error)
sd 1:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Aborted Command
Additional sense: No additional sense information
end_request: I/O error, dev sda, sector 17325460
ata2: EH complete
SCSI device sda: 488397168 512-byte hdwr sectors (250059 MB)
sda: Write Protect is off
SCSI device sda: drive cache: write back
ata2: exception Emask 0x50 SAct 0x0 SErr 0x90800 action 0x2 frozen
ata2: (irq_stat 0x00400000, PHY RDY changed)
ata2: waiting for device to spin up (8 secs)
ata2: soft resetting port
ata2: softreset failed (1st FIS failed)
ata2: softreset failed, retrying in 5 secs
ata2: hard resetting port
ata2: port is slow to respond, please be patient
ata2: port failed to respond (30 secs)
ata2: COMRESET failed (device not ready)
ata2: hardreset failed, retrying in 5 secs
ata2: hard resetting port
ata2: port is slow to respond, please be patient
ata2: port failed to respond (30 secs)
ata2: COMRESET failed (device not ready)
ata2: reset failed, giving up
ata2.00: disabled
ata2: EH complete
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 17198756
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 242310092
sd 1:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 17198756
...
After the end_request errors the machine becomes unusable. It doesn't look to be my hard drive, since I moved it to another machine with a different chipset and it works well there. The machine has a 900W power supply, so that shouldn't be the cause either.

Is it possible that it may have something to do with ata_piix driving both the SATA and PATA parts of the ICH, and that it somehow also influences other SATA controllers in the system? I ask because I also tried Knoppix 5.1 (with a 2.6.19 kernel), which is configured so that only the SATA part of the ICH is controlled by ata_piix and the PATA part seems to be controlled by a generic IDE (or something?) driver (I'm not sure how exactly that is supposed to be achieved). There the ICH SATA does not seem to generate these messages and it seems to run OK (at least for the while that I tried it, during which some messages would otherwise definitely have appeared). Unfortunately, because of the BIOS bug mentioned earlier, I wasn't able to test the SiI 3132 controller there, because the kernel compiled in Knoppix does not regenerate the resources that the BIOS did not assign to the PCI-E bridge, and so the controller is not visible to the system.

Oh, and I very much doubt that there would be any problem in the connection of the disk to the ICH controller (at least in my case), since there is no cable. It's a notebook solution, where there is a sort of riser card, directly from the notebook's baseboard, carrying the SATA connector into which the disk is directly plugged. And I've tested the disk on a different (desktop) computer with a different SATA controller and there the disk runs without problems, so there is no problem with the disk itself either. Also in Knoppix (configured as described in my previous comment) it seems to be running OK. So I still think this IS a software bug somewhere in the kernel.
Even though from the nature of the messages one may think of a HW problem, I'm still not that convinced.

Martin, I see. The reason why I was suggesting a hardware problem is that the ahci and sata_sil24 drivers are very different and almost completely separate. It's very difficult for me to imagine a bug in common libata core code which can trigger the errors you reported on both drivers. In addition, ahci and sata_sil24 are the most popular libata drivers. If there were the type of bug you described in there, we should be seeing a *LOT* of bug reports about it, but so far there hasn't been any similar report. Plus, this is the first time I see a bug report where sil24 fails with "failed to transmit command FIS". I dunno, there is something very weird with your setup. It might be that the BIOS didn't set up the PCI express bridge properly or somesuch. Can you post the result of 'lspci -tv' and 'lspci -nnvvv'?

Otavio, your problem doesn't seem to be similar. The SError register in your controller is reporting link problems, which are often observed when there is
1. a power problem (it's always worth considering even when you have a monster PSU)
2. signal interference
Grab a short SATA cable, connect the HDD to another power connector or, even better, another PSU and see what happens. If it still fails, please open a separate bug report and attach the result of 'dmesg' including the detection messages and preferably with timestamps. Thanks.

Tejun, I've found a bad sector on the disk and I also got the SMART long tests failing on at least one sector. I sent the HD for replacement and I'm waiting for a new one. My power supply has 900W, so I doubt it's lack of power, since I only have this disk in the machine. What is the best way to organize the cables? All together? All near the power ones?

If you have only one hard disk attached to a power rail, capacity-wise it should be way more than enough. I'm just suggesting standard hardware debugging so that we don't waste time barking up the wrong tree. It happens quite often with SATA, probably because the high-frequency serial connection is very susceptible to interference. If you perform an electromagnetic interference test on a machine, SATA is the first one to squeak, and the same goes for poor power quality (insufficient power and/or noise in the power output). The checks below are not too difficult to do and will save a lot of both your and my time.
1. Does using a different SATA port change anything?
2. Does using a different power connector from a different cable make any difference?
3. Does changing the cable make any difference?
Thanks.
Created attachment 10314 [details]
lspci -nnvvv
Created attachment 10315 [details]
lspci -tv
Tejun, about the BIOS problem: this is no longer an issue here. I've found the bug and made a temporary workaround for the specific BIOS/HW combination that I have (as seen in my previous messages). I've also contacted ASUS with what I've found, in the hope that now, when they no longer need to spend hours searching for the bug but rather just minutes fixing it, they will do so. So this no longer needs to bother us with respect to the SATA.

As for the ICH7 SATA, it's the 0x8086:0x27c4 (00:1c.2), and according to page 23 of the Intel I/O Controller Hub 7 (ICH7) Family Specification Update from Dec. 2006, this is a "Mobile Non-AHCI and Non-RAID Mode" of the SATA controller. There is no BIOS feature that would switch it into AHCI mode (if that is even possible with this specific model/variant of the ICH7 chip). So we're not talking about the AHCI driver here, but rather about ata_piix (which no doubt is also used by a lot of people).

Now I see that I read the logs a little wrong. I thought it was the ICH7 channel with the internal SATA HDD that reported those strange errors, but now I see it was rather the PATA channel with the integrated DVD-RW on it. So it seems my suspicion was right that this particular part of my problem is indeed a problem of ata_piix driving both the SATA and PATA parts of the ICH7. To prove that, I've recompiled the kernel with
CONFIG_IDE=yes
CONFIG_BLK_DEV_IDE=yes
so that the generic IDE driver got control over the ICH7's PATA channel and ata_piix only drives the SATA part, and all of a sudden those messages generated by the DVD-RW are gone. However, I'm not sure whether that has any impact on the performance of either of them. Let's hope it doesn't, but anyway, there is definitely something wrong in ata_piix.

The sata_sil24 problem then apparently is an entirely different matter. The problems there seem to remain, which again means that those problems are not connected; they were just two problems happening at the same time and giving similar messages - that's why I thought they were connected. I'll try some more experiments with the HW there.
Created attachment 10319 [details]
The quick and dirty patch fixing the BIOS bug, this one is for the BIOS version 211 (the previous was for BIOS version 210)
OK, so after some testing, it seems that after resolving the problem with the BIOS and replacing ata_piix alone with ata_piix plus the generic IDE driver, the problem of the sata_sil24 shutting down really is a bad connection. The disk is placed in an external SATA box connected over a SATA->eSATA cable to the ExpressCard SiI 3132 controller plugged into the notebook. Unfortunately it seems that the problem is somewhere within the external SATA box, because when I take the disk out of the box and plug the SATA-eSATA cable from the controller directly into the disk, it seems to work OK: no more disk shutdowns and none of those strange messages.

So as a summary, it seems that the only real kernel bug that comes out of this whole bug report of mine comes down to ata_piix being unable to properly handle PATA together with SATA, especially when there is a CD/DVD drive attached to the PATA.

Is the BIOS hack still needed or has the BIOS been fixed? If it is still needed, we ought to kick that upstream anyway, but keyed to the DMI data.

It is sad to say so, but the hack is still needed. :( ASUS have promised me that they will fix this in the next BIOS release, which I presume would be version 212 (see the BIOSes here: <http://support.asus.com/download/download.aspx?SLanguage=en-us&model=A8Js>). Unfortunately nothing else has happened since, and it's been quite a while now (about 4 months). :( I wonder whether they really plan to release another BIOS at some point, or whether they just told me this to get rid of me bugging them about it, knowing that no other BIOS would actually be released.

Anyway, so far, these patches are needed. If you'd like to push some variant of them upstream, I think it makes sense for now. However, those two quick and dirty patches that I've proposed here are really ugly. They are based on the fact that each version of the BIOS always does the allocation in the same way (always to the same addresses). So I've just manually (using the testing patch attached to the bug report) discovered the unassigned space prepared for the PCI-E Port 4 bridge resources and hardwired these areas into the hardware during bootup, before they are needed for discovering all the devices that hide beneath the port. However, this solution depends on the version of the BIOS that is currently installed on the particular computer, and I don't know whether it is possible to find out the version automatically during the boot process in order to decide which ranges to hardwire. So I guess we would have to be a bit smarter if we want to include this workaround in the mainline.

If the patches are not applied, the kernel eventually allocates some resources for the PCI-E Port 4 bridge, but it does so far too late for all the devices beneath the bridge to be discovered and configured properly to make them usable.

I think that the perfect solution to this (and perhaps also to similar problems with other BIOSes on other computers) would be to add one more stage of checking and possible artificial allocation/assignment of the PCI resources of the top-level PCI devices before they are actually touched by the routines that discover all the PCI devices. Or perhaps do this on every level (with respect to the various PCI bridges) before the devices beyond each bridge are discovered. I was looking for some place in the kernel PCI discovery routines that either anticipates actions like these or where they could easily be added, but I haven't found anything like that. To tell you the truth, I got a bit lost in all that and I didn't feel I understood the routines well enough to make such a rather big intrusion there. So I gave up and just did those two quick and dirty patches instead.

If there is anyone who understands these routines better than I do and would like to implement the mechanism that I briefly sketched in the previous paragraph (or something similar), that would be really great. I can help test it on this particular hardware.
Martin

Ok, open a new bug, assign it to me and attach the output of dmidecode, lspci -vvxxx and any other info you think is useful, and I'll take a look (or if it's hard, assign it to GregKH, the PCI man ;))

OK, will do so tomorrow (since I'm going to have to get the notebook from a colleague who is currently using it ;), no problem. Thanks a lot.
Martin

Okay, closing this one.

Where is the new bug that this discussion has moved to? I am still having problems with kernel 2.6.29-3 and BIOS 213 (which I would imagine might contain this fix). Googling the "exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6" line reveals a lot of these problems with other drives and controllers. Any chance it could be a deeper kernel problem with the bus communication?