Hello. Hardware is a Dell PowerEdge SC1425. The machine has two SATA hard disks (they show up as sda and sdb). On 2.6.16.21 and below, everything worked fine. But as soon as I upgraded to 2.6.17, the second disk (sdb) is only detected maybe 30% of the time the machine is booted. There is not even a mention of it in dmesg or anything. The only way to make the machine see sdb is to reboot it enough times, and one of those times it may work if you are lucky. This is not a hardware malfunction. I have a rack of these machines here that all exhibit this behavior. This is definitely dependent on the kernel version: 2.6.16.21 consistently works and 2.6.17 consistently does not. This is the SATA controller in the machines (from lspci): 00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02). I am using the libata driver in the kernel. Please try to fix asap, this is a critical bug for us.
Created attachment 8363 [details] dmesg output with properly detected second sata drive
Created attachment 8364 [details] dmesg output on the same system with the sdb device not detected
I posted some dmesg output. These are both from the same machine taken 1 reboot apart. Notice how one dmesg output says ata2: Sata port has no device and the other one shows how the device is detected the next time around.
Argh... Maybe this one too has flakey PCS. Can you please * post the result of 'lspci -n' * apply the 'debug' patch I'm attaching and report the detected/failed boot messages? Thanks.
Created attachment 8365 [details] debug message patch
Created attachment 8366 [details] dmesg output with properly detected second sata drive with debug patch
Created attachment 8367 [details] dmesg output on the same system with the sdb device not detected with debug patch
Here is lspci -n:
00:00.0 0600: 8086:3590 (rev 09)
00:02.0 0604: 8086:3595 (rev 09)
00:1d.0 0c03: 8086:24d2 (rev 02)
00:1d.1 0c03: 8086:24d4 (rev 02)
00:1d.7 0c03: 8086:24dd (rev 02)
00:1e.0 0604: 8086:244e (rev c2)
00:1f.0 0601: 8086:24d0 (rev 02)
00:1f.1 0101: 8086:24db (rev 02)
00:1f.2 0101: 8086:24d1 (rev 02)
01:00.0 0604: 8086:0329 (rev 09)
01:00.1 0800: 8086:0326 (rev 09)
01:00.2 0604: 8086:032a (rev 09)
01:00.3 0800: 8086:0327 (rev 09)
02:04.0 0200: 8086:1076 (rev 05)
03:07.0 0200: 8086:1026 (rev 04)
04:03.0 0200: 8086:1076 (rev 05)
04:0d.0 0300: 1002:5159
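For anyone checking a rack of machines for the same controller, here is a minimal sketch of picking the 8086:24d1 device (the ICH5 SATA controller at 00:1f.2 in this report) out of `lspci -n` output. The helper names `parse_lspci_n` and `find_device` are made up for illustration, and the regular expression only assumes the line shape shown in the listing above.

```python
import re

# One `lspci -n` line: slot, class code, vendor:device, optional revision.
LINE_RE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<cls>[0-9a-f]{4}):\s+"
    r"(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})"
    r"(?:\s+\(rev (?P<rev>[0-9a-f]{2})\))?"
)


def parse_lspci_n(text):
    """Return one dict per recognizable `lspci -n` line (hypothetical helper)."""
    devices = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            devices.append(match.groupdict())
    return devices


def find_device(devices, vendor, device):
    """Filter parsed entries by vendor and device ID (hypothetical helper)."""
    return [d for d in devices if d["vendor"] == vendor and d["device"] == device]


# Three lines copied from the listing in this report.
SAMPLE = """\
00:1f.1 0101: 8086:24db (rev 02)
00:1f.2 0101: 8086:24d1 (rev 02)
04:0d.0 0300: 1002:5159
"""

sata = find_device(parse_lspci_n(SAMPLE), "8086", "24d1")
for dev in sata:
    print(dev["slot"], dev["cls"], dev["rev"])
```

On a live system the text would come from running `lspci -n` rather than from the embedded sample.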
Created attachment 8375 [details] ignore PCS on ich5 Can you try this patch?
Patch did the trick, thanks! Any chance of this making it into the mainline?
I'll push it to Jeff.
Created attachment 8412 [details] add 100ms delay after modifying PCS As the first fix was a bit drastic, I'm trying to find another less-intrusive fix for the problem. Can you try the attached patch and see how it works (apply it to the vanilla 2.6.17 w/o any patch)? If this doesn't work, I have one more thing to try and it would be great if you can test that one too. Thanks.
Hello. The second patch (with the 100ms delay) did not work, unfortunately. I applied it to all the machines and some still will not see sdb sometimes. The first patch (disable PCS) worked when I had it, however.
Created attachment 8419 [details] ICH5-readonly-PCS Thanks for your testing. Can you try just one more patch? If this one works, I'll push it upstream.
No, ICH5-readonly-PCS did not help either
Created attachment 8433 [details] test_patch0
Created attachment 8434 [details] test_patch1
Created attachment 8435 [details] this patch fixes the problem - READONCE_PCS-patch
Hello, again. I've just attached three patches. The PCS register causes several different problems depending on specific chipset and configuration, so I want to be sure about the problem and I wanna avoid simply setting IGNORE_PCS on all ICH5s. And your PCS seems to report the correct value at first and then becomes all zero for some reason. So, please bear with me. Each patch implements a different fix and contains debug messages. I think I'll be able to collect enough data with these three patches. I need to know... * which one works * failed boot dmesg of not-working ones Thanks a lot.
Created attachment 8436 [details] Failure dmesg from patch 0
Created attachment 8437 [details] Failure dmesg from patch 1
Patch 2 seems to be working for now. I rebooted a few times and the disk appears to come back up. Thank you. Do you think this will be put into the mainline?
Yeah, I think the third one is safe enough for the mainline. It seems that your controller clears PCS while devices on the first port are probed, whether PCS itself is written to or not. Very weird. I just hate the PCS register. Almost all generations have their own unique behaviors and many are flakey. Arggh... Thanks a lot for all the testing. I'll push it to both -stable and #upstream.
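To make the failure mode concrete, here is a toy model of the behavior described above: PCS reads correctly at first and then reads as zero once the first port has been probed. This is only a sketch, not the attached patch. The `FlakyPcs` class, the helper names, and the bit layout (`PRESENT_SHIFT = 4`) are assumptions for illustration, and "read once" is inferred from the patch name READONCE_PCS.

```python
PRESENT_SHIFT = 4  # assumed position of the "port present" bits, sketch only


class FlakyPcs:
    """Toy register: correct at first, all zero after any port is probed."""

    def __init__(self, initial):
        self._value = initial
        self._probed = False

    def read(self):
        return 0 if self._probed else self._value

    def note_probe(self):
        # Models the controller clearing PCS while a port is being probed.
        self._probed = True


def port_present(pcs, port):
    return bool(pcs & (1 << (PRESENT_SHIFT + port)))


def probe_rereading(reg, nports=2):
    """Re-read PCS before each port: later ports see the cleared value."""
    found = []
    for port in range(nports):
        if port_present(reg.read(), port):
            found.append(port)
        reg.note_probe()
    return found


def probe_read_once(reg, nports=2):
    """Read PCS a single time up front and reuse the cached value."""
    pcs = reg.read()
    found = []
    for port in range(nports):
        if port_present(pcs, port):
            found.append(port)
        reg.note_probe()
    return found


print(probe_rereading(FlakyPcs(0x30)))  # second port is missed
print(probe_read_once(FlakyPcs(0x30)))  # both ports are found
```

In this model the re-reading loop loses the second port in the same way sdb goes missing, while caching the first read sidesteps the cleared register.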
Hello, Steve. A different PCS update made it into 2.6.18-rc4 which may or may not fix your problem. Can you please verify whether 2.6.18-rc4 has the same problem? If so, the READONCE_PCS patch should go in too. Thanks.
Hello. I have experienced two separate issues with the 2.6.18-rc4 kernel. After rebooting a few times, the bug with sdb not being detected still manifested itself. So the READONCE patch may still be needed. A second issue I ran into is that occasionally the system would not detect / at all (kernel panic, cannot mount root). This happened to me twice, both times after rebooting the system by pressing ctrl-alt-del at the console. I do not know if either of these issues is related to soft/hard rebooting the system. But anyway, the bug with the second device not being detected seems to still be present.
I just tried the upstream-greg branch of libata-dev and I get the same problem. Intermittent failure to detect multiple SATA disks on ich5. Setting PIIX_FLAG_IGNORE_PCS on ich5_sata works around the problem. I still see the occasional pcs=0x0 but the driver ignores that value and finds both disks.
None of the fixes has made it into the tree yet. At the moment, the only solution seems to be turning on IGNORE_PCS but I'm a bit afraid that it might cause ghost device detection and accompanying long boot delays on some other ICH5s. Maybe we should just proceed w/ turning on IGNORE_PCS and see what happens. I don't think the danger is too big as we didn't use to honor PCS very well before adding the PCS map code and I don't recall many detection bug reports on ICH5, but it's also possible that polling PIO had masked such problems well. (Polling PIO fails quickly while IDENTIFYing ghost devices, while irq-pio tends to time out after waiting 30 secs.)
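A rough sketch of the trade-off, under stated assumptions: with IGNORE_PCS every port is probed regardless of what PCS says, so a bogus pcs=0x0 no longer hides a disk, but each empty port that gets probed may cost one irq-pio timeout. The function names and the bit layout (`PRESENT_SHIFT = 4`) are hypothetical. The 30 second figure is the one quoted above, and treating it as a fixed per-port cost is a simplification.

```python
PRESENT_SHIFT = 4          # assumed bit layout, for this sketch only
IRQ_PIO_TIMEOUT_SECS = 30  # per-ghost timeout figure quoted in this thread


def ports_to_probe(pcs, nports, ignore_pcs):
    """Which ports get probed, with and without honoring PCS."""
    if ignore_pcs:
        return list(range(nports))
    return [p for p in range(nports) if pcs & (1 << (PRESENT_SHIFT + p))]


def worst_case_ghost_delay(probed, populated):
    """Upper bound on extra boot time from probing ports with no device."""
    ghosts = [p for p in probed if p not in populated]
    return len(ghosts) * IRQ_PIO_TIMEOUT_SECS


# Bogus pcs=0x0 on a machine with disks on both ports:
print(ports_to_probe(0x0, 2, ignore_pcs=False))  # nothing probed, disks missed
print(ports_to_probe(0x0, 2, ignore_pcs=True))   # both ports probed

# Same flag on a machine with only one disk: one ghost probe.
print(worst_case_ghost_delay([0, 1], populated={0}))
```

The second case is the concern raised above: a machine with an empty port pays the timeout on every boot once PCS is ignored.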
Tejun - is this closable now ?
Yeap, it's fixed now.