6724 – multiple SATA devices not detected most of the time

Status:	RESOLVED PATCH_ALREADY_AVAILABLE

Alias:	None

Product:	IO/Storage
Classification:	Unclassified
Component:	Serial ATA
Hardware:	i386 Linux

Importance:	P2 blocking
Assignee:	Tejun Heo

URL:
Keywords:

Depends on:
Blocks:

Reported:	2006-06-20 07:57 UTC by steve
Modified:	2007-06-18 09:24 UTC
CC List:	3 users

See Also:
Kernel Version:	2.6.17
Subsystem:
Regression:	---
Bisected commit-id:

Attachments
dmesg output with properly detected second sata drive (16.26 KB, text/plain) 2006-06-21 07:37 UTC, steve
dmesg output on the same system with the sdb device not detected (15.40 KB, text/plain) 2006-06-21 07:38 UTC, steve
debug message patch (833 bytes, patch) 2006-06-21 08:45 UTC, Tejun Heo
dmesg output with properly detected second sata drive with debug patch (16.22 KB, text/plain) 2006-06-21 10:44 UTC, steve
dmesg output on the same system with the sdb device not detected with debug patch (15.44 KB, text/plain) 2006-06-21 10:45 UTC, steve
ignore PCS on ich5 (447 bytes, patch) 2006-06-22 00:17 UTC, Tejun Heo
add 100ms delay after modifying PCS (507 bytes, patch) 2006-06-24 20:30 UTC, Tejun Heo
ICH5-readonly-PCS (1.98 KB, patch) 2006-06-26 08:18 UTC, Tejun Heo
test_patch0 (2.25 KB, patch) 2006-06-28 05:25 UTC, Tejun Heo
test_patch1 (2.53 KB, patch) 2006-06-28 05:26 UTC, Tejun Heo
this patch fixes the problem - READONCE_PCS-patch (7.95 KB, patch) 2006-06-28 05:27 UTC, Tejun Heo
Failure dmesg from patch 0 (15.74 KB, text/plain) 2006-06-28 08:23 UTC, steve
Failure dmesg from patch 1 (15.48 KB, text/plain) 2006-06-28 08:24 UTC, steve

Description steve 2006-06-20 07:57:16 UTC

Hello.
Hardware is a Dell PowerEdge SC1425. The machine has two SATA hard disks (they
show up as sda and sdb). On 2.6.16.21 and below, everything worked fine. But as
soon as I upgraded to 2.6.17, the second disk (sdb) is only detected maybe 30%
of the time the machine is booted. There is not even a mention of it in dmesg.
The only way to make the machine see sdb is to reboot it enough times, and one
of those times it may work if you are lucky.

This is not a hardware malfunction. I have a rack of these machines here that
all exhibit this behavior. This is definitely dependent on the kernel version:
2.6.16.21 consistently works and 2.6.17 consistently does not.

This is the SATA controller in the machines (from lspci):
00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)

I am using the libata driver in the kernel.
Please try to fix this ASAP; this is a critical bug for us.

Comment 1 steve 2006-06-21 07:37:46 UTC

Created attachment 8363
dmesg output with properly detected second sata drive

Comment 2 steve 2006-06-21 07:38:34 UTC

Created attachment 8364
dmesg output on the same system with the sdb device not detected

Comment 3 steve 2006-06-21 07:38:51 UTC

I posted some dmesg output. Both are from the same machine, taken one reboot
apart. Notice how one dmesg output says "ata2: Sata port has no device" and the
other one shows the device being detected the next time around.
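
The comparison between the two boots can be automated. The following is a minimal Python sketch, assuming the failure message reads as quoted in the comment above; the exact wording and capitalization in the attached dmesg files are not shown on this page, so the match is case-insensitive. `ports_without_device` is a hypothetical helper name.

```python
import re

# Pattern based on the message quoted in the comment; the real dmesg
# wording may differ (assumption), hence the case-insensitive match.
NO_DEVICE = re.compile(r"\b(ata\d+): sata port has no device", re.IGNORECASE)


def ports_without_device(dmesg_text):
    """Return the sorted ataN port names that logged a 'no device' message."""
    return sorted({m.group(1).lower() for m in NO_DEVICE.finditer(dmesg_text)})


# The only sample line used here is the one quoted in the comment.
print(ports_without_device("ata2: Sata port has no device"))  # ['ata2']
```

Running this over the dmesg of each boot would show which boots lost the second port without reading the logs by hand.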

Comment 4 Tejun Heo 2006-06-21 08:44:19 UTC

Argh... Maybe this one too has flakey PCS.  Can you please

* post the result of 'lspci -n'
* apply the 'debug' patch I'm attaching and report the detected/failed boot
messages?

Thanks.

Comment 5 Tejun Heo 2006-06-21 08:45:01 UTC

Created attachment 8365
debug message patch

Comment 6 steve 2006-06-21 10:44:43 UTC

Created attachment 8366
dmesg output with properly detected second sata drive with debug patch

Comment 7 steve 2006-06-21 10:45:10 UTC

Created attachment 8367
dmesg output on the same system with the sdb device not detected with debug patch

Comment 8 steve 2006-06-21 10:47:40 UTC

Here is lspci -n:
00:00.0 0600: 8086:3590 (rev 09)
00:02.0 0604: 8086:3595 (rev 09)
00:1d.0 0c03: 8086:24d2 (rev 02)
00:1d.1 0c03: 8086:24d4 (rev 02)
00:1d.7 0c03: 8086:24dd (rev 02)
00:1e.0 0604: 8086:244e (rev c2)
00:1f.0 0601: 8086:24d0 (rev 02)
00:1f.1 0101: 8086:24db (rev 02)
00:1f.2 0101: 8086:24d1 (rev 02)
01:00.0 0604: 8086:0329 (rev 09)
01:00.1 0800: 8086:0326 (rev 09)
01:00.2 0604: 8086:032a (rev 09)
01:00.3 0800: 8086:0327 (rev 09)
02:04.0 0200: 8086:1076 (rev 05)
03:07.0 0200: 8086:1026 (rev 04)
04:03.0 0200: 8086:1076 (rev 05)
04:0d.0 0300: 1002:5159
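
In the listing above, the SATA controller is the entry at slot 00:1f.2 (the slot named in the `lspci` line of the description), which gives the numeric ID 8086:24d1. A minimal Python sketch for picking a device out of `lspci -n` output by ID follows; the regular expression and the helper name `find_devices` are illustrative assumptions, and the sample is a subset of the lines posted above.

```python
import re

# One line of `lspci -n` output in the form shown in this comment:
# slot, class code, vendor:device, optional revision.
LSPCI_N_LINE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<cls>[0-9a-f]{4}):\s+"
    r"(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})"
    r"(?:\s+\(rev (?P<rev>[0-9a-f]{2})\))?"
)


def find_devices(lspci_n_output, vendor, device):
    """Return the slots whose vendor:device ID matches."""
    slots = []
    for line in lspci_n_output.splitlines():
        m = LSPCI_N_LINE.match(line.strip())
        if m and m.group("vendor") == vendor and m.group("device") == device:
            slots.append(m.group("slot"))
    return slots


# Three lines copied from the listing above.
SAMPLE = """\
00:1f.1 0101: 8086:24db (rev 02)
00:1f.2 0101: 8086:24d1 (rev 02)
04:0d.0 0300: 1002:5159
"""

print(find_devices(SAMPLE, "8086", "24d1"))  # ['00:1f.2']
```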

Comment 9 Tejun Heo 2006-06-22 00:17:01 UTC

Created attachment 8375
ignore PCS on ich5

Can you try this patch?

Comment 10 steve 2006-06-23 21:36:45 UTC

Patch did the trick, thanks!
Any chance of this making it into the mainline?

Comment 11 Tejun Heo 2006-06-23 21:50:58 UTC

I'll push it to Jeff.

Comment 12 Tejun Heo 2006-06-24 20:30:36 UTC

Created attachment 8412
add 100ms delay after modifying PCS

As the first fix was a bit drastic, I'm trying to find another less-intrusive
fix for the problem.  Can you try the attached patch and see how it works
(apply it to the vanilla 2.6.17 w/o any patch)?  If this doesn't work, I have
one more thing to try and it would be great if you can test that one too.

Thanks.

Comment 13 steve 2006-06-26 08:03:26 UTC

Hello.
The second patch (with the 100ms delay) did not work unfortunately. Applied it
to all the machines and some still will not see sdb sometimes. The first patch
(disable PCS) worked when I had it, however.

Comment 14 Tejun Heo 2006-06-26 08:18:39 UTC

Created attachment 8419
ICH5-readonly-PCS

Thanks for your testing.  Can you try just one more patch?  If this one works,
I'll push it upstream.

Comment 15 steve 2006-06-27 06:32:03 UTC

No, ICH5-readonly-PCS did not help either

Comment 16 Tejun Heo 2006-06-28 05:25:59 UTC

Created attachment 8433
test_patch0

Comment 17 Tejun Heo 2006-06-28 05:26:39 UTC

Created attachment 8434
test_patch1

Comment 18 Tejun Heo 2006-06-28 05:27:28 UTC

Created attachment 8435
this patch fixes the problem - READONCE_PCS-patch

Comment 19 Tejun Heo 2006-06-28 05:33:16 UTC

Hello, again.

I've just attached three patches.  The PCS register causes several different
problems depending on specific chipset and configuration, so I want to be sure
about the problem and I wanna avoid simply setting IGNORE_PCS on all ICH5s.  And
your PCS seems to report the correct value at first and then becomes all zero
for some reason.  So, please bear with me.

Each patch implements different fixes and contains debug messages.  I think I'll
be able to collect enough data with these three patches.  I need to know...

* which one works
* failed boot dmesg of not-working ones

Thanks a lot.

Comment 20 steve 2006-06-28 08:23:42 UTC

Created attachment 8436
Failure dmesg from patch 0

Comment 21 steve 2006-06-28 08:24:23 UTC

Created attachment 8437
Failure dmesg from patch 1

Comment 22 steve 2006-06-28 08:31:27 UTC

Patch 2 seems to be working for now. I rebooted a few times and the disk appears
to come back up. Thank you. Do you think this will be put into the mainline?

Comment 23 Tejun Heo 2006-06-28 08:42:26 UTC

Yeah, I think the third one is safe enough for the mainline.  It seems that your
controller clears PCS while devices on the first port are probed whether PCS
itself is written to or not.  Very weird.  I just hate the PCS register.  Almost
all generations have their own unique behaviors and many are flakey.  Arggh...

Thanks a lot for all the testing.  I'll push it to both -stable and #upstream.
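
The behavior described in this comment, and why reading PCS only once avoids it, can be illustrated with a toy model. This is a Python sketch under stated assumptions and not the READONCE_PCS patch, whose contents are not shown on this page. The bit layout (bit n set means port n has a device) is hypothetical and is not the real register layout, and the class and function names are made up for the illustration.

```python
class FlakyPcs:
    """Toy register: reads correctly until port 0 is probed, then reads 0."""

    def __init__(self, value):
        self.value = value

    def read(self):
        return self.value

    def on_probe(self, port):
        # Models the reported quirk: probing the first port clears PCS.
        if port == 0:
            self.value = 0


def detect_rereading(pcs, n_ports):
    """Re-read PCS before each port; later ports see the cleared value."""
    found = []
    for port in range(n_ports):
        if pcs.read() & (1 << port):
            found.append(port)
            pcs.on_probe(port)
    return found


def detect_read_once(pcs, n_ports):
    """Read PCS once up front and use that snapshot for every port."""
    snapshot = pcs.read()
    found = []
    for port in range(n_ports):
        if snapshot & (1 << port):
            found.append(port)
            pcs.on_probe(port)
    return found


print(detect_rereading(FlakyPcs(0b11), 2))  # [0]
print(detect_read_once(FlakyPcs(0b11), 2))  # [0, 1]
```

In the model, re-reading loses the second port as soon as the first one has been probed, which matches the missing sdb, while the snapshot keeps both.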

Comment 24 Tejun Heo 2006-08-09 03:12:01 UTC

Hello, Steve.

A different PCS update made it into 2.6.18-rc4 which may or may not fix your
problem.  Can you please verify whether 2.6.18-rc4 has the same problem?  If so,
the READONCE_PCS patch should go in too.

Thanks.

Comment 25 steve 2006-08-09 09:36:38 UTC

Hello.

I have experienced two separate issues with the 2.6.18-rc4 kernel. After
rebooting a few times, the bug with sdb not being detected still manifested
itself. So the READONCE patch may still be needed.

A second issue I ran into is that occasionally the system would not detect / at
all (kernel panic, cannot mount root). This happened to me twice, both times
after rebooting the system by pressing ctrl-alt-del at the console. I do not
know if either of these issues is related to soft/hard rebooting the system.

But anyway, the bug with the second device not being detected seems to still be
present.

Comment 26 Keith Owens 2006-08-13 22:49:07 UTC

I just tried the upstream-greg branch of libata-dev and I get the same problem.
Intermittent failure to detect multiple SATA disks on ich5.  Setting 
PIIX_FLAG_IGNORE_PCS on ich5_sata works around the problem.  I still see the
occasional pcs=0x0 but the driver ignores that value and finds both disks.
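
The effect of the workaround can be sketched as a small Python model. The flag name below echoes PIIX_FLAG_IGNORE_PCS from the comment, but its value, the bit layout (bit n set means port n has a device), and the function `ports_to_probe` are assumptions made for illustration and are not taken from the driver.

```python
IGNORE_PCS = 1 << 0  # hypothetical flag value, for this model only


def ports_to_probe(pcs_value, n_ports, flags=0):
    """Return the ports a driver would probe in this model.

    With IGNORE_PCS set, the PCS value is not consulted and every port is
    probed; otherwise only ports whose bit is set in PCS are probed.
    """
    if flags & IGNORE_PCS:
        return list(range(n_ports))
    return [port for port in range(n_ports) if pcs_value & (1 << port)]


# A spurious pcs=0x0 hides both disks unless the value is ignored.
print(ports_to_probe(0x0, 2))              # []
print(ports_to_probe(0x0, 2, IGNORE_PCS))  # [0, 1]
```

The trade-off in the model is that an empty port is probed as well once the flag is set.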

Comment 27 Tejun Heo 2006-08-13 23:00:15 UTC

None of the fixes has made it into the tree yet.  At the moment, the only solution
seems to be turning on IGNORE_PCS but I'm a bit afraid that it might cause ghost
device detection and accompanying long boot delays on some other ICH5s.  Maybe
we should just proceed w/ turning on IGNORE_PCS and see what happens.

I don't think the danger is too big as we didn't use to honor PCS very well
before adding PCS map code and I don't recall many detection bug reports on
ICH5, but it's also possible that polling PIO had masked such problems well.
(Polling PIO fails quickly while IDENTIFYing ghost devices, while irq-pio tends
to time out after waiting 30 secs.)

Comment 28 Alan 2007-06-18 08:32:15 UTC

Tejun - is this closable now ?

Comment 29 Tejun Heo 2007-06-18 09:24:32 UTC

Yeap, it's fixed now.
