Bug 6758 - HD not detected
Summary: HD not detected
Status: CLOSED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: i386 Linux
Importance: P2 normal
Assignee: Tejun Heo
 
Reported: 2006-06-27 17:26 UTC by Vincent Legoll
Modified: 2006-09-11 16:53 UTC

Kernel Version: 2.6.17-git12 (vanilla from kernel.org)
Regression: ---

Description Vincent Legoll 2006-06-27 17:26:52 UTC
Most recent kernel where this bug did not occur: not known
Distribution: Gentoo
Hardware Environment: x86_64, Seagate HD
Software Environment:
Problem Description: Seagate HD not detected

Steps to reproduce:

I recently added a Seagate HD (500 GB SATA 7200.10) to my PC.
It already had a ST3120026AS and a WD1500ADFD-0, both of which were running
properly and still are.

The drive is visible in the BIOS.

The Linux kernel does not detect this drive, whereas the other OS does (and it
performs very well there, ~48 MB/s writing). Detection fails with this in dmesg:

ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)

I tried some random tweaking of libata-core.c:__libata_phy_reset(),
lengthening the timeouts and delays, because some googling suggested that this
papers over similar reports of Seagate drives not being detected...

But it does not work. I now get this:

Jun 28 04:10:04 localhost ata1: exception Emask 0x10 SAct 0x0 SErr 0x150000 action 0x2 frozen
Jun 28 04:10:05 localhost ata1: soft resetting port
Jun 28 04:10:12 localhost ata1: port is slow to respond, please be patient
Jun 28 04:10:35 localhost ata1: port failed to respond (30 secs)
Jun 28 04:10:35 localhost ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 28 04:10:35 localhost ata1: EH pending after completion, repeating EH (cnt=4)
Jun 28 04:10:35 localhost ata1: EH complete

but this only shows up about 8 minutes after the system has booted...

Ask for more info and I'll provide it...

I'm currently compiling -mm to see if it gives better results...
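
The SErr value in the exception line above is a bit mask. As a rough aid for reading it, here is a minimal Python sketch that names the bits that are set. The bit names follow the SERR_* constants in the kernel's include/linux/libata.h; the table and the helper name decode_serr are illustrative only (they are not part of libata), and only the diagnostic half of the register (bits 16 to 26) is covered.

```python
# Diagnostic bits of the SATA SError register, named after the SERR_*
# constants in include/linux/libata.h. This table is an illustration,
# not kernel code.
SERR_DIAG_BITS = {
    16: "PhyRdy changed",
    17: "PHY internal error",
    18: "Comm wake",
    19: "10b to 8b decode error",
    20: "Disparity error",
    21: "CRC error",
    22: "Handshake error",
    23: "Link sequence error",
    24: "Transport state transition error",
    25: "Unrecognized FIS",
    26: "Device exchanged",
}


def decode_serr(value):
    """Return the names of the diagnostic bits set in an SErr value."""
    return [name for bit, name in sorted(SERR_DIAG_BITS.items())
            if value & (1 << bit)]


# Value copied from the "exception Emask 0x10 ... SErr 0x150000" line.
print(decode_serr(0x150000))
# ['PhyRdy changed', 'Comm wake', 'Disparity error']
```

All three bits describe link-level conditions, which may point at the connection between controller and drive rather than at a command the drive rejected.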
Comment 1 Vincent Legoll 2006-06-27 17:54:01 UTC
Here is the dmesg extract from 2.6.17-mm3. The drive is still not detected, but the
"port is slow to respond" messages are gone...

[...]
libata version 1.30 loaded.
sata_nv 0000:00:07.0: version 0.9
ACPI: PCI Interrupt Link [APSI] enabled at IRQ 22
GSI 18 sharing vector 0xE9 and IRQ 18
ACPI: PCI Interrupt 0000:00:07.0[A] -> Link [APSI] -> GSI 22 (level, low) -> IRQ 233
PCI: Setting latency timer of device 0000:00:07.0 to 64
ata1: SATA max UDMA/133 cmd 0x9F0 ctl 0xBF2 bmdma 0xD800 irq 233
ata2: SATA max UDMA/133 cmd 0x970 ctl 0xB72 bmdma 0xD808 irq 233
scsi0 : sata_nv
ata1: SATA link down (SStatus 0 SControl 300)
scsi1 : sata_nv
ieee1394: Host added: ID:BUS[0-01:1023]  GUID[00301bb9000047cc]
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: configured for UDMA/133
  Vendor: ATA       Model: WDC WD1500ADFD-0  Rev: 19.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
ACPI: PCI Interrupt Link [APSJ] enabled at IRQ 21
GSI 19 sharing vector 0x32 and IRQ 19
ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [APSJ] -> GSI 21 (level, low) -> IRQ 50
PCI: Setting latency timer of device 0000:00:08.0 to 64
ata3: SATA max UDMA/133 cmd 0x9E0 ctl 0xBE2 bmdma 0xC400 irq 50
ata4: SATA max UDMA/133 cmd 0x960 ctl 0xB62 bmdma 0xC408 irq 50
scsi2 : sata_nv
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
scsi3 : sata_nv
ata4: SATA link down (SStatus 0 SControl 300)
  Vendor: ATA       Model: ST3120026AS       Rev: 3.05
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 293046768 512-byte hdwr sectors (150040 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 293046768 512-byte hdwr sectors (150040 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
sd 1:0:0:0: Attached scsi disk sda
SCSI device sdb: 234441648 512-byte hdwr sectors (120034 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
SCSI device sdb: 234441648 512-byte hdwr sectors (120034 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
 sdb: sdb1 < sdb5 sdb6 > sdb2 sdb3 sdb4
sd 2:0:0:0: Attached scsi disk sdb
sd 1:0:0:0: Attached scsi generic sg0 type 0
sd 2:0:0:0: Attached scsi generic sg1 type 0
ieee1394: raw1394: /dev/raw1394 device initialized
[...]

Is there anything I can test or do to help narrow this down?
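
For readers comparing the "link up" and "link down" lines in the extract above: libata prints SStatus in hexadecimal, and the register packs three 4-bit fields (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) as defined by the SATA specification. The sketch below splits a printed value into those fields. The function name and the short description tables are assumptions made for illustration, and the tables only list the common values.

```python
# Common field values of the SATA SStatus register. The tables are
# deliberately incomplete; unknown values are reported as "reserved/other".
DET = {0: "no device detected, no PHY communication",
       1: "device detected, PHY communication not established",
       3: "device detected, PHY communication established",
       4: "PHY offline"}
SPD = {0: "no speed negotiated", 1: "Gen1 (1.5 Gbps)", 2: "Gen2 (3.0 Gbps)"}
IPM = {0: "device not present or no communication",
       1: "active", 2: "partial", 6: "slumber"}


def decode_sstatus(text):
    """Decode an SStatus value as printed by libata (hex, no 0x prefix)."""
    value = int(text, 16)
    det, spd, ipm = value & 0xF, (value >> 4) & 0xF, (value >> 8) & 0xF
    return {
        "det": DET.get(det, "reserved/other"),
        "spd": SPD.get(spd, "reserved/other"),
        "ipm": IPM.get(ipm, "reserved/other"),
    }


print(decode_sstatus("113"))  # ata2 and ata3: link up
print(decode_sstatus("0"))    # ata1 and ata4: link down
```

Read this way, "SStatus 0" on ata1 means the controller saw no device at all on that port at probe time.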
Comment 2 Tejun Heo 2006-06-28 06:02:43 UTC
Hello,

Can you post the result of 'lspci -n'?  Also, please try to connect the drive to
other ports - ata4 or swap with ata3 - and see what happens.
Comment 3 Vincent Legoll 2006-06-28 14:07:10 UTC
Hello Tejun,
please ignore my noise,
this is probably not a bug...

I solved it in the BIOS, by disabling the IDE controller:
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev a2)
That controller was probably interfering with the first SATA port, the one this new
drive is plugged into.
The problem somehow didn't show up with the 2 other SATA disks I already had in
there, which is a bit strange...

Would it be possible to detect this kind of misconfiguration and display a more
informative message, something like "Try tweaking BIOS IDE related settings", to
help people who hit the same problem?

Maybe it could be considered a bug that Linux cannot use the drive when the BIOS
is badly configured; as I said, the other OS did work around the
broken setup... Maybe they have better HW information than you have...

Here is the lspci -n output

00:00.0 0580: 10de:005e (rev a3)
00:01.0 0601: 10de:0050 (rev a3)
00:01.1 0c05: 10de:0052 (rev a2)
00:02.0 0c03: 10de:005a (rev a2)
00:02.1 0c03: 10de:005b (rev a3)
00:06.0 0101: 10de:0053 (rev a2)
00:07.0 0101: 10de:0054 (rev a3)
00:08.0 0101: 10de:0055 (rev a3)
00:09.0 0604: 10de:005c (rev a2)
00:0a.0 0680: 10de:0057 (rev a3)
00:0b.0 0604: 10de:005d (rev a3)
00:0c.0 0604: 10de:005d (rev a3)
00:0d.0 0604: 10de:005d (rev a3)
00:0e.0 0604: 10de:005d (rev a3)
00:18.0 0600: 1022:1100
00:18.1 0600: 1022:1101
00:18.2 0600: 1022:1102
00:18.3 0600: 1022:1103
01:00.0 0300: 1002:554d
01:00.1 0380: 1002:556d
05:06.0 0401: 1412:1724 (rev 01)
05:07.0 0c00: 1106:3044 (rev 80)

I'll try to hook that disk to another port later, maybe saturday, because I want
to see the performance impact of separating it from the primary disk's (WDC)
controller.

If you want to close the bug, I've no objection.
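
A note on reading the `lspci -n` listing above: the second column is the PCI class code, and 0101 is the code for an IDE interface (mass storage, subclass IDE). The Python sketch below parses a few lines copied from that listing and picks out the functions that report that class. The helper names are hypothetical and the regular expression only handles the plain `lspci -n` layout shown here.

```python
import re

# A few lines copied from the `lspci -n` listing in this comment.
LSPCI_N = """\
00:02.0 0c03: 10de:005a (rev a2)
00:06.0 0101: 10de:0053 (rev a2)
00:07.0 0101: 10de:0054 (rev a3)
00:08.0 0101: 10de:0055 (rev a3)
00:09.0 0604: 10de:005c (rev a2)
"""

# slot, class code, vendor id, device id
LINE = re.compile(r"^(\S+) ([0-9a-f]{4}): ([0-9a-f]{4}):([0-9a-f]{4})")


def parse_lspci_n(text):
    """Turn `lspci -n` output into a list of dicts, skipping other lines."""
    devices = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            slot, pci_class, vendor, device = match.groups()
            devices.append({"slot": slot, "class": pci_class,
                            "vendor": vendor, "device": device})
    return devices


def slots_with_class(devices, pci_class):
    """Return the slots of all devices reporting the given class code."""
    return [d["slot"] for d in devices if d["class"] == pci_class]


devices = parse_lspci_n(LSPCI_N)
print(slots_with_class(devices, "0101"))
# ['00:06.0', '00:07.0', '00:08.0']
```

So the CK804 IDE function at 00:06.0 and the two sata_nv functions at 00:07.0 and 00:08.0 all present themselves as IDE-class controllers on this board.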
Comment 4 Tejun Heo 2006-06-28 14:23:38 UTC
Yeah, the IDE and SATA controllers are probably sharing the primary/secondary IDE
interfaces, with two allocated to IDE and two to SATA.  Boards w/ Intel chipsets do
similar stuff.  The reason the other OS (Windows, I presume) can use all
four ports is probably that it uses the better interface (ADMA) to program
the controller, which doesn't have the legacy IDE limitations.  There's an alpha version
of the ADMA sata_nv driver in the libata-dev tree, but currently it seems to lack momentum.

Anyways, I'm closing the bug.
Comment 5 Vincent Legoll 2006-06-28 14:33:51 UTC
Maybe I jumped to conclusions too quickly...

The first fdisk -l showed the wrong partition type, an "SFS" that I've never seen
anywhere, but I mounted it nevertheless, knowing (or thinking) it was NTFS...

localhost ~ # fdisk -l /dev/sda

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1       60801   488384001   42  SFS

localhost ~ # mount -t ntfs /dev/sda1 /mnt/cdrom
localhost ~ # cd /mnt/cdrom/
localhost cdrom # l
total 0
dr-x------ 1 root root 0 Jun 27 20:40 System Volume Information/

I left that disk alone, idle, with the NTFS partition on it mounted (rw, btw, d'oh...)

and later wanted to "fdisk -l" it again. That command hung in D+ state, and so did
an "ls -l   /mount/point". After a while, they exited with:

localhost ~ # ls /mnt/cdrom/
ls: reading directory /mnt/cdrom/: Input/output error

or with nothing at all in the case of fdisk, and dmesg showed this:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x180000 action 0x2 frozen
ata1.00: (BMDMA stat 0x21)
ata1.00: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)
ata1: soft resetting port
ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ATA: abnormal status 0xD0 on port 0x9F7
ATA: abnormal status 0xD0 on port 0x9F7
ATA: abnormal status 0xD0 on port 0x9F7
ATA: abnormal status 0xD0 on port 0x9F7
ATA: abnormal status 0xD0 on port 0x9F7
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link down (SStatus 100 SControl 300)
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link down (SStatus 100 SControl 300)
ata1.00: disabled
ata1: EH pending after completion, repeating EH (cnt=4)
ata1: soft resetting port
ata1: SATA link down (SStatus 100 SControl 300)
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key=0xb
    ASC=0x0 ASCQ=0x0
end_request: I/O error, dev sda, sector 0
Buffer I/O error on device sda, logical block 0
Buffer I/O error on device sda, logical block 1
Buffer I/O error on device sda, logical block 2
Buffer I/O error on device sda, logical block 3
sd 0:0:0:0: rejecting I/O to offline device
ata1: EH complete
ata1.00: detaching (SCSI 0:0:0:0)
sd 0:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sda, sector 6291527
NTFS-fs error (device sda1): ntfs_end_buffer_async_read(): Buffer I/O error, logical block 0x600008.
NTFS-fs error (device sda1): ntfs_end_buffer_async_read(): Buffer I/O error, logical block 0x600009.
NTFS-fs error (device sda1): ntfs_end_buffer_async_read(): Buffer I/O error, logical block 0x60000a.
NTFS-fs error (device sda1): ntfs_end_buffer_async_read(): Buffer I/O error, logical block 0x60000b.
NTFS-fs error (device sda1): ntfs_end_buffer_async_read(): Buffer I/O error, logical block 0x60000c.
NTFS-fs error (device sda1): ntfs_end_buffer_async_read(): Buffer I/O error, logical block 0x60000d.
scsi 0:0:0:0: rejecting I/O to dead device
Reducing readahead size to 32K
scsi 0:0:0:0: rejecting I/O to dead device
ata1: soft resetting port
ata1: SATA link down (SStatus 100 SControl 300)
ata1: EH pending after completion, repeating EH (cnt=4)
ata1: EH complete

with the last 4 lines repeating over and over...
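
The "stat 0x40" and "abnormal status 0xD0" values in the log above are ATA status register bytes. As a reading aid, here is a small Python sketch that names the bits, using the classic ATA status bit layout (BSY 0x80, DRDY 0x40, DF 0x20, DSC 0x10, DRQ 0x08, CORR 0x04, IDX 0x02, ERR 0x01). The helper name is hypothetical, and some of the low bits have been renamed or retired in later ATA revisions, so treat the labels as an assumption about this era of hardware.

```python
# Classic ATA status register bits, highest bit first. Labels for the
# lower bits differ between ATA revisions; these are the older names.
ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # seek complete
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # corrected data
    (0x02, "IDX"),   # index
    (0x01, "ERR"),   # error
]


def decode_ata_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


print(decode_ata_status(0x40))  # ['DRDY']
print(decode_ata_status(0xD0))  # ['BSY', 'DRDY', 'DSC']
```

In 0xD0 the BSY bit is set, and while BSY is set the other bits are not meaningful, so the repeated "abnormal status" lines mostly say that the device never left the busy state.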
Comment 6 Vincent Legoll 2006-06-28 14:43:05 UTC
Does this look like a failing drive or a driver problem?

Should I test it more thoroughly with that other OS to rule out disk failure, or
is that pointless anyway?

I'm sorry, but I'm afraid to test the "adma sata_nv driver in libata-dev tree" because
all of my other disks (system and data) are hooked to the same controller and so
are handled by the same driver... If it were possible to use the vanilla sata_nv
driver for those disks and the ADMA one for the problematic disk, I might try it...
Comment 7 Tejun Heo 2006-06-28 14:45:15 UTC
And I have closed it too soon.  I'll just leave it closed until things become
clearer.  I have the same request - move the problematic disk to another port,
swap it with the working disks and see what happens.  Your last report looks like a
hardware or cabling problem to me, and swapping disks is a good way to track down
such problems.

The ADMA driver is in an early alpha stage.  I don't think using it is a good idea.
Comment 8 Vincent Legoll 2006-06-30 10:14:20 UTC
Argh, I think you were right. I've just plugged it into the cable of a working HD,
and it worked great there. I've not yet tested the other disk on the failing
cable; I will do that next.

Survived:
dd if=/dev/sda1 of=/dev/null bs=64k (^C after 2 minutes, 15 GB read from the HD,
70+ MB/s)
mkfs.ext2 /dev/sda1
bonnie++ on ext2
1.93c,1.93c,localhost,1,1151688993,2G,,584,99,68588,17,30958,8,921,98,72852,11,226.8,5,16,,,,,4412,97,+++++,+++,+++++,+++,4836,98,+++++,+++,16319,97,14725us,229ms,188ms,47224us,35835us,546ms,22464us,8065us,78us,4410us,8021us,7025us

But I'm puzzled: how did Win2k3 manage to work around a HW problem? This does
not sound right to me.

I'll test more combinations...
Comment 9 Vincent Legoll 2006-06-30 10:45:30 UTC
This is starting to get really strange. I left the 7200.10 on the WDC cable and
plugged the WDC into the cable where the 7200.10 failed, and now both seem to
work. So does the 3rd one: I'm currently working all 3 of them hard and
everything goes right...

I think the 2 that are on the same controller are stealing performance from each
other. I have not benchmarked it, but it's visible.
Comment 10 Vincent Legoll 2006-07-05 13:31:18 UTC
Argh, something is definitely wrong with this SATA port, as now it's the
perfectly working WDC drive that gets the EH messages, but strangely, it does not get
the "port is slow" message at boot time...

Is there something I could try, or some info I could send, to get help debugging
this?

Should the bug status change to "reopened"?
Comment 11 Vincent Legoll 2006-07-30 15:08:41 UTC
Just a comment to let others know this really was a HW/cabling problem, as now
with a new cable all 3 drives are working properly...

Thanks Tejun, and sorry to have made you waste time on this issue...
Comment 12 Tejun Heo 2006-07-30 15:18:54 UTC
Great. Thanks for reporting the result. This one is *really* closed now.
Comment 13 Vincent Legoll 2006-09-11 16:53:51 UTC
Just a note to let others know this is looking more and more like a cabling
problem: after a few months I got new SATA problems, which disappeared
as soon as I replaced the SATA cable built into the case with a
probably better one that came with another motherboard, despite the new one being
longer.

Both failing cables so far were ones that came built into the Shuttle SN25p.

Maybe I'll come back here when the third one (still in there) fails...
:-)
