Bug 8421 - sata hotplug removal of drive freezes all 2.6.21 kernels
Summary: sata hotplug removal of drive freezes all 2.6.21 kernels
Status: REJECTED UNREPRODUCIBLE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA (show other bugs)
Hardware: i386 Linux
: P2 high
Assignee: Tejun Heo
URL:
Keywords:
: 6665 7545 8787 (view as bug list)
Depends on:
Blocks:
 
Reported: 2007-05-02 13:02 UTC by Harald Hope
Modified: 2008-09-23 11:13 UTC (History)
7 users (show)

See Also:
Kernel Version: 2.6.21.1
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Patch to fix some ADMA interrupt clearing issues (5.37 KB, patch)
2007-05-05 14:38 UTC, Robert Hancock
Details | Diff
Hotplug debug patch (988 bytes, patch)
2007-05-08 17:31 UTC, Robert Hancock
Details | Diff
smartctl output if of any use for debugging (4.00 KB, text/plain)
2007-05-10 17:25 UTC, Harald Hope
Details
dmesg from 2.6.21.1 boot (23.98 KB, text/plain)
2007-05-10 19:56 UTC, Harald Hope
Details
64 bit nvidia nforce4 data (4.60 KB, application/octet-stream)
2007-07-02 14:40 UTC, Harald Hope
Details
32 bit data, as much as I could collect (1.09 KB, application/octet-stream)
2007-07-02 14:46 UTC, Harald Hope
Details
Some logs this time, and no kernel crash... (30.50 KB, text/x-log)
2007-07-04 22:41 UTC, Harald Hope
Details
kernel log, i810 firewire issue (939.10 KB, text/x-log)
2007-07-05 00:11 UTC, Harald Hope
Details
lspci -v output (7.63 KB, application/octet-stream)
2007-08-07 18:28 UTC, Harald Hope
Details
root version of lspci-v, has some more data (9.47 KB, application/octet-stream)
2007-08-07 19:38 UTC, Harald Hope
Details

Description Harald Hope 2007-05-02 13:02:08 UTC
Most recent kernel where this bug did *NOT* occur: all 2.6.20.x kernels.
Distribution: Debian Sid 32 bit( sidux packaged kernels )
Hardware Environment: amd x2 3800 running in 32 bit mode, nforce4 MSI neo4
platinum mobo, varying sata disks, seagate, western digital.
Software Environment: any, but for this test: standard bash shell to remove any
possibility of gui desktop manager interference.
Problem Description: With all 2.6.21.x kernels, starting with rc versions, esata
and sata hotplug removal fails and freezes kernel instantly. Hotplug attachment
works fine. Kernel freeze always happens on removal of sata drive. I first saw
this with ESATA external drives, but today I tested it with internal sata to
make sure to remove the remote possibility that there was some issue with drive
enclosure hardware.

Freeze occurred as usual when I removed the sata drive cable with the direct
sata connection, so the issue is not related to esata. Also retested on 2.6.20.x
kernels, and those work fine as usual.

This problem never existed with any previous kernel, I've been using esata
backup units for a while now, and no kernel has had this issue until 2.6.21 came
out, and they have all had it.

Sorry for not filing this bug report sooner, I just assumed something so
glaringly obvious would get caught, but I guess it didn't.

Steps to reproduce:
1. turn on machine
2. boot up any 2.6.21 kernel, including the latest 2.6.21.1. I'm using smp
kernels, I guess I could also test uniprocessor kernels to make sure it's always
an issue.
3. plugin sata drive connector cable (or, alternately, just power on esata
external enclosure) to sata drive. With esata, I am just using a physical sata
to esata plug adaptor card, no active circuitry, just physical adaptor. Drive is
seen by system, everything is fine. Mount to test, fine.
4. Once confirmed mountable etc, umount, wait, then unplug sata connector cable
(or turn off esata external enclosure, or removable hard drive tray). 

Kernel crashes instantly, I have not been able to get any error output at all,
although if you tell me where to look I might find some somewhere, but I can't
see anything anywhere. This is a 100% system freeze, have to do hard reset/power
off to get out of it.

Not sure if this helps, but here's the mounting data from bash shell first and
dmesg second:
==========================================
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
==========================================
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100
ata2.00: 625142448 sectors, multi 0: LBA48
ata2.00: configured for UDMA/100
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
scsi 1:0:0:0: Direct-Access     ATA      WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5
ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
 sdd: sdd1
sd 1:0:0:0: Attached scsi disk sdd
============================================

as noted, I can't get any failure output data because the freeze is instant.
Comment 1 Andrew Morton 2007-05-02 17:55:59 UTC
That sounds like an unpleasant regression.

Do we officially support sata hotplug yet?
Comment 2 Harald Hope 2007-05-02 22:45:10 UTC
By a stroke of good fortune, I got another sata box today to test:

Hardware Environment: MSI nforce3 based motherboard, K8N neo v 2.0, nforce3 250
chipset, main hard drive ide, test sata same as other box. This one is
uniprocessor, amd socket 754 single core 3200.

Same operating system and other factors as initial report.

I tested both smp and up kernels, and there were no issues, it works as expected
with both 2.6.20.10 and 2.6.21.1 kernels. No crashes, everything is normal. So
it looks like this is related to the nforce4 chipset specifically.

I tested both smp and up kernels to make sure that it's not some smp thing
spefically.

I made these notes, if they are helpful, from dmesg add and remove of sata
drive, hotplugging:
========================================================
2.6.20.10 uniprocessor sata hotplug install
========================================================
ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ATA-6, max UDMA/100, 625142448 sectors: LBA48
ata1.00: ata1: dev 0 multi count 0
ata1.00: configured for UDMA/100
ata1: EH complete
scsi 0:0:0:0: Direct-Access     ATA      WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
 sda: sda1
sd 0:0:0:0: Attached scsi disk sda
================================================
2.6.20.10 uniprocessor sata hotplug removal
ata1: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen
================================================
ata1: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen
ata1: hard resetting port
ata1: SATA link down (SStatus 0 SControl 300)
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link down (SStatus 0 SControl 300)
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link down (SStatus 0 SControl 300)
ata1.00: disabled
ata1: EH complete
ata1.00: detaching (SCSI 0:0:0:0)
Synchronizing SCSI cache for disk sda:
FAILED
  status = 0, message = 00, host = 4, driver = 00
===========================================================
2.6.21.1 uniprocessor sata hotplug install
===========================================================
ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100
ata1.00: 625142448 sectors, multi 0: LBA48
ata1.00: configured for UDMA/100
ata1: EH complete
scsi 0:0:0:0: Direct-Access     ATA      WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
 sda: sda1
sd 0:0:0:0: Attached scsi disk sda
===========================================================
2.6.21.1 uniprocessor sata hotplug remove
===========================================================
ata1: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen
ata1: hard resetting port
ata1: SATA link down (SStatus 0 SControl 300)
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link down (SStatus 0 SControl 300)
ata1.00: limiting speed to UDMA/100:PIO3
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link down (SStatus 0 SControl 300)
ata1.00: disabled
ata1: EH complete
ata1.00: detaching (SCSI 0:0:0:0)
Synchronizing SCSI cache for disk sda:
FAILED
  status = 0, message = 00, host = 4, driver = 00

==========================================================
2.6.21.1 smp hotplug install
==========================================================
ata1: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100
ata1.00: 625142448 sectors, multi 0: LBA48
ata1.00: configured for UDMA/100
ata1: EH complete
scsi 0:0:0:0: Direct-Access     ATA      WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
 sda: sda1
sd 0:0:0:0: Attached scsi disk sda
=========================================================
2.6.21.1 smp hotplug remove
=========================================================
ata1: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen
ata1: hard resetting port
ata1: SATA link down (SStatus 0 SControl 300)
ata1: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata1: SATA link down (SStatus 0 SControl 300)
ata1.00: limiting speed to UDMA/100:PIO3
ata1: failed to recover some devices, retrying in 5 secs

=======================================

Hope this is useful information for debugging the issue.

Comment 3 Tejun Heo 2007-05-03 03:01:00 UTC
Yes, we do support SATA hotplug officially for some time now actually.  It seems
like a low level driver regression and I'm guessing the controller is not happy
with the way driver is accessing it and just dies while holding the PCI bus thus
leaving libata no chance for fighting.  Harald, does specifying kernel parameter
'sata_nv.adma=0' make any difference?
Comment 4 Harald Hope 2007-05-03 13:47:41 UTC
Using that kernel parameter made it not crash on removing sata drive. Here's the
dmesg output for that.

There is one other thing I forgot to mention, even with the 2.6.20.x kernels,
with the nforce4 chipset, it takes a long time (more than 30 seconds, maybe 60)
for the system to register that the device has been removed, whereas when I used
the kernel parameter you suggested, both install and removal occur almost
immediately on the 2.6.21.1 test kernel.

The test for the delay to detect removal is this on 2.6.20.x kernels:

plug in sata/esata drive
mount it to somewhere
umount it
unplug it
wait 10 seconds, then try to mount it again
system waits for 30-60 seconds before reporting that sdd1 does not exist with
mount command.


Here's the data from the successful test:

test with kernel parameter: sata_nv.adma=0
===================================================
2.6.21.1 smp: sata hotplug install:
===================================================

ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: port is slow to respond, please be patient (Status 0x80)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100
ata2.00: 625142448 sectors, multi 0: LBA48
ata2.00: configured for UDMA/100
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
scsi 1:0:0:0: Direct-Access     ATA      WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
 sdd: sdd1
sd 1:0:0:0: Attached scsi disk sdd
====================================================
2.6.21.1 smp: sata hotplug remove:
====================================================
ata2: exception Emask 0x10 SAct 0x0 SErr 0x1910000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2.00: limiting speed to UDMA/100:PIO3
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2.00: disabled
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
ata2.00: detaching (SCSI 1:0:0:0)
Synchronizing SCSI cache for disk sdd:
FAILED
  status = 0, message = 00, host = 4, driver = 00
Comment 5 Tejun Heo 2007-05-03 23:25:58 UTC
That delay thing is probably something fixed somewhere else regardless of the
adma thing.  Cc'ing Robert Hancock for investigating further why adma error
handling locks up the whole machine.  Robert, this is the second report of adma
EH locking up the whole machine.  What should we do about it?
Comment 6 Robert Hancock 2007-05-04 07:38:56 UTC
The nForce3 system (and the nForce4 with adma=0) is using entirely different
code paths than nForce4 with ADMA enabled, so it's not too surprising that the
behavior may be different or that there may be an ADMA-specific issue.

We should be able to turn on some debug to help figure out if anything is
happening with libata just before the freeze.. Harald, have you tried running in
console mode (not in X) with the console log level set to show all kernel
messages and checked if anything shows up there before the lockup?
Comment 7 Harald Hope 2007-05-04 12:25:14 UTC
I tried setting console log level to show all kernel messages today, but
unfortunately the freeze is instant on removal of sata drive, so there's no
output at all.
Comment 8 Robert Hancock 2007-05-04 17:12:24 UTC
It's a bit unclear how hotplug is supposed to be handled when the controller has
ADMA enabled but is in port-register mode. Normally in this case we just ignore
the ADMA-specific status and just check the non-ADMA status register and the
standard ATA registers to see if we need to do anything. These controllers
provide a hotplug bit in the non-ADMA status register but it's not clear if that
is valid when in ADMA mode. Maybe we still ought to be looking at the ADMA
status register in this case?

There's some code in sata_nv.c that looks like this:

/* if in ATA register mode, use standard ata interrupt handler */
if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
	u8 irq_stat = readb(host->iomap[NV_MMIO_BAR] + NV_INT_STATUS_CK804)
		>> (NV_INT_PORT_SHIFT * i);
	if(ata_tag_valid(ap->active_tag))
		/** NV_INT_DEV indication seems unreliable at times
		    at least in ADMA mode. Force it on always when a
		    command is active, to prevent losing interrupts. */
		irq_stat |= NV_INT_DEV;
	handled += nv_host_intr(ap, irq_stat);
	continue;
}

Try taking the "continue;" line out and see what that does. That will continue
with ADMA status handling even if we are in port-register mode.
Comment 9 Robert Hancock 2007-05-05 14:38:49 UTC
Created attachment 11402 [details]
Patch to fix some ADMA interrupt clearing issues

Please try out the attached patch (also posted to linux-kernel and linux-ide)
and see if it resolves your problem. I did some hotplug testing and while I
didn't see any full system lockup, I did get a timeout on the other port on the
same controller that the hotplugged drive was on. This patch fixes that problem
for me, hotplug appears to work perfectly on my system with it.
Comment 10 Harald Hope 2007-05-05 23:13:36 UTC
I tried new kernel with the patch, the behavior is different slightly, but the
end result is the same, system freezes at some point. It's more variable though,
not always the same, for example, I can remove the disk from the system without
the system instantly freezing sometimes, but parts of it fail and destabilize,
until I have to reboot anyway. Other times the system locks up right away.

I tried with loglevel=7 several times, but there's no output when I remove the
disk, but it didn't freeze completely always, although sometimes it does, it
depends on what I did before.

For example, I turned it on, it is detected, I turned it off, the system did not
freeze completely, but then I tried to turn it on again, and the system froze.
Probably because the drive was not actually removed in kernel I'd guess, but the
kernel didn't do a full freeze.

Another time the system was half frozen, and I started to reboot, and it allowed
to to partially reboot, but then froze about half way, probably I would guess
when it discovered the missing drive or something.

Again, no output at all from kernel logging on removal, same on install.
Comment 11 Robert Hancock 2007-05-05 23:41:51 UTC
That seems rather odd.. Does the same thing happen with hotplugging an internal
drive with the patch?

eSATA is not quite the same as SATA with a different connector, the signalling
voltages are higher (or are supposed to be) to accomodate a longer cable length,
and so really requires a controller designed for it. Using a SATA to eSATA
adapter may have some problems, i.e. the max cable length may be limited.
Comment 12 Harald Hope 2007-05-05 23:50:29 UTC
There hasn't been any issue at all with this until 2.6.21 came out. So I doubt
it's meaningful whether it's esata or sata direct, but I'll check.
Comment 13 Harald Hope 2007-05-06 00:04:46 UTC
Yes, I can confirm it, tested with internal, makes no difference. But I can give
a repeatable error now, it's always the same:

With patch, add sata drive, shows fine, expected kernel messages.

remove drive, no messages at all. The one difference the patch made was that now
I can actually confirm that there are no kernel messages at all on removal,
since it doesn't crash until I try to do something, for example, mount the
removed drive.

Basically the patch seems to have let the system keep functioning until it
notices that the drive is missing for whatever reason, be it a reboot/shutdown,
an attempt to mount the drive it thinks is still there, or what have you. then
the system freezes completely, actually more frozen than before, since even the
cursor stops blinking.

But there is no kernel output at all on removal, so that's obviously got to the
be the problem, the removal event is simply not registering, which it was in the
2.6.20 and earlier kernels.
Comment 14 Robert Hancock 2007-05-06 01:20:21 UTC
Could you post what output you get from 2.6.20 (with ADMA enabled) for the plug
and unplug?

Obviously something is happening on your setup that's not on mine when the drive
gets unplugged.. For it to lock up such that the cursor even stops blinking
(that was on regular VGA text console?) would seem to suggest that something is
making the system very unhappy. I'm not sure what could even do that.

If you could set up to do a git bisect between 2.6.20 and 2.6.21, that would
likely narrow down whatever change is triggering this..
Comment 15 Harald Hope 2007-05-06 02:16:26 UTC
That was a good question. The 2.6.20 kernel almost made me think that it also
was going to show no output for removal, but while I was doing something else
after removing it, the output suddenly appeared, maybe 2 or 3 minutes after I
removed it. I realized that the 2.6.20 has probably always had this delay, I
just never noticed it before because usually I turned off the device and didn't
try to reconnect it right away. So I'd say actually the 2.6.20.x kernels aren't
actually working correctly for nforce4 chipsets. 

Adding notice also seems to take signficantly longer than with nforce3 by the
way, several seconds more.

This is much easier to see for me now because I have an example of properly
functioning add and remove with nforce3, where both add and remove notice appear
pretty much right away.

Further tests of the patched 2.6.21.1 kernel however are erratic, this time it
simply froze as it did before when I removed the sata drive. It's highly
unpredictable at this point.

=============================================
loglevel=7 sata_nv.adma=1 2.6.20.7 smp kernel: install 
=============================================
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: soft resetting port
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: ATA-6, max UDMA/100, 625142448 sectors: LBA48
ata2.00: ata2: dev 0 multi count 0
ata2.00: configured for UDMA/100
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
scsi 1:0:0:0: Direct-Access     ATA      WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5
ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
 sdd: sdd1
 sd 1:0:0:0: Attached scsi disk sdd
 ===========================================
 remove sata, delay of maybe 2 to 3 minutes for
any output at all to show from kernel
 ===========================================
pnp: Device 00:09 disabled.
ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status
0x402
ata2: CPB 0: ctl_flags 0xd, resp_flags 0x0
ata2: Resetting port
ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0x2 frozen
ata2.00: cmd 25/00:08:3f:d6:42/00:00:25:00:00/e0 tag 0 cdb 0x0 data 4096 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
ata2: soft resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2.00: disabled
sd 1:0:0:0: SCSI error: return code = 0x08000002
sdd: Current [descriptor]: sense key: Aborted Command
Additional sense: No additional sense information
Descriptor sense data with sense descriptors (in hex):
72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
00 00 00 00
end_request: I/O error, dev sdd, sector 625137215
Buffer I/O error on device sdd1, logical block 312568576
Buffer I/O error on device sdd1, logical block 312568577
Buffer I/O error on device sdd1, logical block 312568578
Buffer I/O error on device sdd1, logical block 312568579
ata2: EH complete
ata2.00: detaching (SCSI 1:0:0:0)
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd1, logical block 312568576
Buffer I/O error on device sdd1, logical block 312568577
Buffer I/O error on device sdd1, logical block 312568578
Buffer I/O error on device sdd1, logical block 312568579
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd1, logical block 312568632
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd1, logical block 312568633
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
Synchronizing SCSI cache for disk sdd:
FAILED
status = 0, message = 00, host = 4, driver = 00
							
Comment 16 Robert Hancock 2007-05-06 10:12:39 UTC
Looking at the code, it seems likely that 2.6.20 ADMA would never notice the
drive was gone until something tried to access it, because we didn't have the
hotplug interrupt enable bits set. So a bisect likely won't help much since it
will likely just point here:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ce0cf6fafd02fb4c43fc1a1bee6069d6c0a36b1

which fixed one problem, but apparently in your case created another.

Trying "nmi_watchdog=1 idle=poll" on the kernel command line might get you some
useful output if the NMI watchdog can catch it. Otherwise I'll see if I can whip
up a debug patch so we can get some idea of what's going on.
Comment 17 Harald Hope 2007-05-06 14:15:23 UTC
Those kernel options let me get error output on 2.6.21.1 kernels, with and
without the adma patch, which didn't produce any different output, so I'm just
posting the data from the unpatched 2.6.21.1 smp kernel:

=================================================
2.6.21.1-2 no adma patch: sata drive install
added boot options: nmi_watchdog=1 loglevel=7 idle=poll
Note: install notification was almost instant with these settings
=================================================
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100
ata2.00: 625142448 sectors, multi 0: LBA48
ata2.00: configured for UDMA/100
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
scsi 1:0:0:0: Direct-Access     ATA      WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5
ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
 sdd: sdd1
sd 1:0:0:0: Attached scsi disk sdd

=================================================
sata removal
Note: did not crash on removal
Note: ls /dev after removal continues to show sdd and sdd1, the new drive
Note: to trigger detection: mount /dev/sdd1 /media/sata takes about 25 seconds
before producing this output
=================================================
ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1581000 status
0x402 next cpb count 0x0 next cpb idx 
0x0
ata2: CPB 0: ctl_flags 0xd, resp_flags 0x0
ata2: timeout waiting for ADMA IDLE, stat=0x402
ata2: timeout waiting for ADMA LEGACY, stat=0x402
ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0x2 frozen
ata2.00: cmd 25/00:08:3f:d6:42/00:00:25:00:00/e0 tag 0 cdb 0x0 data 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2.00: limiting speed to UDMA/100:PIO3
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2.00: disabled
ata2: EH pending after completion, repeating EH (cnt=4)
sd 1:0:0:0: SCSI error: return code = 0x08000002
sdd: Current [descriptor]: sense key: Aborted Command
    Additional sense: No additional sense information
Descriptor sense data with sense descriptors (in hex):
        72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
        00 00 00 00
end_request: I/O error, dev sdd, sector 625137215
Buffer I/O error on device sdd1, logical block 312568576
Buffer I/O error on device sdd1, logical block 312568577
Buffer I/O error on device sdd1, logical block 312568578
Buffer I/O error on device sdd1, logical block 312568579
ata2: EH complete
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd1, logical block 312568576
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd1, logical block 312568577
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd1, logical block 312568578
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd1, logical block 312568579
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd1, logical block 0
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd1, logical block 1
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
sd 1:0:0:0: rejecting I/O to offline device
ata2.00: detaching (SCSI 1:0:0:0)
Synchronizing SCSI cache for disk sdd:
FAILED
  status = 0, message = 00, host = 4, driver = 00
  mount: special device /dev/sdd1 does not exist

Comment 18 Robert Hancock 2007-05-08 17:31:22 UTC
Created attachment 11441 [details]
Hotplug debug patch

Can you try out this debug patch (over top of the previous one) and see if it
gives you any output when the drive is unplugged?
Comment 19 Harald Hope 2007-05-09 12:32:56 UTC
Created new kernel, with both patches. Using first set of kernel options, kernel
gave no output until drive removal, as above. On removal, same as above, 25
second or so delay, then above kernel messages.

Second try, without the kernel options, install message same as first time, on
removal, system did not freeze, but also no remove messages. After waiting a few
minutes, I tried running mount /dev/sdd1 /media/sata to force system to check
drive, and system hung, not completely, I could still type, I could still change
consoles, but I couldn't do  anything, required a hard reboot to get out of it.

The main change was that the system didn't hang on removal of drive, until the
system was asked to use the drive, then it hung again, although it didn't freeze
as completely as it did before, keyboard and mouse still worked (using gpm),
input just wasn't processed.

=================================================
2.6.21.1- with adma patch 1 and 2: sata drive install
added kernel options first try: nmi_watchdog=1 loglevel=7 idle=poll
added kernel options second try: loglevel=7
=================================================
ata2: Non-ADMA hotplug
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: Non-ADMA hotplug
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100
ata2.00: 625142448 sectors, multi 0: LBA48
ata2.00: configured for UDMA/100
ata2: EH pending after completion, repeating EH (cnt=4)
ata2: EH complete
scsi 1:0:0:0: Direct-Access     ATA      WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5
ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
 sdd: sdd1
sd 1:0:0:0: Attached scsi disk sdd
Comment 20 Robert Hancock 2007-05-09 16:25:49 UTC
Are you waiting a minute or so after it appearss to hang to see if it unseizes?
It could be something was causing a timeout.

I'm at a bit of a loss to explain how you aren't seeing a hot-unplug interrupt
get generated, given that it works on my machine..

You could try some commands after unmounting and before unplugging the drive to
cause it to be in either ADMA or port-register mode, and see if that makes a
difference. Try:

smartctl -d ata -a /dev/sdd

to put it in port-register mode, or

sdparm --command=sync /dev/sdd

for ADMA mode.
Comment 21 Harald Hope 2007-05-10 17:25:15 UTC
Created attachment 11472 [details]
smartctl output if of any use for debugging

This is probably not important information but I thought I'd provide it anyway,
it's the output of running the smartctl command listed prior to removing sata
drive.
Comment 22 Harald Hope 2007-05-10 17:31:09 UTC
I ran further tests with the two commands listed:

With kernel 2.6.21.1 smp, with both patches installed, first I plugged in sata
drive, then ran the smartctl command, which resulted in this data:

=================================================
2.6.21.1- with adma patch 1 and 2: sata drive install
added boot options: loglevel=7
=================================================
ata2: Non-ADMA hotplug
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100
ata2.00: 625142448 sectors, multi 0: LBA48
ata2.00: configured for UDMA/100
ata2: EH complete
scsi 1:0:0:0: Direct-Access     ATA      WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5
ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB)
sdd: Write Protect is off
SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO
or FUA
 sdd: sdd1
sd 1:0:0:0: Attached scsi disk sdd
=================================================
smartctl -d ata -a /dev/sdd output:
=================================================
see attached file: smartctl-output
=================================================
sata removal
Note: does not freeze on removal
after running smartctl command, when I unplugged the drive, I got this
information fairly quickly, the first notification of hotplug removal a second
or two after unplugging drive.
=================================================
 ata2: Non-ADMA hotplug
ata2: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0x2 frozen
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2.00: limiting speed to UDMA/100:PIO3
ata2: failed to recover some devices, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 300)
ata2.00: disabled
ata2: EH complete
ata2.00: detaching (SCSI 1:0:0:0)
Synchronizing SCSI cache for disk sdd:
FAILED
  status = 0, message = 00, host = 4, driver = 00

======================================================

Next, I tried the sdparm command, after rebooting, hotplugging the drive, before
removing drive, but unfortunately that pretty much froze up the system when I
removed the sata drive. No output, I waited a few minutes, but nothing. Not a
full freeze, I was able to switch between tty consoles, and I was even able to
shutdown nano in my second tty, su to root, and issue the reboot command, but it
hung completely as soon as the text 'system going down for reboot now' appeared.

Also, I was not able to do anything in the initial root login, the one that hung
after I removed the drive, and I couldn't create a new tty session, if that's
relevant.

But I was able to still type and work in nano as regular user, save, etc. 
Comment 23 Robert Hancock 2007-05-10 19:17:18 UTC
Can you post your full dmesg output from bootup with 2.6.21? It would appear
that you're not getting any hot-unplug interrupt when in ADMA mode, but I do on
my setup, so it would seem there must be some difference responsible for that..
Comment 24 Harald Hope 2007-05-10 19:56:00 UTC
Created attachment 11473 [details]
dmesg from 2.6.21.1 boot

This is the dmesg output for boot in 2.6.21.1 patched kernel. Hope it shows you
what you need.
Comment 25 Robert Hancock 2007-05-12 10:59:10 UTC
The only apparent difference that seems relevant (other than you running a
32-bit kernel and me running a 64-bit) would be that you're hotplugging a drive
that's the second port on the controller (each controller instance has 2 ports)
whereas most of my tests were with the first port. Could you try a hot-unplug
test (after doing the sdparm command) on one of the drives on the first port,
i.e. where either the ST380817AS or the ST3160827AS are connected?
Comment 26 Harald Hope 2007-05-22 16:44:20 UTC
kernel 2.6.21.2-rc1 just came out, so I thought I'd give that one a spin, but
sadly, now the stuff just locks up completely, on sata insertion as well as
removal (well, I can't actually test removal any more because of the initial lockup.

On the bright side, usually the rseivb sequence lets me reboot.

I tried removing another drive from what I assume is the other port (4 sata
ports, 3-4 and 1-2, I assume each block has one controller. 

It's hard to say what happened, but the end result is the same, system freezes.
However, since you're running 64 bit kernel here, with a different setup, I
assume only 1 or 2 sata disks, I don't see much chance of working out this issue
to be honest, unless you think of something different.

The other drives I'm removing are my primary work data drives, and I'm not going
to take any further risks with them, sorry, that's it for now.

I wish I could help more, but I think we need to locate a setup for you that
more closely resembles mine, 64 bit, multiple drives, too bad you don't live
down the street, heh heh...

Since I now basically have no sata hotplugging at all anymore, I'm giving up for
now. The new kernels, I asked our kernel guru guy, are made with the libata
patches, that's all I can tell you, and the only way I can get any output at all
is using the args you gave: nmi_watchdog=1 idle=poll 
that lets me install/remove the drive, without that it's instant freeze, now on
install too.

I guess I'll just keep my fingers crossed that some other bug when it gets fixed
takes care of this one, but I can't really do much more to help unless you come
up with a really concrete test, although I'll give anything you want a try now
and then, but not using my main data drives, that's just too risky for my taste.

But I really do recommend that you create another partition so you can test 32
bit too, otherwise there are just too many differences, I guess I could test 64
bit, I have a 64 bit test install that I never use as well, but seems like we're
grasping at straws at this point, so I'll wait until something more solid comes up.

As to what happened when I removed another drive, it was pretty odd, nothing
happened, nothing at all. No output message, no crash, until I did it a few
times, then cfdisk to make sure it was there, then everything froze solid, not
even the sysrq sequence could free it, and since that was my data drive, I
decided, ok, that's it, no more testing on live disks with live data.

My disk setup is this, by the way, I think

on mobo, sata plugs are:
2. - this is the one I've been testing, esata/sata hotswapping
1. maybe /dev/sda, not sure
4  maybe /dev/sdb
3. I think this is /dev/sdc, not sure, that's the one I tried today to remove

that's all I can tell you, 3 disks on system, sata connector 2 is empty or is
used for esata depending on what I'm doing, the other 3 are active data drives.

Hope something in here clicks and makes someone realize what's wrong, but I
don't think I can do much more on my end at this point, since all the tests are
resulting in the same essential outcome at this point, whether I use the new
kernel, the kernels I made with patches, etc, so I really think it's now a
matter of someone in the kernel world realizing what it could be, I'm not expert
enough to be of much more use, sorry, wish I could be. I guess from now on, if I
need to use esata I'll just boot with it on and forget all about hotplugging for
a while.
Comment 27 Harald Hope 2007-05-22 16:49:58 UTC
Sorry, typo, change the 64 to 32 here, in other words, the fact we're using
different systems, 32 and 64, just makes the odds of getting this debugged even
lower:

<<< more closely resembles mine, 64 bit, multiple drives
Comment 28 Robert Hancock 2007-05-22 16:53:15 UTC
2.6.21.2 isn't going to behave any differently. Have you tried 2.6.22-rc2?
Comment 29 Harald Hope 2007-05-22 17:16:11 UTC
No, once we get it, which should be fairly soon, I'll give it a try. I guess
what I'll do is just try the latest kernels when we get them, which is usually
quite soon after release, if not day of, do a quick test on my 32 bit test
partition, confirm if issue has changed, improved, or worsened, and leave it at
that.

I'll let you know what rc2 shows as soon as we get it here, rc1 we just got today.

basically I think what I'll do is just the simple test: install kernel, reboot,
try hotswap sata/esata, see what happens. If it fails on install, removal, etc,
then I'll just leave it at that.

I'll keep you posted though, I have to test these new kernels anyway, so might
as well add one kernel crash recovery sequence to the initial test, just adds a
minute or two to the test, now that I learned the sysrq method I don't have to
do the hard resets anymore, at least usually not.
Comment 30 Harald Hope 2007-05-22 17:18:10 UTC
Oh, sorry, misread, you said: 2.6.22-rc2, no, that one I won't be trying for a
while, maybe I can get someone to make me one though, we'll see. I usually just
test our user kernels when they are released, I've avoided the rough stuff for
now. I'll see if someone has that one made up though and give it a try if they do.
Comment 31 Harald Hope 2007-05-26 19:28:40 UTC
Small update: since I needed to do a backup and esata is now fully
non-functional, I booted into an older kernel, 2.6.18.1, and found that there
are no problems at all with sata hotplugging. Turned on device, got immediate
notifiction in console, when finished with backup, turned it off, system
responded exactly as it should have, no problems at all. The next backup I do
I'll try running in the 2.6.19 kernels, and see if it's the same there.

Also, I'm not sure if this is related, or simply a side affect of all the
crashes, but when I tried to start 2.6.21.1 with esata drive attached and on,
the system insists that the drive is mounted even though df -f, cat /proc/mounts
all show it's not, so I couldn't even run a backup that way anymore since I
couldn't mount the drive at all. In other words, for me now, 2.6.21 kernels are
totally useless for esata/sata hotplugging activities, I'll report back on
2.6.19 and 2.6.22 when I get one running.
Comment 32 Harald Hope 2007-05-26 19:30:57 UTC
sorry, typo: I meant df -h
Comment 33 Konstantin Sobolev 2007-06-04 03:10:42 UTC
I have the same kind of problems.
NForce4, 64-bit system, kernel 2.6.21. System hangs on any attempt to hot-plug
or unplug the box using eSATA cable.
Even worse, sometimes it hangs when just accessing the drive. For instance: I
had the drive plugged, booted the OS, used it for several hours, then I left my
workstation for another several hours, and when I came back and tried to acces
the drive (just tried to play a video off it with mplayer) - system freezed.
Freeze looks very similar, magic SysRq doesn't work as when it hangs on hotplug.
System died 4 times in 2 days and seem to work OK after I've unplugged the drive.
Comment 34 Robert Hancock 2007-06-04 07:41:33 UTC
You'll want to try the patch in comment 9 to see if that helps the problem. If
not, can you try the debug patch in comment 18 and see what output it gives you,
if any?
Comment 35 Tejun Heo 2007-06-20 00:53:26 UTC
Setting status to NEEDINFO.  Please provide the requested information.
Comment 36 Harald Hope 2007-07-02 01:02:57 UTC
Kernel 2.6.22-rc7 smp, hard lockup, identical to 2.6.21 behavior, could reboot with rseiub.

I pretty much gave on this getting fixed (and looks like that was a good move) , and bought a pci-x Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller card to do esata with since I have to have this stuff work, the sil stuff seems to work fine. 

I did find some weird behaviors with esata even with that today however, but I can't yet pinpoint anything specific beyond the system went into some type of loop trying to mount/do some type of IO operation after the drive was removed (problems in EH? doesn't seem to be working very robustly) the esata drive (via the sil3132 card) which nothing could stop except a reboot.

Sorry didn't get to 2.6.22 test until today, had to do other things with box than try to break it.
Comment 37 Tejun Heo 2007-07-02 03:16:11 UTC
Harald, care to post the weird esata log?  EH actually is supposed to be pretty robust.
Comment 38 Harald Hope 2007-07-02 14:37:48 UTC
I decided, since 2.6.22 clearly DID in no way fix this issue, to spend some time this morning seeing where the bug is and isn't.

Remember I have dual channel sata, so the question was: will this happen if the other drives are removed? The answer is yes, it does, but it slightly different ways, I removed all drives but one, then tried hot installing into either the second port(?) in the main sda channel, or a port in the second sata channel. Both cases were different, but both crashed the system, with varying severity.

I also decided to test 64 bit, which didn't crash it (by the way, you'll have to run 32 bit to actually ever find and kill this bug, it's pointless trying to debug this on a different kernel).

I'll add some attachments, first for 32 bit then 64 bit.

I also got some error output from my first try, using rseifub, at the e stage, I'll post that here, that was a total lockup though, on same channel as main drive.

Code: 
04 00 00 ba 1d 00 00 00 83 c0 01 83 e0 3f 89 83 00  04 00 00 8d 83 08 04 00 00 e8 2a 07 4f c7 8b 9b 10 04 00 00 81 eb 10 04 00 00 <8b> 83 10 04 00 00 0f 18 00 90 8d 83 10 04 00 00 39 e8 0f 85 78

EIP:[<f8c9bb72>] evdev event +ox152/0x170 [evdev] SS: ESP 0068:c03e7e70[or 6, can't read my writing] kernel panic: not syncing fatal failure exception in interupt.

The details on the other issue I didn't write down, basically what happened is that I tried to manually mount esata drive, it hung, then I tried it again, and I think the first attempt kept banging at it, giving the same error over and over, which I didn't write down. I can't say I can call what I'm seeing 'robust EH', no offense, to me it looks like badly broken EH.
Comment 39 Harald Hope 2007-07-02 14:40:46 UTC
Created attachment 11925 [details]
64 bit nvidia nforce4 data

64 bit data, 64 bit is not as badly broken as 32 bit. I recommend someone with a 32 bit system and at least 2 sata hard disks work on this, or that someone installs a 32 bit system (15 minutes to do)
Comment 40 Harald Hope 2007-07-02 14:46:48 UTC
Created attachment 11926 [details]
32 bit data, as much as I could collect

Note that the behavior is slightly different depending on if you use the second port on channel 1, or if you use one of the empty ports on channel 2, using the same port seems to create a harder crash.

Since I removed 2 of the drives, this should make it easier: it's not related to the other drives, and it happens on both channels, but in different ways.

Since this is always reproducible, it's hard to see how someone without the same hardware couldn't get this, unless it's related to some totally hitherto unsuspected issue (which it might be, who knows?). Again, nforce 3 has no problems, it's perfect.

And sil stuff is working fine too, with that one exception I mentioned, which I might be able to reproduce I think, though it's also a fairly nasty hang if it happens, the system basically endlessly complains about missing IO device for action.

To reproduce, if I remember right, this is with sil, and it was in kde, not console: automount comes up, cancel, then use manual mount command to try to mount esata. Something internally in the system just doesn't register that the drive is gone, and for hours will keep trying to write to the missing device. Can't say if this is hal, but I do  know it was flashing this kernel error line in console continuously, so quickly I wasn't able to do anything, and it lasted for about 5 hours before I gave up and rebooted. Again, this EH does not strike me as robust at all, sorry.
Comment 41 Robert Hancock 2007-07-02 17:05:30 UTC
The Silicon Image card issue will be something else, a separate bug entry should be opened for that problem as it is just confusing the issue otherwise. (More details would be needed as well).

I'm not seeing a conclusive difference between 32-bit and 64-bit kernels from your output. It all comes back to the same thing I concluded before, you're somehow not getting hotplug or hot-unplug interrupts when the port is in ADMA mode, and whether or not it will be in that mode depends on the last command issued to the drive. When the drive is connected or disconnected, it should be detected immediately. If not, there's little point comparing what happens after that point since it's clearly not working properly.

I'm still entirely unable to reproduce this, also - hotplug and hot-unplug seem to work perfectly fine for me regardless of which mode the controller is in. I can't really explain how that would be, unless it's somehow related to the particular mainboard or chipset revision or something. I'm adding Jeff Garzik to the cc list, he has the docs for this controller which might be able to shed some light on this.
Comment 42 Tejun Heo 2007-07-02 23:13:27 UTC
Yeah, robustness my ass.

I'll try to reproduce the infinite retries on device removal.  I've seen that behavior with mplayer which just never gives up no matter how many times it was told the requested IO operation failed.  I don't think we have in-kernel user which is that persistent but who knows.  When the infinite message occurs, what were you doing exactly?  Does fuser show anything accessing the device when the problem occurs?

I'll prep a patch to shut up the messages after certain times.

Robert, hotplug event delivery working or not, I think it's really important to avoid at least oopsing.  Would that be difficult with the controller?
Comment 43 Robert Hancock 2007-07-03 07:31:13 UTC
You're right, it shouldn't be oopsing. It's not clear where that is happening though, the EIP they mentioned looks like it's in evdev somewhere?
Comment 44 Tejun Heo 2007-07-03 18:30:27 UTC
Harald, can you please build the kernel with frame pointer and capture the full oops using either serial or netconsole (Documentation/networking/netconsole.txt)?

Thanks.
Comment 45 Harald Hope 2007-07-04 15:32:51 UTC
< When the infinite message occurs, what were you doing exactly? 

I think what happened was: I hot installed esata drive, popup automounter appeared, I clicked mount (which I usually don't do with esata), it gave a mount error (you can't actually mount esata as user in kde unless I change some permissions I believe), then I turned it off because a normal command line root mount failed, then turned it back on, mounted it and did some work with it, but the whole time the system was running at 100% or so, hot, and I finally realized that the initial process had gotten stuck in some loop of failed IO operations, didn't write down the message, but it was streaming by in console so fast I couldn't actually see what I was typing.

I checked in htop, and the job that was eating up the cpu was a mount operation, I believe the second one I tried after I got the mount error from automount. I couldn't kill the job at all, no matter what I tried, so I finally gave up and rebooted.

However, since this involves kde and maybe other things I agree it doesn't belong in this bug report, except maybe to indicate that the kernel error handling isn't all it could be...

<<< can you please build the kernel with frame pointer and capture the full

Not this week unless I find a few free hours, maybe next week, sorry, bad time for me this week, but I'll see.
oops using either serial or netconsole
Comment 46 Tejun Heo 2007-07-04 19:23:26 UTC
I'm worried that we might have some in-kernel retry logic which enters infinite loop on failure.  If that's not the case, the only fault in kernel is not throttling the IO-rejected-to-offline-device messages.  A process which can't be killed mean that the process is stuck in the kernel tho.
Comment 47 Harald Hope 2007-07-04 22:39:41 UTC
Well, this is getting fun... brand new machine, old intel pentium 3, i810 era chipsets, HP, with TI firewire card going to external firewire drive, data storage.

This one looks remarkably similar to my recent looping issue. 

Kernel 2.6.22-rc7 in both cases. 2.6.22-rc7 is now off my main box, I do  not consider it stable enough to run, and with this latest failure I'm even less enthusiastic.

Attached dmesg log, looks scary, right? Looks like a dead drive, with lots of damanged sectors, in fact. Turn off drive, let system stabilize, turn back on, works fine, no issues.

I did realize one common thread however, both times the drive is mounted in a directory shared via nfs, in other words, the nfs-kernel-server is running on these two machines.

I haven't had time to test 2.6.22 on a non shared directory, and unfortunatley I won't be able to get to this, but since this one just happened, thought I'd post it since it looks like this might actually have something in common, failure of hot plugged drives, but this case is firewire. 

Overall it seems like only some things trigger this, I can't say what yet, not enough facts, I wish I had time to test non nfs mounted stuff to see if that pinpoints it further but won't for a while.
Comment 48 Harald Hope 2007-07-04 22:41:53 UTC
Created attachment 11945 [details]
Some logs this time, and no kernel crash...

Since this output reminded me of what I saw with the esata issue, and the event itself suggests hardware failure where none appears to be, this is sort of interesting. Of course, these 3 events could all be totally unrelated, stranger things have happened.
Comment 49 Tejun Heo 2007-07-04 22:51:37 UTC
The message is missing the interesting part.  What happened is..

1. For some reason the drive went offline (could be a hardware issue, could be a problem in firewire stack)

2. The disk device is offlined.

3. Each user of the filesystem gets errors.  Some will retry but they eventually give up.

4. Device turned back on and reattached as different device.

Note that #3 and #4 are separate events, so you turning off the device and turning it back on didn't really affect #3.  Or did you turn off the device before #1?  Please attach full log.
Comment 50 Harald Hope 2007-07-04 23:11:03 UTC
drive went offline, correct, first. After an hour or so I realized it had in fact gone offline, and turned it off. This has happened before on this test box, but I didn't register it last time, this time I noticed it and realized what was happening.

First attempt to turn on was just a quick off/on, with the posted results.

Then I left it off for about 10 minutes, started it, mounted it, was fine.

You're right, it reinstalled a new device, and as that new device, worked fine, and is still working fine, which to me rules out or makes less likely hardware issues.

I'll have to wait to post more, sorry.
Comment 51 Tejun Heo 2007-07-04 23:19:04 UTC
That doesn't necessarily rule out hardware issue but in fact we're getting a new firewire stack and you might be hitting a bug there.  It could be that the new driver isn't properly dealing with transient hardware failures or what not.  Anyways, please post full kernel log when you have time.  Thanks.
Comment 52 Harald Hope 2007-07-05 00:11:12 UTC
Created attachment 11946 [details]
kernel log, i810 firewire issue

kernel log from july 4, intel i810 box.
Comment 53 Harald Hope 2007-08-07 13:27:42 UTC
I'll be trying the fix later, but I wanted to post this, to keep this alive:

Since we now have 2.6.22.1 kernels, I ran some tests again. First, and I think foremost, I want to really urge the bug handler here to stop assuming that using 64 bit systems to test a primarily 32 bit issue is a good idea. So can we either get someone assigned who can install a 32 bit test system, or who runs one, or get the current bug person to do that? The assumption that the bugs would be the same, as I found today, was utterly unwarranted. This point is to me so utterly basic that I have little hopes of this bug ever getting resolved if such a simple thing cannot be done.

So, without further ado, here's today's basic findings, with 32 and 64 bit kernels.

First, with 64 bit, latest Debian Sid system, 2.6.22.1 kernel, smp:

hotplug sata drive: system sees the event right away, all appears good and normal, no lag on detection, drive is present and accounted for. All is good.

hot unplug drive: system sees nothing, but does not crash. Recalling the advice here, I decide to give system a kick, and try to cfdisk /dev/sdd
I wait quite a while, maybe a minute or two, then the remove events scroll by and cfdisk complains no drive is present. So after the system is given the kick, it does discover the missing drive, then implements the error handling, which works. But the discovery itself does not work.

All, except for failing to detect automatically the drive removal, is working reasonably well.

So that's it for 64 bit. Some small issues, but I assume issues that are handleable.

Now, getting back to the initial bug report, 32 bit kernel, latest Debian Sid, 2.6.22.1 smp kernel.

Boot into kernel, hotplug in sata. System immediately freezes. Fully. Instantly. As you can see, hopefully, the behavior in 64 bit and 32 bit is not the same, so let's test 32 bit, and then worry about fixing the smaller 64 bit issues. Installing a 32 bit system on an amd64 box takes a few minutes. I would really like to get this issue resolved one day, but we're going to need to run the same kernels I think.

Examining the freeze, I give up, and start the rseiub sequence, but then notice that at RS, suddenly the kernel sees the drive, with this message:

ata6: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
then, about 40 of these in a row:
vt: argh, driver_data is NULL!

then the normal sata discovery data below that. 

Curious at this point, I check to see if I can remove the disk again, but, sadly, no joy, full freeze, rseiub.

So that's the current status with the current kernels. Nothing much has changed, on 32 bit the situation is a bit worse than when I first reported this issue, on 64 bit, a bit better. 

Meanwhile, having given up on any near term solution in kernel, I'm still happily running my esata stuff via a pci-e sil 3112 card, which hotplugs happily.

I'll see about running the next test building a kernel, as requested, but can we at least agree that 32 and 64 bit is not the same thing? and then maybe start looking at the same system for testing and debugging purposes? This does not strike me as an unreasonable request.
Comment 54 Robert Hancock 2007-08-07 17:06:19 UTC
First of all, I think we need to clarify what is a freeze and what is disk activity stalling. If SysRq, etc. works then I don't call that a freeze, though if disk activity has stalled then processes may appear to be hung up. In this situation it may still un-hang after commands time out and trigger a reset. If it was truly frozen (interrupt storm, hardware stall, etc.) then all that would work would be reset button.

Given that ambiguity, I'm still not entirely convinced that there is any consistent difference between 32 and 64 bit. The fact still remains that in neither case are you seeing all of the expected hotplug/hot-unplug interrupts showing up, whereas for me, in 64-bit mode, it consistently works. Without explaining why that is, there's little point in running other tests.

Maybe it is a different chipset revision? Can you post the output of "lspci -v" ?
Comment 55 Harald Hope 2007-08-07 18:28:56 UTC
Created attachment 12312 [details]
lspci -v output

There's the lspci -v output. 

Why you won't test your claims re 32/64 bit, validate them empirically as correct or incorrect by actually running the stuff for real, is beyond me. If the result is the same, great, then you know. If it's not, great, then you know that too. Isn't it better to KNOW than to believe when it comes to questions like this? From what I can see, the guesses are making this bug worse on 32 bit and better on 64 bit. I find it odd to even have to say this, to be honest.

Anyway, the 64 bit gives me some hope, since it's quite close to working, so that's something.
Comment 56 Harald Hope 2007-08-07 18:34:38 UTC
I'll give the theory that this is an event that should time out a try, and post back, however, that never was the case in the past.

I'll give it... 10 minutes? Longer? then post back with the results. If I understand you right, if it's not a freeze, it will at some point recover.

.
Comment 57 Harald Hope 2007-08-07 19:22:21 UTC
I gave it 30 minutes, nothing happened, it stays crashed. Whatever semantics you want to apply to this event is fine with me, when you hotplug sata in 32 bit kernel it's crashed and it doesn't recover. You can still use rbseiub, but that's all you can do. You are technically correct to note that the kernel is not fully frozen, gpm for example still works, but that's all that works, keyboard is dead, however you want to describe this is up to you, the kernel has crashed and does not recover on its own. For real world users, that's it, dead.

Also noteworthy in my opinion is the fact that while on hotplug after doing RS you get the event of sata install registered, output etc, on removal of the drive, nothing happens, you just have to rseiub your way out of the situation, it's fully locked up.

I can't really tell you more than this, when I have time I'll try to make one of the kernels with framepointer to see if I can get some more output.

By the way, hotplug works great in windows 2000 on this box, mobo, etc, can't remember if I mentioned that.
Comment 58 Harald Hope 2007-08-07 19:38:53 UTC
Created attachment 12315 [details]
root version of lspci-v, has some more data

Just noticed that the first version was not complete, this one has all the output, no 'access denied', sorry.
Comment 59 Tejun Heo 2007-08-07 19:55:23 UTC
Harald, please lemme clarify a few things.

In, say, 95% of cases, 64 and 32 bit kernels act the same as far as libata drivers are concerned.  I guess this one might fall in the 5% region tho.  Also, distinguishing between 'system freeze' and 'disk stall' is very important primarily because it gives a good clue about what's going on in the machine.  System freeze, as Robert said, indicates IRQ storm or ATA controller dying while holding PCI BUS, etc whereas 'disk stall' indicates missing EH events, transmission errors or other driver bugs, so getting this right is important for debugging.  Robert isn't trying to tell you it's less broken that way.  A simple way to test is pressing numlock and see whether the LED responds.

And, yeah, I suppose windows drivers are in better shape.  Till this day, no one in Linux land has docs for nv adma.  All we have is quasi-working sample driver from nvidia and Robert has been reverse engineering a lot, so let's give him some credit for that.

I understand that debugging process can be very frustrating and appreciate the time and energy you put into fixing this problem.  Bug reporters who can test patches and report the result are really valuable to the development process.  I just wanted to clarify a few things so that there's no misunderstanding.

Thanks.
Comment 60 Harald Hope 2007-08-07 20:09:33 UTC
yes, I understand all this stuff, but I still fail to understand the logic, if there is any, behind simply spending 10 minutes to install a 32 bit system then confirming that yes, in fact, it's the same. This is what I find frustrating more than anything else in this process. I do it, and I'm not some elite hacker, so what is the problem here? The refusal to do this basic step I find almost absurd, given that it takes about 10 minutes to install a distro, far far far less time than I'm spending on each one of these steps, so other than sheer poverty, a 100% full hard disk, or whatever other reason, I simply cannot see any valid reason to not do this simple thing.

The freeze issue I understand fine, but since I'm describing the phenomenae as they occur, it's fairly clear what's happening.

I pointed out that windows worked to show that the hardware itself is probably not at fault, nothing more.

Maybe I should start posting on the nv forums. The problem here is that a large number of higher end consumer mobos are still running nforce4 stuff, I was surprised recently to find how many are running this. 

having to reverse engineer is obviously no fun at all, and it's not to nvidia's credit that they are not providing better code, why that is is hard to understand.

However, it does seem that this process is basically not getting anywhere, so it strikes me as a good time to try different things at this point.
Comment 61 Robert Hancock 2007-08-07 21:04:44 UTC
Well, your nVidia Corporation CK804 Serial ATA Controller is revision f3, same as mine, so that doesn't explain any difference.

Looking at some of your earlier output, I see this line on a hot-unplug case:

ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1581000 status
0x402 next cpb count 0x0 next cpb idx 0x0

This indicates that the controller is showing interrupt pending and ADMA hot-unplug in status, but apparently it didn't actually raise the interrupt.

These controllers have this funky notifier mechanism for acknowledging interrupts which is a real pain, honestly. Can you try making a change in sata_nv.c, this line is on line 898 in the current git tree, should be nearby in similar versions:

        if(notifier_clears[0] || notifier_clears[1]) {

Can you change that to just:

        {

and let me know if that changes anything? That will make it write to the notifier clear registers regardless of whether we think we need to or not.
Comment 62 Harald Hope 2007-08-07 23:24:40 UTC
That patch seems to have undone whatever damage other patches did since I first filed this bug report. So it's an improvement. In 32 bit, now, when I hotplug the sata drive, it acts as expected, it registers the event, detects the drive, everything is normal, pretty much the same as in 64 bit, maybe a bit slower, I'd have to really look at both to compare, my impression was that 64 bit registered and reacted a bit faster, but I'd have to really time it to be sure, neither took very long.

Sadly, the initial problem remains the same, 32 bit crashes on hotunplug, rseiub required to get out.

I don't mind running these patches if they are simple like this, it's fairly easy to do, albeit time consuming.

So that's promising, or something anyway, which is nice to see, your guess looks like it was right for hotplug, not for hotunplug.
Comment 63 Tejun Heo 2007-08-08 20:12:31 UTC
*** Bug 7545 has been marked as a duplicate of this bug. ***
Comment 64 Tejun Heo 2007-08-08 20:41:24 UTC
*** Bug 6665 has been marked as a duplicate of this bug. ***
Comment 65 Tejun Heo 2007-08-08 21:11:07 UTC
*** Bug 8787 has been marked as a duplicate of this bug. ***
Comment 66 Harald Hope 2007-08-22 12:41:01 UTC
Small update, some more facts. I'm hoping that one day, one of these facts is going to set off a bell in someone's head, and the fix will become obvious:

# dmidecode | grep -A4 Base
Base Board Information
        Manufacturer: MICRO-STAR INTERNATIONAL CO., LTD
        Product Name: MS-7125
        Version: 1.0
        Serial Number:

I was talking to a developer, and he was trying to reproduce the bug on an MSI neo4 platinum nforce4 board, but he realized that there are several main revisions, he was testing on revision 3.0, while mine is the first, Version: 1.0, the one with the mobo dual nvidia+marvel gigabit networking.

So that's the first thing, this bug appears to strike nforce4 PCB version 1.0 but not 3.0. I don't know if there is a 2.0, or if it is affected. This might account for why this is so difficult to reproduce.

Second, and I thought much more interesting. Since my last tests, I had forgotten to change my esata drive from the mobo attached ports to the sil pci-x card, I started it up, having forgotten this, and of course the system instantly froze, as expected with the stock kernels, unpatched.

I swore to myself, then thought, oh well, might as well try to move the esata cable from the mobo connected port to the pcix card port before I give up and do a hard reboot. And guess what! The kernel instantly de-crashed, unlocked, unfroze, pick your term... not after after a few seconds, it was instant.

I think this has to give some meaningful clue, I can I suspect reproduce this with some kernel and dmesg output, but this has to be meaningful, an event exists that will uncrash the kernel instantly when applied, and that event is removing the connection and creating a new one to a non problem sata chip.
Comment 67 Robert Hancock 2007-10-04 17:43:12 UTC
It would certainly appear that the board type/revision is somehow related here.. Possibly it's a different revision of the chipset or something (though we already checked that the PCI revision code is the same as mine, that doesn't necessarily mean anything though).

I'm not really sure how we can tell what is going on here, though, without getting some assistance from NVIDIA and/or MSI..
Comment 68 kiev 2008-05-25 16:07:41 UTC
for me she showed up one time in the floor of hour, however as a result of this
problem I lost a mysql database - mysql innodb not start - "Accertion error" -
did not help even "innodb_force_recovery = 4", backup was an a week remoteness
- the works of whole department lost data for a few days, the management simply
in shock - I going to discharge from job (((

this problem already whole year:
-----------
I'm stumped trying to track down the below intermittent problem.....
I've confirmed this problem on 2.6.19, 2.6.20 and 2.6.21.
-----------
http://lkml.org/lkml/2007/6/14/154
http://kerneltrap.org/mailarchive/linux-kernel/2007/6/14/103765
http://kerneltrap.org/node/16175

"System hang from time to time" http://bugzilla.kernel.org/show_bug.cgi?id=8300
"sata hotplug removal of drive freezes all 2.6.21 kernels"
http://bugzilla.kernel.org/show_bug.cgi?id=8421
"(sata_via) system freeze in random time"
http://bugzilla.kernel.org/show_bug.cgi?id=9115
"kernel freezes with on clockevent warning"
http://bugzilla.kernel.org/show_bug.cgi?id=9834
"[pata_ali] Unspecified hang on Acer laptop"
http://bugzilla.kernel.org/show_bug.cgi?id=9898
"System freezes after I/O on pata_jmicron device"
http://bugzilla.kernel.org/show_bug.cgi?id=10296

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217920
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/229747
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159521
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/187146
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/221437
https://bugs.launchpad.net/ubuntu/+bug/226600
Comment 69 Robert Hancock 2008-05-26 00:04:16 UTC
kiev, I think you are referring to a bunch of entirely unrelated problems that happen to have some similar symptoms..

Note You need to log in before you can comment on or make changes to this bug.