Most recent kernel where this bug did *NOT* occur: all 2.6.20.x kernels. Distribution: Debian Sid 32 bit( sidux packaged kernels ) Hardware Environment: amd x2 3800 running in 32 bit mode, nforce4 MSI neo4 platinum mobo, varying sata disks, seagate, western digital. Software Environment: any, but for this test: standard bash shell to remove any possibility of gui desktop manager interference. Problem Description: With all 2.6.21.x kernels, starting with rc versions, esata and sata hotplug removal fails and freezes kernel instantly. Hotplug attachment works fine. Kernel freeze always happens on removal of sata drive. I first saw this with ESATA external drives, but today I tested it with internal sata to make sure to remove the remote possibility that there was some issue with drive enclosure hardware. Freeze occurred as usual when I removed the sata drive cable with the direct sata connection, so the issue is not related to esata. Also retested on 2.6.20.x kernels, and those work fine as usual. This problem never existed with any previous kernel, I've been using esata backup units for a while now, and no kernel has had this issue until 2.6.21 came out, and they have all had it. Sorry for not filing this bug report sooner, I just assumed something so glaringly obvious would get caught, but I guess it didn't. Steps to reproduce: 1. turn on machine 2. boot up any 2.6.21 kernel, including the latest 2.6.21.1. I'm using smp kernels, I guess I could also test uniprocessor kernels to make sure it's always an issue. 3. plugin sata drive connector cable (or, alternately, just power on esata external enclosure) to sata drive. With esata, I am just using a physical sata to esata plug adaptor card, no active circuitry, just physical adaptor. Drive is seen by system, everything is fine. Mount to test, fine. 4. Once confirmed mountable etc, umount, wait, then unplug sata connector cable (or turn off esata external enclosure, or removable hard drive tray). Kernel crashes instantly, I have not been able to get any error output at all, although if you tell me where to look I might find some somewhere, but I can't see anything anywhere. This is a 100% system freeze, have to do hard reset/power off to get out of it. Not sure if this helps, but here's the mounting data from bash shell first and dmesg second: ========================================== ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen ========================================== ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen ata2: hard resetting port ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100 ata2.00: 625142448 sectors, multi 0: LBA48 ata2.00: configured for UDMA/100 ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete scsi 1:0:0:0: Direct-Access ATA WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5 ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61 SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off sdd: Mode Sense: 00 3a 00 00 SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off sdd: Mode Sense: 00 3a 00 00 SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdd: sdd1 sd 1:0:0:0: Attached scsi disk sdd ============================================ as noted, I can't get any failure output data because the freeze is instant.
That sounds like an unpleasant regression. Do we officially support sata hotplug yet?
By a stroke of good fortune, I got another sata box today to test: Hardware Environment: MSI nforce3 based motherboard, K8N neo v 2.0, nforce3 250 chipset, main hard drive ide, test sata same as other box. This one is uniprocessor, amd socket 754 single core 3200. Same operating system and other factors as initial report. I tested both smp and up kernels, and there were no issues, it works as expected with both 2.6.20.10 and 2.6.21.1 kernels. No crashes, everything is normal. So it looks like this is related to the nforce4 chipset specifically. I tested both smp and up kernels to make sure that it's not some smp thing spefically. I made these notes, if they are helpful, from dmesg add and remove of sata drive, hotplugging: ======================================================== 2.6.20.10 uniprocessor sata hotplug install ======================================================== ata1: soft resetting port ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata1.00: ATA-6, max UDMA/100, 625142448 sectors: LBA48 ata1.00: ata1: dev 0 multi count 0 ata1.00: configured for UDMA/100 ata1: EH complete scsi 0:0:0:0: Direct-Access ATA WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5 SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB) sda: Write Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB) sda: Write Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sd 0:0:0:0: Attached scsi disk sda ================================================ 2.6.20.10 uniprocessor sata hotplug removal ata1: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen ================================================ ata1: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen ata1: hard resetting port ata1: SATA link down (SStatus 0 SControl 300) ata1: failed to recover some devices, retrying in 5 secs ata1: hard resetting port ata1: SATA link down (SStatus 0 SControl 300) ata1: failed to recover some devices, retrying in 5 secs ata1: hard resetting port ata1: SATA link down (SStatus 0 SControl 300) ata1.00: disabled ata1: EH complete ata1.00: detaching (SCSI 0:0:0:0) Synchronizing SCSI cache for disk sda: FAILED status = 0, message = 00, host = 4, driver = 00 =========================================================== 2.6.21.1 uniprocessor sata hotplug install =========================================================== ata1: soft resetting port ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata1.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100 ata1.00: 625142448 sectors, multi 0: LBA48 ata1.00: configured for UDMA/100 ata1: EH complete scsi 0:0:0:0: Direct-Access ATA WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5 SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB) sda: Write Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB) sda: Write Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sd 0:0:0:0: Attached scsi disk sda =========================================================== 2.6.21.1 uniprocessor sata hotplug remove =========================================================== ata1: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen ata1: hard resetting port ata1: SATA link down (SStatus 0 SControl 300) ata1: failed to recover some devices, retrying in 5 secs ata1: hard resetting port ata1: SATA link down (SStatus 0 SControl 300) ata1.00: limiting speed to UDMA/100:PIO3 ata1: failed to recover some devices, retrying in 5 secs ata1: hard resetting port ata1: SATA link down (SStatus 0 SControl 300) ata1.00: disabled ata1: EH complete ata1.00: detaching (SCSI 0:0:0:0) Synchronizing SCSI cache for disk sda: FAILED status = 0, message = 00, host = 4, driver = 00 ========================================================== 2.6.21.1 smp hotplug install ========================================================== ata1: soft resetting port ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata1.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100 ata1.00: 625142448 sectors, multi 0: LBA48 ata1.00: configured for UDMA/100 ata1: EH complete scsi 0:0:0:0: Direct-Access ATA WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5 SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB) sda: Write Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB) sda: Write Protect is off sda: Mode Sense: 00 3a 00 00 SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sd 0:0:0:0: Attached scsi disk sda ========================================================= 2.6.21.1 smp hotplug remove ========================================================= ata1: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0x2 frozen ata1: hard resetting port ata1: SATA link down (SStatus 0 SControl 300) ata1: failed to recover some devices, retrying in 5 secs ata1: hard resetting port ata1: SATA link down (SStatus 0 SControl 300) ata1.00: limiting speed to UDMA/100:PIO3 ata1: failed to recover some devices, retrying in 5 secs ======================================= Hope this is useful information for debugging the issue.
Yes, we do support SATA hotplug officially for some time now actually. It seems like a low level driver regression and I'm guessing the controller is not happy with the way driver is accessing it and just dies while holding the PCI bus thus leaving libata no chance for fighting. Harald, does specifying kernel parameter 'sata_nv.adma=0' make any difference?
Using that kernel parameter made it not crash on removing sata drive. Here's the dmesg output for that. There is one other thing I forgot to mention, even with the 2.6.20.x kernels, with the nforce4 chipset, it takes a long time (more than 30 seconds, maybe 60) for the system to register that the device has been removed, whereas when I used the kernel parameter you suggested, both install and removal occur almost immediately on the 2.6.21.1 test kernel. The test for the delay to detect removal is this on 2.6.20.x kernels: plug in sata/esata drive mount it to somewhere umount it unplug it wait 10 seconds, then try to mount it again system waits for 30-60 seconds before reporting that sdd1 does not exist with mount command. Here's the data from the successful test: test with kernel parameter: sata_nv.adma=0 =================================================== 2.6.21.1 smp: sata hotplug install: =================================================== ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen ata2: hard resetting port ata2: port is slow to respond, please be patient (Status 0x80) ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100 ata2.00: 625142448 sectors, multi 0: LBA48 ata2.00: configured for UDMA/100 ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete scsi 1:0:0:0: Direct-Access ATA WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5 SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off sdd: Mode Sense: 00 3a 00 00 SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off sdd: Mode Sense: 00 3a 00 00 SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdd: sdd1 sd 1:0:0:0: Attached scsi disk sdd ==================================================== 2.6.21.1 smp: sata hotplug remove: ==================================================== ata2: exception Emask 0x10 SAct 0x0 SErr 0x1910000 action 0x2 frozen ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2.00: limiting speed to UDMA/100:PIO3 ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2.00: disabled ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete ata2.00: detaching (SCSI 1:0:0:0) Synchronizing SCSI cache for disk sdd: FAILED status = 0, message = 00, host = 4, driver = 00
That delay thing is probably something fixed somewhere else regardless of the adma thing. Cc'ing Robert Hancock for investigating further why adma error handling locks up the whole machine. Robert, this is the second report of adma EH locking up the whole machine. What should we do about it?
The nForce3 system (and the nForce4 with adma=0) is using entirely different code paths than nForce4 with ADMA enabled, so it's not too surprising that the behavior may be different or that there may be an ADMA-specific issue. We should be able to turn on some debug to help figure out if anything is happening with libata just before the freeze.. Harald, have you tried running in console mode (not in X) with the console log level set to show all kernel messages and checked if anything shows up there before the lockup?
I tried setting console log level to show all kernel messages today, but unfortunately the freeze is instant on removal of sata drive, so there's no output at all.
It's a bit unclear how hotplug is supposed to be handled when the controller has ADMA enabled but is in port-register mode. Normally in this case we just ignore the ADMA-specific status and just check the non-ADMA status register and the standard ATA registers to see if we need to do anything. These controllers provide a hotplug bit in the non-ADMA status register but it's not clear if that is valid when in ADMA mode. Maybe we still ought to be looking at the ADMA status register in this case? There's some code in sata_nv.c that looks like this: /* if in ATA register mode, use standard ata interrupt handler */ if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) { u8 irq_stat = readb(host->iomap[NV_MMIO_BAR] + NV_INT_STATUS_CK804) >> (NV_INT_PORT_SHIFT * i); if(ata_tag_valid(ap->active_tag)) /** NV_INT_DEV indication seems unreliable at times at least in ADMA mode. Force it on always when a command is active, to prevent losing interrupts. */ irq_stat |= NV_INT_DEV; handled += nv_host_intr(ap, irq_stat); continue; } Try taking the "continue;" line out and see what that does. That will continue with ADMA status handling even if we are in port-register mode.
Created attachment 11402 [details] Patch to fix some ADMA interrupt clearing issues Please try out the attached patch (also posted to linux-kernel and linux-ide) and see if it resolves your problem. I did some hotplug testing and while I didn't see any full system lockup, I did get a timeout on the other port on the same controller that the hotplugged drive was on. This patch fixes that problem for me, hotplug appears to work perfectly on my system with it.
I tried new kernel with the patch, the behavior is different slightly, but the end result is the same, system freezes at some point. It's more variable though, not always the same, for example, I can remove the disk from the system without the system instantly freezing sometimes, but parts of it fail and destabilize, until I have to reboot anyway. Other times the system locks up right away. I tried with loglevel=7 several times, but there's no output when I remove the disk, but it didn't freeze completely always, although sometimes it does, it depends on what I did before. For example, I turned it on, it is detected, I turned it off, the system did not freeze completely, but then I tried to turn it on again, and the system froze. Probably because the drive was not actually removed in kernel I'd guess, but the kernel didn't do a full freeze. Another time the system was half frozen, and I started to reboot, and it allowed to to partially reboot, but then froze about half way, probably I would guess when it discovered the missing drive or something. Again, no output at all from kernel logging on removal, same on install.
That seems rather odd.. Does the same thing happen with hotplugging an internal drive with the patch? eSATA is not quite the same as SATA with a different connector, the signalling voltages are higher (or are supposed to be) to accomodate a longer cable length, and so really requires a controller designed for it. Using a SATA to eSATA adapter may have some problems, i.e. the max cable length may be limited.
There hasn't been any issue at all with this until 2.6.21 came out. So I doubt it's meaningful whether it's esata or sata direct, but I'll check.
Yes, I can confirm it, tested with internal, makes no difference. But I can give a repeatable error now, it's always the same: With patch, add sata drive, shows fine, expected kernel messages. remove drive, no messages at all. The one difference the patch made was that now I can actually confirm that there are no kernel messages at all on removal, since it doesn't crash until I try to do something, for example, mount the removed drive. Basically the patch seems to have let the system keep functioning until it notices that the drive is missing for whatever reason, be it a reboot/shutdown, an attempt to mount the drive it thinks is still there, or what have you. then the system freezes completely, actually more frozen than before, since even the cursor stops blinking. But there is no kernel output at all on removal, so that's obviously got to the be the problem, the removal event is simply not registering, which it was in the 2.6.20 and earlier kernels.
Could you post what output you get from 2.6.20 (with ADMA enabled) for the plug and unplug? Obviously something is happening on your setup that's not on mine when the drive gets unplugged.. For it to lock up such that the cursor even stops blinking (that was on regular VGA text console?) would seem to suggest that something is making the system very unhappy. I'm not sure what could even do that. If you could set up to do a git bisect between 2.6.20 and 2.6.21, that would likely narrow down whatever change is triggering this..
That was a good question. The 2.6.20 kernel almost made me think that it also was going to show no output for removal, but while I was doing something else after removing it, the output suddenly appeared, maybe 2 or 3 minutes after I removed it. I realized that the 2.6.20 has probably always had this delay, I just never noticed it before because usually I turned off the device and didn't try to reconnect it right away. So I'd say actually the 2.6.20.x kernels aren't actually working correctly for nforce4 chipsets. Adding notice also seems to take signficantly longer than with nforce3 by the way, several seconds more. This is much easier to see for me now because I have an example of properly functioning add and remove with nforce3, where both add and remove notice appear pretty much right away. Further tests of the patched 2.6.21.1 kernel however are erratic, this time it simply froze as it did before when I removed the sata drive. It's highly unpredictable at this point. ============================================= loglevel=7 sata_nv.adma=1 2.6.20.7 smp kernel: install ============================================= ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen ata2: soft resetting port ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata2.00: ATA-6, max UDMA/100, 625142448 sectors: LBA48 ata2.00: ata2: dev 0 multi count 0 ata2.00: configured for UDMA/100 ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete scsi 1:0:0:0: Direct-Access ATA WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5 ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61 SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdd: sdd1 sd 1:0:0:0: Attached scsi disk sdd =========================================== remove sata, delay of maybe 2 to 3 minutes for any output at all to show from kernel =========================================== pnp: Device 00:09 disabled. ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1501000 status 0x402 ata2: CPB 0: ctl_flags 0xd, resp_flags 0x0 ata2: Resetting port ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0x2 frozen ata2.00: cmd 25/00:08:3f:d6:42/00:00:25:00:00/e0 tag 0 cdb 0x0 data 4096 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error) ata2: soft resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2.00: disabled sd 1:0:0:0: SCSI error: return code = 0x08000002 sdd: Current [descriptor]: sense key: Aborted Command Additional sense: No additional sense information Descriptor sense data with sense descriptors (in hex): 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 00 00 00 00 end_request: I/O error, dev sdd, sector 625137215 Buffer I/O error on device sdd1, logical block 312568576 Buffer I/O error on device sdd1, logical block 312568577 Buffer I/O error on device sdd1, logical block 312568578 Buffer I/O error on device sdd1, logical block 312568579 ata2: EH complete ata2.00: detaching (SCSI 1:0:0:0) sd 1:0:0:0: rejecting I/O to offline device Buffer I/O error on device sdd1, logical block 312568576 Buffer I/O error on device sdd1, logical block 312568577 Buffer I/O error on device sdd1, logical block 312568578 Buffer I/O error on device sdd1, logical block 312568579 sd 1:0:0:0: rejecting I/O to offline device Buffer I/O error on device sdd1, logical block 312568632 sd 1:0:0:0: rejecting I/O to offline device Buffer I/O error on device sdd1, logical block 312568633 sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device Synchronizing SCSI cache for disk sdd: FAILED status = 0, message = 00, host = 4, driver = 00
Looking at the code, it seems likely that 2.6.20 ADMA would never notice the drive was gone until something tried to access it, because we didn't have the hotplug interrupt enable bits set. So a bisect likely won't help much since it will likely just point here: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ce0cf6fafd02fb4c43fc1a1bee6069d6c0a36b1 which fixed one problem, but apparently in your case created another. Trying "nmi_watchdog=1 idle=poll" on the kernel command line might get you some useful output if the NMI watchdog can catch it. Otherwise I'll see if I can whip up a debug patch so we can get some idea of what's going on.
Those kernel options let me get error output on 2.6.21.1 kernels, with and without the adma patch, which didn't produce any different output, so I'm just posting the data from the unpatched 2.6.21.1 smp kernel: ================================================= 2.6.21.1-2 no adma patch: sata drive install added boot options: nmi_watchdog=1 loglevel=7 idle=poll Note: install notification was almost instant with these settings ================================================= ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen ata2: hard resetting port ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100 ata2.00: 625142448 sectors, multi 0: LBA48 ata2.00: configured for UDMA/100 ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete scsi 1:0:0:0: Direct-Access ATA WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5 ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61 SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdd: sdd1 sd 1:0:0:0: Attached scsi disk sdd ================================================= sata removal Note: did not crash on removal Note: ls /dev after removal continues to show sdd and sdd1, the new drive Note: to trigger detection: mount /dev/sdd1 /media/sata takes about 25 seconds before producing this output ================================================= ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1581000 status 0x402 next cpb count 0x0 next cpb idx 0x0 ata2: CPB 0: ctl_flags 0xd, resp_flags 0x0 ata2: timeout waiting for ADMA IDLE, stat=0x402 ata2: timeout waiting for ADMA LEGACY, stat=0x402 ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0x2 frozen ata2.00: cmd 25/00:08:3f:d6:42/00:00:25:00:00/e0 tag 0 cdb 0x0 data 4096 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error) ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2.00: limiting speed to UDMA/100:PIO3 ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2.00: disabled ata2: EH pending after completion, repeating EH (cnt=4) sd 1:0:0:0: SCSI error: return code = 0x08000002 sdd: Current [descriptor]: sense key: Aborted Command Additional sense: No additional sense information Descriptor sense data with sense descriptors (in hex): 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 00 00 00 00 end_request: I/O error, dev sdd, sector 625137215 Buffer I/O error on device sdd1, logical block 312568576 Buffer I/O error on device sdd1, logical block 312568577 Buffer I/O error on device sdd1, logical block 312568578 Buffer I/O error on device sdd1, logical block 312568579 ata2: EH complete sd 1:0:0:0: rejecting I/O to offline device Buffer I/O error on device sdd1, logical block 312568576 sd 1:0:0:0: rejecting I/O to offline device Buffer I/O error on device sdd1, logical block 312568577 sd 1:0:0:0: rejecting I/O to offline device Buffer I/O error on device sdd1, logical block 312568578 sd 1:0:0:0: rejecting I/O to offline device Buffer I/O error on device sdd1, logical block 312568579 sd 1:0:0:0: rejecting I/O to offline device Buffer I/O error on device sdd1, logical block 0 sd 1:0:0:0: rejecting I/O to offline device Buffer I/O error on device sdd1, logical block 1 sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device sd 1:0:0:0: rejecting I/O to offline device ata2.00: detaching (SCSI 1:0:0:0) Synchronizing SCSI cache for disk sdd: FAILED status = 0, message = 00, host = 4, driver = 00 mount: special device /dev/sdd1 does not exist
Created attachment 11441 [details] Hotplug debug patch Can you try out this debug patch (over top of the previous one) and see if it gives you any output when the drive is unplugged?
Created new kernel, with both patches. Using first set of kernel options, kernel gave no output until drive removal, as above. On removal, same as above, 25 second or so delay, then above kernel messages. Second try, without the kernel options, install message same as first time, on removal, system did not freeze, but also no remove messages. After waiting a few minutes, I tried running mount /dev/sdd1 /media/sata to force system to check drive, and system hung, not completely, I could still type, I could still change consoles, but I couldn't do anything, required a hard reboot to get out of it. The main change was that the system didn't hang on removal of drive, until the system was asked to use the drive, then it hung again, although it didn't freeze as completely as it did before, keyboard and mouse still worked (using gpm), input just wasn't processed. ================================================= 2.6.21.1- with adma patch 1 and 2: sata drive install added kernel options first try: nmi_watchdog=1 loglevel=7 idle=poll added kernel options second try: loglevel=7 ================================================= ata2: Non-ADMA hotplug ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen ata2: hard resetting port ata2: Non-ADMA hotplug ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100 ata2.00: 625142448 sectors, multi 0: LBA48 ata2.00: configured for UDMA/100 ata2: EH pending after completion, repeating EH (cnt=4) ata2: EH complete scsi 1:0:0:0: Direct-Access ATA WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5 ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61 SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdd: sdd1 sd 1:0:0:0: Attached scsi disk sdd
Are you waiting a minute or so after it appearss to hang to see if it unseizes? It could be something was causing a timeout. I'm at a bit of a loss to explain how you aren't seeing a hot-unplug interrupt get generated, given that it works on my machine.. You could try some commands after unmounting and before unplugging the drive to cause it to be in either ADMA or port-register mode, and see if that makes a difference. Try: smartctl -d ata -a /dev/sdd to put it in port-register mode, or sdparm --command=sync /dev/sdd for ADMA mode.
Created attachment 11472 [details] smartctl output if of any use for debugging This is probably not important information but I thought I'd provide it anyway, it's the output of running the smartctl command listed prior to removing sata drive.
I ran further tests with the two commands listed: With kernel 2.6.21.1 smp, with both patches installed, first I plugged in sata drive, then ran the smartctl command, which resulted in this data: ================================================= 2.6.21.1- with adma patch 1 and 2: sata drive install added boot options: loglevel=7 ================================================= ata2: Non-ADMA hotplug ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen ata2: hard resetting port ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300) ata2.00: ATA-6: WDC WD3200JD-22KLB0, 08.05J08, max UDMA/100 ata2.00: 625142448 sectors, multi 0: LBA48 ata2.00: configured for UDMA/100 ata2: EH complete scsi 1:0:0:0: Direct-Access ATA WDC WD3200JD-22K 08.0 PQ: 0 ANSI: 5 ata2: bounce limit 0xFFFFFFFFFFFFFFFF, segment boundary 0xFFFFFFFF, hw segs 61 SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sdd: 625142448 512-byte hdwr sectors (320073 MB) sdd: Write Protect is off SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdd: sdd1 sd 1:0:0:0: Attached scsi disk sdd ================================================= smartctl -d ata -a /dev/sdd output: ================================================= see attached file: smartctl-output ================================================= sata removal Note: does not freeze on removal after running smartctl command, when I unplugged the drive, I got this information fairly quickly, the first notification of hotplug removal a second or two after unplugging drive. ================================================= ata2: Non-ADMA hotplug ata2: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0x2 frozen ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2.00: limiting speed to UDMA/100:PIO3 ata2: failed to recover some devices, retrying in 5 secs ata2: hard resetting port ata2: SATA link down (SStatus 0 SControl 300) ata2.00: disabled ata2: EH complete ata2.00: detaching (SCSI 1:0:0:0) Synchronizing SCSI cache for disk sdd: FAILED status = 0, message = 00, host = 4, driver = 00 ====================================================== Next, I tried the sdparm command, after rebooting, hotplugging the drive, before removing drive, but unfortunately that pretty much froze up the system when I removed the sata drive. No output, I waited a few minutes, but nothing. Not a full freeze, I was able to switch between tty consoles, and I was even able to shutdown nano in my second tty, su to root, and issue the reboot command, but it hung completely as soon as the text 'system going down for reboot now' appeared. Also, I was not able to do anything in the initial root login, the one that hung after I removed the drive, and I couldn't create a new tty session, if that's relevant. But I was able to still type and work in nano as regular user, save, etc.
Can you post your full dmesg output from bootup with 2.6.21? It would appear that you're not getting any hot-unplug interrupt when in ADMA mode, but I do on my setup, so it would seem there must be some difference responsible for that..
Created attachment 11473 [details] dmesg from 2.6.21.1 boot This is the dmesg output for boot in 2.6.21.1 patched kernel. Hope it shows you what you need.
The only apparent difference that seems relevant (other than you running a 32-bit kernel and me running a 64-bit) would be that you're hotplugging a drive that's the second port on the controller (each controller instance has 2 ports) whereas most of my tests were with the first port. Could you try a hot-unplug test (after doing the sdparm command) on one of the drives on the first port, i.e. where either the ST380817AS or the ST3160827AS are connected?
kernel 2.6.21.2-rc1 just came out, so I thought I'd give that one a spin, but sadly, now the stuff just locks up completely, on sata insertion as well as removal (well, I can't actually test removal any more because of the initial lockup. On the bright side, usually the rseivb sequence lets me reboot. I tried removing another drive from what I assume is the other port (4 sata ports, 3-4 and 1-2, I assume each block has one controller. It's hard to say what happened, but the end result is the same, system freezes. However, since you're running 64 bit kernel here, with a different setup, I assume only 1 or 2 sata disks, I don't see much chance of working out this issue to be honest, unless you think of something different. The other drives I'm removing are my primary work data drives, and I'm not going to take any further risks with them, sorry, that's it for now. I wish I could help more, but I think we need to locate a setup for you that more closely resembles mine, 64 bit, multiple drives, too bad you don't live down the street, heh heh... Since I now basically have no sata hotplugging at all anymore, I'm giving up for now. The new kernels, I asked our kernel guru guy, are made with the libata patches, that's all I can tell you, and the only way I can get any output at all is using the args you gave: nmi_watchdog=1 idle=poll that lets me install/remove the drive, without that it's instant freeze, now on install too. I guess I'll just keep my fingers crossed that some other bug when it gets fixed takes care of this one, but I can't really do much more to help unless you come up with a really concrete test, although I'll give anything you want a try now and then, but not using my main data drives, that's just too risky for my taste. But I really do recommend that you create another partition so you can test 32 bit too, otherwise there are just too many differences, I guess I could test 64 bit, I have a 64 bit test install that I never use as well, but seems like we're grasping at straws at this point, so I'll wait until something more solid comes up. As to what happened when I removed another drive, it was pretty odd, nothing happened, nothing at all. No output message, no crash, until I did it a few times, then cfdisk to make sure it was there, then everything froze solid, not even the sysrq sequence could free it, and since that was my data drive, I decided, ok, that's it, no more testing on live disks with live data. My disk setup is this, by the way, I think on mobo, sata plugs are: 2. - this is the one I've been testing, esata/sata hotswapping 1. maybe /dev/sda, not sure 4 maybe /dev/sdb 3. I think this is /dev/sdc, not sure, that's the one I tried today to remove that's all I can tell you, 3 disks on system, sata connector 2 is empty or is used for esata depending on what I'm doing, the other 3 are active data drives. Hope something in here clicks and makes someone realize what's wrong, but I don't think I can do much more on my end at this point, since all the tests are resulting in the same essential outcome at this point, whether I use the new kernel, the kernels I made with patches, etc, so I really think it's now a matter of someone in the kernel world realizing what it could be, I'm not expert enough to be of much more use, sorry, wish I could be. I guess from now on, if I need to use esata I'll just boot with it on and forget all about hotplugging for a while.
Sorry, typo, change the 64 to 32 here, in other words, the fact we're using different systems, 32 and 64, just makes the odds of getting this debugged even lower: <<< more closely resembles mine, 64 bit, multiple drives
2.6.21.2 isn't going to behave any differently. Have you tried 2.6.22-rc2?
No, once we get it, which should be fairly soon, I'll give it a try. I guess what I'll do is just try the latest kernels when we get them, which is usually quite soon after release, if not day of, do a quick test on my 32 bit test partition, confirm if issue has changed, improved, or worsened, and leave it at that. I'll let you know what rc2 shows as soon as we get it here, rc1 we just got today. basically I think what I'll do is just the simple test: install kernel, reboot, try hotswap sata/esata, see what happens. If it fails on install, removal, etc, then I'll just leave it at that. I'll keep you posted though, I have to test these new kernels anyway, so might as well add one kernel crash recovery sequence to the initial test, just adds a minute or two to the test, now that I learned the sysrq method I don't have to do the hard resets anymore, at least usually not.
Oh, sorry, misread, you said: 2.6.22-rc2, no, that one I won't be trying for a while, maybe I can get someone to make me one though, we'll see. I usually just test our user kernels when they are released, I've avoided the rough stuff for now. I'll see if someone has that one made up though and give it a try if they do.
Small update: since I needed to do a backup and esata is now fully non-functional, I booted into an older kernel, 2.6.18.1, and found that there are no problems at all with sata hotplugging. Turned on device, got immediate notifiction in console, when finished with backup, turned it off, system responded exactly as it should have, no problems at all. The next backup I do I'll try running in the 2.6.19 kernels, and see if it's the same there. Also, I'm not sure if this is related, or simply a side affect of all the crashes, but when I tried to start 2.6.21.1 with esata drive attached and on, the system insists that the drive is mounted even though df -f, cat /proc/mounts all show it's not, so I couldn't even run a backup that way anymore since I couldn't mount the drive at all. In other words, for me now, 2.6.21 kernels are totally useless for esata/sata hotplugging activities, I'll report back on 2.6.19 and 2.6.22 when I get one running.
sorry, typo: I meant df -h
I have the same kind of problems. NForce4, 64-bit system, kernel 2.6.21. System hangs on any attempt to hot-plug or unplug the box using eSATA cable. Even worse, sometimes it hangs when just accessing the drive. For instance: I had the drive plugged, booted the OS, used it for several hours, then I left my workstation for another several hours, and when I came back and tried to acces the drive (just tried to play a video off it with mplayer) - system freezed. Freeze looks very similar, magic SysRq doesn't work as when it hangs on hotplug. System died 4 times in 2 days and seem to work OK after I've unplugged the drive.
You'll want to try the patch in comment 9 to see if that helps the problem. If not, can you try the debug patch in comment 18 and see what output it gives you, if any?
Setting status to NEEDINFO. Please provide the requested information.
Kernel 2.6.22-rc7 smp, hard lockup, identical to 2.6.21 behavior, could reboot with rseiub. I pretty much gave on this getting fixed (and looks like that was a good move) , and bought a pci-x Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller card to do esata with since I have to have this stuff work, the sil stuff seems to work fine. I did find some weird behaviors with esata even with that today however, but I can't yet pinpoint anything specific beyond the system went into some type of loop trying to mount/do some type of IO operation after the drive was removed (problems in EH? doesn't seem to be working very robustly) the esata drive (via the sil3132 card) which nothing could stop except a reboot. Sorry didn't get to 2.6.22 test until today, had to do other things with box than try to break it.
Harald, care to post the weird esata log? EH actually is supposed to be pretty robust.
I decided, since 2.6.22 clearly DID in no way fix this issue, to spend some time this morning seeing where the bug is and isn't. Remember I have dual channel sata, so the question was: will this happen if the other drives are removed? The answer is yes, it does, but it slightly different ways, I removed all drives but one, then tried hot installing into either the second port(?) in the main sda channel, or a port in the second sata channel. Both cases were different, but both crashed the system, with varying severity. I also decided to test 64 bit, which didn't crash it (by the way, you'll have to run 32 bit to actually ever find and kill this bug, it's pointless trying to debug this on a different kernel). I'll add some attachments, first for 32 bit then 64 bit. I also got some error output from my first try, using rseifub, at the e stage, I'll post that here, that was a total lockup though, on same channel as main drive. Code: 04 00 00 ba 1d 00 00 00 83 c0 01 83 e0 3f 89 83 00 04 00 00 8d 83 08 04 00 00 e8 2a 07 4f c7 8b 9b 10 04 00 00 81 eb 10 04 00 00 <8b> 83 10 04 00 00 0f 18 00 90 8d 83 10 04 00 00 39 e8 0f 85 78 EIP:[<f8c9bb72>] evdev event +ox152/0x170 [evdev] SS: ESP 0068:c03e7e70[or 6, can't read my writing] kernel panic: not syncing fatal failure exception in interupt. The details on the other issue I didn't write down, basically what happened is that I tried to manually mount esata drive, it hung, then I tried it again, and I think the first attempt kept banging at it, giving the same error over and over, which I didn't write down. I can't say I can call what I'm seeing 'robust EH', no offense, to me it looks like badly broken EH.
Created attachment 11925 [details] 64 bit nvidia nforce4 data 64 bit data, 64 bit is not as badly broken as 32 bit. I recommend someone with a 32 bit system and at least 2 sata hard disks work on this, or that someone installs a 32 bit system (15 minutes to do)
Created attachment 11926 [details] 32 bit data, as much as I could collect Note that the behavior is slightly different depending on if you use the second port on channel 1, or if you use one of the empty ports on channel 2, using the same port seems to create a harder crash. Since I removed 2 of the drives, this should make it easier: it's not related to the other drives, and it happens on both channels, but in different ways. Since this is always reproducible, it's hard to see how someone without the same hardware couldn't get this, unless it's related to some totally hitherto unsuspected issue (which it might be, who knows?). Again, nforce 3 has no problems, it's perfect. And sil stuff is working fine too, with that one exception I mentioned, which I might be able to reproduce I think, though it's also a fairly nasty hang if it happens, the system basically endlessly complains about missing IO device for action. To reproduce, if I remember right, this is with sil, and it was in kde, not console: automount comes up, cancel, then use manual mount command to try to mount esata. Something internally in the system just doesn't register that the drive is gone, and for hours will keep trying to write to the missing device. Can't say if this is hal, but I do know it was flashing this kernel error line in console continuously, so quickly I wasn't able to do anything, and it lasted for about 5 hours before I gave up and rebooted. Again, this EH does not strike me as robust at all, sorry.
The Silicon Image card issue will be something else, a separate bug entry should be opened for that problem as it is just confusing the issue otherwise. (More details would be needed as well). I'm not seeing a conclusive difference between 32-bit and 64-bit kernels from your output. It all comes back to the same thing I concluded before, you're somehow not getting hotplug or hot-unplug interrupts when the port is in ADMA mode, and whether or not it will be in that mode depends on the last command issued to the drive. When the drive is connected or disconnected, it should be detected immediately. If not, there's little point comparing what happens after that point since it's clearly not working properly. I'm still entirely unable to reproduce this, also - hotplug and hot-unplug seem to work perfectly fine for me regardless of which mode the controller is in. I can't really explain how that would be, unless it's somehow related to the particular mainboard or chipset revision or something. I'm adding Jeff Garzik to the cc list, he has the docs for this controller which might be able to shed some light on this.
Yeah, robustness my ass. I'll try to reproduce the infinite retries on device removal. I've seen that behavior with mplayer which just never gives up no matter how many times it was told the requested IO operation failed. I don't think we have in-kernel user which is that persistent but who knows. When the infinite message occurs, what were you doing exactly? Does fuser show anything accessing the device when the problem occurs? I'll prep a patch to shut up the messages after certain times. Robert, hotplug event delivery working or not, I think it's really important to avoid at least oopsing. Would that be difficult with the controller?
You're right, it shouldn't be oopsing. It's not clear where that is happening though, the EIP they mentioned looks like it's in evdev somewhere?
Harald, can you please build the kernel with frame pointer and capture the full oops using either serial or netconsole (Documentation/networking/netconsole.txt)? Thanks.
< When the infinite message occurs, what were you doing exactly? I think what happened was: I hot installed esata drive, popup automounter appeared, I clicked mount (which I usually don't do with esata), it gave a mount error (you can't actually mount esata as user in kde unless I change some permissions I believe), then I turned it off because a normal command line root mount failed, then turned it back on, mounted it and did some work with it, but the whole time the system was running at 100% or so, hot, and I finally realized that the initial process had gotten stuck in some loop of failed IO operations, didn't write down the message, but it was streaming by in console so fast I couldn't actually see what I was typing. I checked in htop, and the job that was eating up the cpu was a mount operation, I believe the second one I tried after I got the mount error from automount. I couldn't kill the job at all, no matter what I tried, so I finally gave up and rebooted. However, since this involves kde and maybe other things I agree it doesn't belong in this bug report, except maybe to indicate that the kernel error handling isn't all it could be... <<< can you please build the kernel with frame pointer and capture the full Not this week unless I find a few free hours, maybe next week, sorry, bad time for me this week, but I'll see. oops using either serial or netconsole
I'm worried that we might have some in-kernel retry logic which enters infinite loop on failure. If that's not the case, the only fault in kernel is not throttling the IO-rejected-to-offline-device messages. A process which can't be killed mean that the process is stuck in the kernel tho.
Well, this is getting fun... brand new machine, old intel pentium 3, i810 era chipsets, HP, with TI firewire card going to external firewire drive, data storage. This one looks remarkably similar to my recent looping issue. Kernel 2.6.22-rc7 in both cases. 2.6.22-rc7 is now off my main box, I do not consider it stable enough to run, and with this latest failure I'm even less enthusiastic. Attached dmesg log, looks scary, right? Looks like a dead drive, with lots of damanged sectors, in fact. Turn off drive, let system stabilize, turn back on, works fine, no issues. I did realize one common thread however, both times the drive is mounted in a directory shared via nfs, in other words, the nfs-kernel-server is running on these two machines. I haven't had time to test 2.6.22 on a non shared directory, and unfortunatley I won't be able to get to this, but since this one just happened, thought I'd post it since it looks like this might actually have something in common, failure of hot plugged drives, but this case is firewire. Overall it seems like only some things trigger this, I can't say what yet, not enough facts, I wish I had time to test non nfs mounted stuff to see if that pinpoints it further but won't for a while.
Created attachment 11945 [details] Some logs this time, and no kernel crash... Since this output reminded me of what I saw with the esata issue, and the event itself suggests hardware failure where none appears to be, this is sort of interesting. Of course, these 3 events could all be totally unrelated, stranger things have happened.
The message is missing the interesting part. What happened is.. 1. For some reason the drive went offline (could be a hardware issue, could be a problem in firewire stack) 2. The disk device is offlined. 3. Each user of the filesystem gets errors. Some will retry but they eventually give up. 4. Device turned back on and reattached as different device. Note that #3 and #4 are separate events, so you turning off the device and turning it back on didn't really affect #3. Or did you turn off the device before #1? Please attach full log.
drive went offline, correct, first. After an hour or so I realized it had in fact gone offline, and turned it off. This has happened before on this test box, but I didn't register it last time, this time I noticed it and realized what was happening. First attempt to turn on was just a quick off/on, with the posted results. Then I left it off for about 10 minutes, started it, mounted it, was fine. You're right, it reinstalled a new device, and as that new device, worked fine, and is still working fine, which to me rules out or makes less likely hardware issues. I'll have to wait to post more, sorry.
That doesn't necessarily rule out hardware issue but in fact we're getting a new firewire stack and you might be hitting a bug there. It could be that the new driver isn't properly dealing with transient hardware failures or what not. Anyways, please post full kernel log when you have time. Thanks.
Created attachment 11946 [details] kernel log, i810 firewire issue kernel log from july 4, intel i810 box.
I'll be trying the fix later, but I wanted to post this, to keep this alive: Since we now have 2.6.22.1 kernels, I ran some tests again. First, and I think foremost, I want to really urge the bug handler here to stop assuming that using 64 bit systems to test a primarily 32 bit issue is a good idea. So can we either get someone assigned who can install a 32 bit test system, or who runs one, or get the current bug person to do that? The assumption that the bugs would be the same, as I found today, was utterly unwarranted. This point is to me so utterly basic that I have little hopes of this bug ever getting resolved if such a simple thing cannot be done. So, without further ado, here's today's basic findings, with 32 and 64 bit kernels. First, with 64 bit, latest Debian Sid system, 2.6.22.1 kernel, smp: hotplug sata drive: system sees the event right away, all appears good and normal, no lag on detection, drive is present and accounted for. All is good. hot unplug drive: system sees nothing, but does not crash. Recalling the advice here, I decide to give system a kick, and try to cfdisk /dev/sdd I wait quite a while, maybe a minute or two, then the remove events scroll by and cfdisk complains no drive is present. So after the system is given the kick, it does discover the missing drive, then implements the error handling, which works. But the discovery itself does not work. All, except for failing to detect automatically the drive removal, is working reasonably well. So that's it for 64 bit. Some small issues, but I assume issues that are handleable. Now, getting back to the initial bug report, 32 bit kernel, latest Debian Sid, 2.6.22.1 smp kernel. Boot into kernel, hotplug in sata. System immediately freezes. Fully. Instantly. As you can see, hopefully, the behavior in 64 bit and 32 bit is not the same, so let's test 32 bit, and then worry about fixing the smaller 64 bit issues. Installing a 32 bit system on an amd64 box takes a few minutes. I would really like to get this issue resolved one day, but we're going to need to run the same kernels I think. Examining the freeze, I give up, and start the rseiub sequence, but then notice that at RS, suddenly the kernel sees the drive, with this message: ata6: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen then, about 40 of these in a row: vt: argh, driver_data is NULL! then the normal sata discovery data below that. Curious at this point, I check to see if I can remove the disk again, but, sadly, no joy, full freeze, rseiub. So that's the current status with the current kernels. Nothing much has changed, on 32 bit the situation is a bit worse than when I first reported this issue, on 64 bit, a bit better. Meanwhile, having given up on any near term solution in kernel, I'm still happily running my esata stuff via a pci-e sil 3112 card, which hotplugs happily. I'll see about running the next test building a kernel, as requested, but can we at least agree that 32 and 64 bit is not the same thing? and then maybe start looking at the same system for testing and debugging purposes? This does not strike me as an unreasonable request.
First of all, I think we need to clarify what is a freeze and what is disk activity stalling. If SysRq, etc. works then I don't call that a freeze, though if disk activity has stalled then processes may appear to be hung up. In this situation it may still un-hang after commands time out and trigger a reset. If it was truly frozen (interrupt storm, hardware stall, etc.) then all that would work would be reset button. Given that ambiguity, I'm still not entirely convinced that there is any consistent difference between 32 and 64 bit. The fact still remains that in neither case are you seeing all of the expected hotplug/hot-unplug interrupts showing up, whereas for me, in 64-bit mode, it consistently works. Without explaining why that is, there's little point in running other tests. Maybe it is a different chipset revision? Can you post the output of "lspci -v" ?
Created attachment 12312 [details] lspci -v output There's the lspci -v output. Why you won't test your claims re 32/64 bit, validate them empirically as correct or incorrect by actually running the stuff for real, is beyond me. If the result is the same, great, then you know. If it's not, great, then you know that too. Isn't it better to KNOW than to believe when it comes to questions like this? From what I can see, the guesses are making this bug worse on 32 bit and better on 64 bit. I find it odd to even have to say this, to be honest. Anyway, the 64 bit gives me some hope, since it's quite close to working, so that's something.
I'll give the theory that this is an event that should time out a try, and post back, however, that never was the case in the past. I'll give it... 10 minutes? Longer? then post back with the results. If I understand you right, if it's not a freeze, it will at some point recover. .
I gave it 30 minutes, nothing happened, it stays crashed. Whatever semantics you want to apply to this event is fine with me, when you hotplug sata in 32 bit kernel it's crashed and it doesn't recover. You can still use rbseiub, but that's all you can do. You are technically correct to note that the kernel is not fully frozen, gpm for example still works, but that's all that works, keyboard is dead, however you want to describe this is up to you, the kernel has crashed and does not recover on its own. For real world users, that's it, dead. Also noteworthy in my opinion is the fact that while on hotplug after doing RS you get the event of sata install registered, output etc, on removal of the drive, nothing happens, you just have to rseiub your way out of the situation, it's fully locked up. I can't really tell you more than this, when I have time I'll try to make one of the kernels with framepointer to see if I can get some more output. By the way, hotplug works great in windows 2000 on this box, mobo, etc, can't remember if I mentioned that.
Created attachment 12315 [details] root version of lspci-v, has some more data Just noticed that the first version was not complete, this one has all the output, no 'access denied', sorry.
Harald, please lemme clarify a few things. In, say, 95% of cases, 64 and 32 bit kernels act the same as far as libata drivers are concerned. I guess this one might fall in the 5% region tho. Also, distinguishing between 'system freeze' and 'disk stall' is very important primarily because it gives a good clue about what's going on in the machine. System freeze, as Robert said, indicates IRQ storm or ATA controller dying while holding PCI BUS, etc whereas 'disk stall' indicates missing EH events, transmission errors or other driver bugs, so getting this right is important for debugging. Robert isn't trying to tell you it's less broken that way. A simple way to test is pressing numlock and see whether the LED responds. And, yeah, I suppose windows drivers are in better shape. Till this day, no one in Linux land has docs for nv adma. All we have is quasi-working sample driver from nvidia and Robert has been reverse engineering a lot, so let's give him some credit for that. I understand that debugging process can be very frustrating and appreciate the time and energy you put into fixing this problem. Bug reporters who can test patches and report the result are really valuable to the development process. I just wanted to clarify a few things so that there's no misunderstanding. Thanks.
yes, I understand all this stuff, but I still fail to understand the logic, if there is any, behind simply spending 10 minutes to install a 32 bit system then confirming that yes, in fact, it's the same. This is what I find frustrating more than anything else in this process. I do it, and I'm not some elite hacker, so what is the problem here? The refusal to do this basic step I find almost absurd, given that it takes about 10 minutes to install a distro, far far far less time than I'm spending on each one of these steps, so other than sheer poverty, a 100% full hard disk, or whatever other reason, I simply cannot see any valid reason to not do this simple thing. The freeze issue I understand fine, but since I'm describing the phenomenae as they occur, it's fairly clear what's happening. I pointed out that windows worked to show that the hardware itself is probably not at fault, nothing more. Maybe I should start posting on the nv forums. The problem here is that a large number of higher end consumer mobos are still running nforce4 stuff, I was surprised recently to find how many are running this. having to reverse engineer is obviously no fun at all, and it's not to nvidia's credit that they are not providing better code, why that is is hard to understand. However, it does seem that this process is basically not getting anywhere, so it strikes me as a good time to try different things at this point.
Well, your nVidia Corporation CK804 Serial ATA Controller is revision f3, same as mine, so that doesn't explain any difference. Looking at some of your earlier output, I see this line on a hot-unplug case: ata2: EH in ADMA mode, notifier 0x0 notifier_error 0x0 gen_ctl 0x1581000 status 0x402 next cpb count 0x0 next cpb idx 0x0 This indicates that the controller is showing interrupt pending and ADMA hot-unplug in status, but apparently it didn't actually raise the interrupt. These controllers have this funky notifier mechanism for acknowledging interrupts which is a real pain, honestly. Can you try making a change in sata_nv.c, this line is on line 898 in the current git tree, should be nearby in similar versions: if(notifier_clears[0] || notifier_clears[1]) { Can you change that to just: { and let me know if that changes anything? That will make it write to the notifier clear registers regardless of whether we think we need to or not.
That patch seems to have undone whatever damage other patches did since I first filed this bug report. So it's an improvement. In 32 bit, now, when I hotplug the sata drive, it acts as expected, it registers the event, detects the drive, everything is normal, pretty much the same as in 64 bit, maybe a bit slower, I'd have to really look at both to compare, my impression was that 64 bit registered and reacted a bit faster, but I'd have to really time it to be sure, neither took very long. Sadly, the initial problem remains the same, 32 bit crashes on hotunplug, rseiub required to get out. I don't mind running these patches if they are simple like this, it's fairly easy to do, albeit time consuming. So that's promising, or something anyway, which is nice to see, your guess looks like it was right for hotplug, not for hotunplug.
*** Bug 7545 has been marked as a duplicate of this bug. ***
*** Bug 6665 has been marked as a duplicate of this bug. ***
*** Bug 8787 has been marked as a duplicate of this bug. ***
Small update, some more facts. I'm hoping that one day, one of these facts is going to set off a bell in someone's head, and the fix will become obvious: # dmidecode | grep -A4 Base Base Board Information Manufacturer: MICRO-STAR INTERNATIONAL CO., LTD Product Name: MS-7125 Version: 1.0 Serial Number: I was talking to a developer, and he was trying to reproduce the bug on an MSI neo4 platinum nforce4 board, but he realized that there are several main revisions, he was testing on revision 3.0, while mine is the first, Version: 1.0, the one with the mobo dual nvidia+marvel gigabit networking. So that's the first thing, this bug appears to strike nforce4 PCB version 1.0 but not 3.0. I don't know if there is a 2.0, or if it is affected. This might account for why this is so difficult to reproduce. Second, and I thought much more interesting. Since my last tests, I had forgotten to change my esata drive from the mobo attached ports to the sil pci-x card, I started it up, having forgotten this, and of course the system instantly froze, as expected with the stock kernels, unpatched. I swore to myself, then thought, oh well, might as well try to move the esata cable from the mobo connected port to the pcix card port before I give up and do a hard reboot. And guess what! The kernel instantly de-crashed, unlocked, unfroze, pick your term... not after after a few seconds, it was instant. I think this has to give some meaningful clue, I can I suspect reproduce this with some kernel and dmesg output, but this has to be meaningful, an event exists that will uncrash the kernel instantly when applied, and that event is removing the connection and creating a new one to a non problem sata chip.
It would certainly appear that the board type/revision is somehow related here.. Possibly it's a different revision of the chipset or something (though we already checked that the PCI revision code is the same as mine, that doesn't necessarily mean anything though). I'm not really sure how we can tell what is going on here, though, without getting some assistance from NVIDIA and/or MSI..
for me she showed up one time in the floor of hour, however as a result of this problem I lost a mysql database - mysql innodb not start - "Accertion error" - did not help even "innodb_force_recovery = 4", backup was an a week remoteness - the works of whole department lost data for a few days, the management simply in shock - I going to discharge from job ((( this problem already whole year: ----------- I'm stumped trying to track down the below intermittent problem..... I've confirmed this problem on 2.6.19, 2.6.20 and 2.6.21. ----------- http://lkml.org/lkml/2007/6/14/154 http://kerneltrap.org/mailarchive/linux-kernel/2007/6/14/103765 http://kerneltrap.org/node/16175 "System hang from time to time" http://bugzilla.kernel.org/show_bug.cgi?id=8300 "sata hotplug removal of drive freezes all 2.6.21 kernels" http://bugzilla.kernel.org/show_bug.cgi?id=8421 "(sata_via) system freeze in random time" http://bugzilla.kernel.org/show_bug.cgi?id=9115 "kernel freezes with on clockevent warning" http://bugzilla.kernel.org/show_bug.cgi?id=9834 "[pata_ali] Unspecified hang on Acer laptop" http://bugzilla.kernel.org/show_bug.cgi?id=9898 "System freezes after I/O on pata_jmicron device" http://bugzilla.kernel.org/show_bug.cgi?id=10296 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217920 https://bugs.launchpad.net/ubuntu/+bug/164183 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/229747 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159521 https://bugs.launchpad.net/ubuntu/+bug/164183 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/187146 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/221437 https://bugs.launchpad.net/ubuntu/+bug/226600
kiev, I think you are referring to a bunch of entirely unrelated problems that happen to have some similar symptoms..