This is a Lenovo ThinkPad R61 7733A82 with a hotpluggable "Ultrabay" DVD±RW/CD-RW drive. If I remove the drive before booting and if I never insert it, suspend/resume works. However, once I insert the drive (or if I start up with the drive inserted), all future resumes will fail unless I insert the drive before resuming. Resume failures come in the form of a freeze about a second into the resume process. The screen stays black, and the system is unresponsive to input - even the magic sysreq keys don't work. It's worth noting that hotplug is otherwise usable. I can even suspend the system after inserting and then removing the drive, and resume will work as long as I re-insert the drive before resume. Latest working kernel version: never known to work Earliest known failing kernel version: 2.6.26.3 Distribution: Ubuntu 8.04 amd64 Hardware Environment: lspci describes my ATA controller like so: > 00:1f.0 ISA bridge: Intel Corporation 82801HBM (ICH8M-E) LPC Interface > Controller (rev 03) > 00:1f.1 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) IDE > Controller (rev 03) > 00:1f.2 SATA controller: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA > AHCI Controller (rev 03) > 00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev > 03) Software Environment: sysfs tells me that the optical drive is under ata_piix, and the hard drive is under ahci. > $ readlink /sys/block/sr0/device > ../../devices/pci0000:00/0000:00:1f.1/host3/target3:0:0/3:0:0:0 > $ readlink /sys/devices/pci0000:00/0000:00:1f.1/driver > ../../../bus/pci/drivers/ata_piix > $ readlink /sys/block/sda/device > ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0 > $ readlink /sys/devices/pci0000:00/0000:00:1f.2/driver > ../../../bus/pci/drivers/ahci I have libata's ACPI support enabled (/sys/module/libata/parameters/noacpi == 0), and it appears that libata is managing Ultrabay hotswap. When I reproduced the freeze with TRACE_RESUME (/sys/power/pm_trace), I got this: > [ 5.853095] Magic number: 0:983:889 > [ 5.853098] hash matches drivers/base/power/main.c:350 > [ 5.853194] pci 0000:00:1f.2: hash matches My interpretation: it crashes when trying to resume the ahci.ko-controlled Intel ICH8M-E SATA controller. I will attach my full dmesg and lspci outputs. I'm very willing to post additional info or test patches. (And please tell me if this report was too verbose.) Thanks much, Daniel Gnoutcheff
Created attachment 18166 [details] Kernel bootup messages
Created attachment 18167 [details] Kernel messages showing hotplug For what its worth ...
Created attachment 18168 [details] lspci output
I've found that if I rmmod ata_piix, this problem disappears. Although, unsurprisingly, /dev/sr0 disappears when I do this. Re-loading ata_piix does not cause the drive to reappear.
can you please try latest acpi-test tree, there are some dock related fixes.
(In reply to comment #5) > can you please try latest acpi-test tree, there are some dock related fixes. Done, though I needed to follow Zhang Rui's instructions at bug 11563 comment # 21 (http://bugzilla.kernel.org/show_bug.cgi?id=11563#c21) in order to get the thing to compile. This bug is still present in acpi-test. I tried using /sys/power/pm_trace again, and I'm still getting the same results: > [ 6.030187] Magic number: 0:983:889 > [ 6.030189] hash matches drivers/base/power/main.c:350 > [ 6.030288] pci 0000:00:1f.2: hash matches I.E. still failing when resuming the ahci.ko-controlled SATA controller.
Can you try with init=/bin/bash? Just to make sure that it's not triggered through userspace somehow. Thanks.
By throwing TRACE_RESUME(0); in various places, I have been able to do a quasi-backtrace of the freeze: drivers/ata/ahci.c:2296 (ahci_pci_device_resume) drivers/ata/libata-core.c:5909 (ata_pci_device_do_resume) drivers/pci/pci.c:585 (pci_set_power_state) drivers/pci/pci.c:494 (pci_raw_set_power_state) msleep (which hangs forever) msleep is supposed to delay for pci_pm_d3_delay = 10 milliseconds. Instead, it hangs forever - but only in the case where the optical drive has been removed from the system. I've confirmed (by inserting debug messages) that pci_raw_set_power_state also calls msleep when resuming while the drive *is* plugged in, and apparently, it works correctly in that case. Why the Ultrabay device would have any effect on msleep is beyond me. But in any case, it appears that this might not really be a libata problem. All of this, BTW, is against git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6.git test. Please tell me if I should try out a different repository. Holger Macht, I'll try your test when I have time again.
Created attachment 18236 [details] TRACE_RESUME markers that reveal msleep as culprit
OK, I've finally gotten back to this. :D The problem remains on Ubuntu 8.10 amd64 with the master branch at git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git (At least as of commit d9d060a98f) More work with TRACE_RESUME allows me to extend the trace a little further: ahci_pci_device_resume ata_pci_device_do_resume pci_set_power_state pci_raw_set_power_state msleep schedule_timeout_uninterruptible schedule_timeout schedule (which never returns) By the time I reached the heart of the scheduling system, I decided that as a novice at C, I had to stop. ;) I'm currently looking around for other ways to debug this, looking to see if I can get anywhere with kdump. FWIW, some other Ubuntu users are reporting a problem that looks very similar: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/292144
Holger Macht, I've done a test similar to the one you suggested. I've booted the kernel and loaded a shell inside an initrd, and I loaded only the modules relevant to this bug, such that /proc/modules listed only: - ahci - ata_piix - libata - scsi_mod (libata dependency) Note that the SCSI disk and cdrom drivers were not loaded, so "sda" and "sr0" didn't show up. I was still able to reproduce the bug, with no noticeable changes in behavior. FYI, init=/bin/bash doesn't work for me because I'm basing my configuration on Ubuntu's, and Ubuntu depends on the initrd to load the disk drivers. Without the initrd, the root partition is inaccessible. So instead, I passed break=top, which tells Ubuntu's initrd init script to drop to a shell after mounting /proc, mounting /sys, making a tmpfs at /dev, and doing a depmod -a. HTH.
Re-confirming with the latest commit (a3dd15444b) of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git master. I've managed to get some hopefully-relevant kernel core dumps with kdump. I kludged drivers/ata/ahci.c:ahci_pci_device_resume and /kernel/timer.c:schedule_timeout to force a panic right before the schedule() call that never returns. (I'll attach the patch.) With this, I used kdump to get two dumps; one when trying to resume with the drive removed, the other with the drive inserted. I'm now going to go teach myself how to actually use and compare these dumps. If anyone happens to have any ideas on what I should look at first, I'd find that helpful. TTFN.
Created attachment 19345 [details] Kludge to force a panic right before the ill-fated schedule() call
Hmmm... I have no idea here. Looks like setting the power state somehow disturbs the machine. What happens if you comment out the msleep() there? Rafael, any ideas?
Call pci_restore_state() before setting the power state to D0? I think the pci_set_power_state(pdev, PCI_D0) in ata_pci_device_do_resume() can be removed altogether, but I'm not sure if that helps fix the problem at hand. If that's the case, the patch from http://lkml.org/lkml/2009/1/20/287 on top of 2.6.29-rc2 should help too.
The power calls have been there from the beginning and nobody really looked into whether it's correct or not. Anyways, Daniel, can you please comment out the pci_set_power() call and see whether anything changes?
> The power calls have been there from the beginning and nobody really looked > into whether it's correct or not. Anyways, Daniel, can you please comment > out > the pci_set_power() call and see whether anything changes? I will make that test (and some others that have been suggested) once I have time. Thanks so much for following up on this, I had almost given up.
OK, I've checked out linus' v2.6.29-rc3. This bug still exists in that version. I then commented out the "pci_set_power_state(pdev, PCI_D0)" in ata_pci_device_do_resume and recompiled. As far as I can tell, this has caused no change in behavior. Suspend/resume still works if the drive stays plugged in all the time, and hotplug still works, but remove -> suspend -> resume is still deadly. So, for this system at least, that line is unnecessary, but it's not the cause of this bug. ;) When I get back to this (probably this weekend), I'll do another bout of TRACE_RESUME stuff to see where it's crashing now. Again, thanks for your attention to this bug!
On Thursday 29 January 2009, bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=11703 > ------- Comment #18 from gnoutchd@union.edu 2009-01-29 11:42 ------- > OK, I've checked out linus' v2.6.29-rc3. This bug still exists in that > version. Would it be possible to check where _exactly_ it fails in 2.6.29-rc3?
Reconfirming with (git-describe == v2.6.30-rc3-323-g9f5a691). (In reply to comment #19) > Would it be possible to check where _exactly_ it fails in 2.6.29-rc3? I think I may *finally* be able to get that to you ;) In a nutshell, I *think* the system is hardlocking at ata_sff_check_status. Here's my pseudo-backtrace: in kernel thread "scsi_eh_0" (which corresponds to the DVD-drive in question): * scsi_error_handler derefs shost->transportt->eh_strategy_handler to call: * ata_scsi_error derefs shost->hostdata[0] to get ata_port* ap derefs ap->ops->error_handler to call: * ata_sff_error_handler * ata_do_eh * ata_eh_recover * ata_eh_reset derefs parameter prereset (which comes from ap->ops->prereset) to call: * piix_pata_prereset * ata_sff_prereset * ata_sff_wait_ready * ata_wait_ready derefs parameter check_ready, calling: * ata_sff_check_ready derefs ap->ops->sff_check_status to call: * ata_sff_check_status * ioread8(ap->ioaddr.status_addr) * HARD LOCKUP! I should note that the above trace reflects my system as it is with SMP off and with ahci unloaded. I'm not ready to hurt my brain with those complicating factors yet. ;) Also, since I'm limited to TRACE_RESUME, there is a fair amount of inference on my part, so there's about a 15% chance I got something wrong here. For reference, here's how I decided to look at scsi_error_handler: After removing the pci_set_power_state, I noticed that the apparent crash location became at lot less consistent. TRACE_RESUME showed the system going a little farther into the resume process before hardlocking, and the exact failure point varied. I decided to kludge-up drivers/base/power/main.c to do a msleep after each device resume - and I discovered that msleep starts to "fail" right after the ata_piix-controled device is resumed. I later confirmed that the bug still occurs when ahci is removed, so ahci does not seem to have anything to do with this bug. I then noticed that piix_pci_device_resume, among other things, causes its scsi error handler thread to wake up. It occurred to me that the hardlocks might actually be occurring in that thread - and the msleep was just providing a good excuse for the scheduler to switch to the lethal thread. So I started tracing that read with TRACE_RESUME, and I got the results shown above. Thanks again guys, hope this is useful! Later, Daniel
Created attachment 21123 [details] ata_piix-pcs-hardreset.patch Does the attached patch make any difference?
Created attachment 21124 [details] ata_piix-pcs-hardreset.patch Oops, please try this one.
Created attachment 21126 [details] ata_piix module load w/ ata_piix-pcs-hardreset.patch (In reply to comment #22) > Created an attachment (id=21124) [details] > ata_piix-pcs-hardreset.patch > > Oops, please try this one. I'm afraid that did not help. With that patch applied, I get a BUG immediately upon ata_piix module insertion. (FWIW, the shell from which I run "modprobe ata_piix" never returns, but the system is otherwise responsive.) I've attached the kernel messages that correspond to the module load and failure. I've played a bit with objdump, and my best guess is that it's failing at drivers/ata/ata_piix.c:957 (with patch applied, of course ;). Thanks again for following up on this!
Oops, yeah, I got confused. PCS hardreset of course doesn't work on PATA. :-( Hmmm... Can you please make piix_pata_prereset() do ssleep(1) and then return 0 without doing anything else and see whether anything changes?
(In reply to comment #24) > Oops, yeah, I got confused. PCS hardreset of course doesn't work on PATA. > :-( > > Hmmm... Can you please make piix_pata_prereset() do ssleep(1) and then return > 0 without doing anything else and see whether anything changes? OK, I tried that. It produced no noticeable changes in system behavior - i.e. it's still crashing. I then did some TRACE_RESUME foolery again, and it looks like with that change, it's crashing just a bit later in the error handler. Updated backtrace, covering what's changed: ata_eh_reset (most likely in scsi_eh_0, as earlier) * ata_eh_freeze_port * __ata_port_freeze dereferences ap->ops->freeze to call: * ata_sff_freeze * iowrite8(ap->ctl, ioaddr->ctl_addr) * HARD LOCKUP (Same disclaimers about possible inaccuracy apply ;)
Hmm.. access to ctl freezes too? That means SRST can't be issued at all. Maybe it requires full controller reset, which we currently can't do... Does it work if you remove ata_piix before suspend and reload it after resume?
(In reply to comment #26) > Hmm.. access to ctl freezes too? That means SRST can't be issued at all. > Maybe it requires full controller reset, which we currently can't do... Does > it work if you remove ata_piix before suspend and reload it after resume? OK, I've tried that and a few other combinations, with interesting results: Scenario 1: 1. Boot with drive removed 2. Load ata_piix 3. Insert drive 4. Remove drive 5. Suspend 6. Resume (successfully) 7. Reload ata_piix Result: Lockup (shell unresponsive, but kernel still seems to respond to Alt+Sysrq+c) Scenario 2: 1-6: same as with scenario 1 7: Reinsert drive (ata_piix is not loaded) Result: Kernel OOPS, shell unresponsive, responsive Alt+Sysrq+c. After a long delay, a kernel panic. Scenario 3: 1. Boot with drive removed 2. Load ata_piix 3. Insert drive 4. Unload ata_piix 5. Remove drive Result: Lockup! Scenario 4: 1. Boot with ata_piix unloaded and drive removed 2. Insert drive 3. Remove drive 4. Suspend 5. Resume (successfully) 6. Reinsert drive (successfully!) 7. load ata_piix Result: system still running with no obvious problems! I've got some ideas of how to further investigate some of these crashes, but it'll be time consuming, particularly because (last time I checked) David Anderson's crash utility isn't working with current git kernels - and the older kernels that are debugable with crash produce significantly different behavior in some of these scenarios. Any other combos I should try? Thanks, Daniel
Sorry, I'm out of ideas. This is currently the only open bug report where ata_piix causes lock up during suspend/resume. Argh... weird. It could be that we're barking up the wrong tree. Maybe what we need is some acpi magic. Is there any bios update available for the machine?
(In reply to comment #28) > Is there any bios update available for the machine? There was. I was running on the version the machine came with, which was quite old. I've just upgraded it to the very latest version (released this past January). So far, I've seen no change; this bug remains, both with current git and with 2.6.29. FWIW, here is the BIOS changelog. (I went from 1.14-1.06 to 2.25-1.08) http://www-307.ibm.com/pc/support/site.wss/document.do?lndocid=MIGR-68451 Next weekend, I'll try investigating some of the new (possibly related) crashes we've run into. Perhaps that will get us some ideas. Thanks, Daniel
On the piix PATA controller, there are bits to tristate the interface which is to be used for PATA hotplugging and also is used for ultrabay. Maybe the driver needs to handle tristating but I'm basically shooting in the dark. If you're comfortable with massaging the driver, it would be something worth investigating. Good luck.
Thanks go to "Nils" at linux-thinkpad.org for finding a workaround! http://mailman.linux-thinkpad.org/pipermail/linux-thinkpad/2008-March/042891.html Adapted for my system, the workaround consists of: 1. Remove drive (with ata_piix loaded) 2. echo "0 0 0" | sudo tee /sys/class/scsi_host/host3/scan # (host3 corresponds to the ultrabay it seems) 3. Suspend 4. Resume 5. Profit! When I do this, the only thing that still fails is that I still get a hardlock when I try to yank out the drive while the system is suspended. Otherwise, as far as I can tell, hotplug and suspend work perfectly together. The system even notices the drive automatically if I re-insert it while suspended. This works for both v2.6.29 and for v2.6.30-rc4-336-g8c9ed89. Interesting thing is, that "rescan" produces no other noticeable effects: even without it, udev, hal and friends still notice the disappearance of the drive, and at least some of the relevant sysfs entries disappear as appropriate. But I haven't really investigated this yet, there might be something something anomalous that I haven't noticed. Perhaps I could take Anderson's crash utility to v2.6.29 and try to see what changes when I "rescan". Please tell me if this is indeed worth investigating. ([Google] + [some lucky search terms] == [awesomeness!]) Thanks again, Daniel
Heh.... I have no idea what's going on. ;-P Can you please post kernel log from boot to successful resume?
Created attachment 21494 [details] startup>suspend>resume w/ rescan workaround Attaching requested log (from v2.6.30-rc6-168-g6a44587) I've inserted lines (all starting with a '#') into the log to reflect what I was doing when. The TOC is: # 0: starting system # 1: loading ata_piix # 2: yanking drive # 3: rescan # 4: suspend/resume # 5: reinsert My exploring with crash hasn't really gotten me anywere; as far as I can tell, the softreset is the only really notable thing that happens when I rescan.
Created attachment 21558 [details] resume-verbose.patch Thanks for testing. Can you please apply the attached patch and repeat once more?
Created attachment 21578 [details] startup>remove>suspend>resume w/ resume-verbose.patch, rescan Done. Same set of embedded comments as before.
Sorry but I have no idea what different the extra rescan makes. From driver's POV, the resume operation is identical with or without the manual rescan. It's great that you've found a workaround but -ENOIDEA. :-( Anyone with ideas?
Interesting new find: this bug is actually a *regression*. Last known working version: 2.6.24-rc1 First known failing version: 2.6.24-rc2 With 2.6.24-rc1 and earlier, it's possible undock like so: 1. echo 1 | sudo tee /sys/class/scsi_device/3:0:0:0/device/delete 2. echo 1 | sudo tee /sys/devices/platform/bay.0/eject 3. Yank drive And suspend/resume works with that. It's not pretty (I'm glad I don't have to do steps 1-2 anymore), but it doesn't crash. But by 2.6.24-rc2, *in addition* to the above procedure, I have to do a manual rescan after yanking the drive. Otherwise, I get a hardlock on resume, just like with current kernels. Thanks go to Constantine Gavrilov for his report of a very similar-looking bug. His report gave me the idea to look at earlier versions: https://bugzilla.redhat.com/show_bug.cgi?id=473219 I'm pretty much in the middle of a git-bisect on this. It might become tricky, since at some point in the -rc2 development cycle suspend was completely broken, that might make it difficult to determine where *this* bug was introduced.
git bisect says: cdeab1140799f09c5f728a5ff85e0bdfa5679cd2 is first bad commit commit cdeab1140799f09c5f728a5ff85e0bdfa5679cd2 Author: Tejun Heo <htejun@gmail.com> Date: Mon Oct 29 16:41:09 2007 +0900 libata: relocate forcing PIO0 on reset Forcing PIO0 on reset was done inside ata_bus_softreset(), which is a bit out of place as it should be applied to all resets - hard, soft and implementation which don't use ata_bus_softreset(). Relocate it such that... * For new EH, it's done in ata_eh_reset() before calling prereset. * For old EH, it's done before calling ap->ops->phy_reset() in ata_bus_probe(). This makes PIO0 forced after all resets. Another difference is that reset itself is done after PIO0 is forced. Signed-off-by: Tejun Heo <htejun@gmail.com> Acked-by: Alan Cox <alan@redhat.com> Signed-off-by: Jeff Garzik <jeff@garzik.org> :040000 040000 cd76c03ceb1aa0e777cef62d31b80c805fbd529f d480c62e5a6d971c63dd10451fbf855bfb20f296 M drivers
Well, yeah, it can be called a regression but it's just getting affected by side effects of mostly unrelated changes. The bisected commit doesn't really reveal the root cause of the problem. It only adds to the controller's erratic behavior. It looks like the only thing we can do here is try not to touch the controller at all when the hotplug events are driven by ACPI. I wonder whether there's a way to determine device presence using ACPI. Holger, is there? Thanks.
Created attachment 21592 [details] acpi-link-disable.patch Can you please apply this patch on top of resume-verbose.patch and retest?
Created attachment 21595 [details] dmesg w/ resume-verbose.patch, acpi-link-disable.patch, rescan Done. Bug is still present, rescan workaround still necessary. Attaching dmesg of suspend/resume with rescan, TOC: # 0: starting system # 1: loading ata_piix # 2: yanking drive # 3: rescan # 4: suspend/resume # 5: reinsert # 6: suspend/resume # 7: suspend/resume, again
Created attachment 21596 [details] ata-devchk-kludge.patch More interesting news: the attached kludge seems to fix this bug when applied on top of v2.6.30-rc7-97-gb5c42bc. Basically, I noticed that before the bisected commit, ap->ops->set_piomode is called on an ata_device only if ata_devchk succeeds on that device (roughly speaking). After the commit, ata_devchk is not consulted. I decided to see what happens when the original, more conservative behavior is restored. And it works in that there are no more hardlocks, ever. Some related issues remain, however. If I reinsert the drive while the system is suspended, the it does not detect it on resume; a manual rescan is needed for that. I've also had a few other odd behaviors, but I can't reproduce any of them; I'll have to keep watch. No crashes though, not yet! Hope this helps!
(In reply to comment #39) > Well, yeah, it can be called a regression but it's just getting affected by > side effects of mostly unrelated changes. The bisected commit doesn't really > reveal the root cause of the problem. It only adds to the controller's > erratic > behavior. It looks like the only thing we can do here is try not to touch > the > controller at all when the hotplug events are driven by ACPI. I wonder > whether > there's a way to determine device presence using ACPI. Holger, is there? Sure there is. However, we should always keep 'git show fc5a9f8841ee87d93376ada5d73117d4d6a373ea' in mind. In some BIOS, calling _STA triggers a DEVICE_CHECK, not sure if this would hurt us much in this case. /** * dock_present - see if the dock station is present. * @ds: the dock station * * execute the _STA method. note that present does not * imply that we are docked. */ static int dock_present(struct dock_station *ds) { unsigned long long sta; acpi_status status; if (ds) { status = acpi_evaluate_integer(ds->handle, "_STA", NULL, &sta); if (ACPI_SUCCESS(status) && sta) return 1; } return 0; }
I just discovered PCI config spaces. ;) I notice now that there is a change in the output of 'lspci -xxxx -s 00:1f.1' when doing a rescan after removing a device (at least under v2.6.29): --- pci-config-removed 2009-05-28 12:10:28.000000000 -0400 +++ pci-config-rescan 2009-05-28 12:11:48.000000000 -0400 @@ -1,18 +1,18 @@ 00:1f.1 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) IDE Controller (rev 03) 00: 86 80 50 28 05 00 88 02 03 8a 01 01 00 00 00 00 10: 01 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00 20: 01 1c 00 00 00 00 00 00 00 00 00 00 aa 17 a6 20 30: 00 00 00 00 00 00 00 00 00 00 00 00 ff 03 00 00 -40: 00 e3 00 40 00 00 00 00 01 00 02 00 00 00 00 00 +40: 00 c0 00 40 00 00 00 00 00 00 02 00 00 00 00 00 50: 00 00 00 00 10 00 01 00 00 00 00 00 00 00 00 00 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c0: 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f0: 00 00 00 00 00 00 00 00 86 0f 05 00 00 00 00 00 In short, 0x41 changes from 0xe3 to 0xc0, and 0x48 changes from 0x01 to 0x00. Not sure if this is significant ...
Created attachment 21621 [details] reg-fiddle.patch Interesting excerpt from the ICH8 datasheet, pg 195: (http://www.intel.com/Assets/PDF/datasheet/313056.pdf) > > During a PATA swap bay operation, if the operating system executes > cycles to the IDE interface after it has been powered down it will > cause the ICH8 to hang the system that is waiting for IORDY to be > asserted from the drive. > > To correct this issue, the following BIOS procedures are required for > performing an IDE swap: > > 1. Program IDE SIG_MODE (Configuration register at offset 54h) to 10b > (drive low mode). > > 2. Clear IORDY Sample Point Enable (bits 1 or 5 of IDE Timing reg.). > This prevents the ICH8 from waiting for IORDY assertion when the > operating system accesses the IDE device after the IDE drive powers > down, and ensures that 0s are always be returned for read cycles > that occur during swap operation. Since it seems like specified clearing and setting is not occurring on my system, I decided to see what happens when I force it. This is done in the attached kludge. This patch is successful in preventing the hardlocks, even in the "yank while suspended" case. However, once the drive is yanked, it is never recognized on reinsertion - I get messages about the corresponding ACPI event, but /dev/sr0 doesn't reappear, even after a rescan. I'm guessing this is because I don't set SIG_MODE back to 00b at any point. Hope this helps!
Nice spotting, Daniel. Hmmmm.... does it still work if you only change the timing register but leave the SIG_MODE alone? Thanks.
(In reply to comment #46) > Nice spotting, Daniel. Hmmmm.... does it still work if you only change the > timing register but leave the SIG_MODE alone? Thanks. I'm afraid that does not work. (That's actually what I tried first, I should have mentioned that.) For the record, I have not yet tried fiddling with just SIG_MODE, but I would think we should "fix" both if only because the spec says so ...
I thought the latter could be integrated into the current reset protocol easily, so... Anyways, it would be great if you can try SIG_MODE only. Thanks.
(In reply to comment #48) > I thought the latter could be integrated into the current reset protocol > easily, so... Anyways, it would be great if you can try SIG_MODE only. > Thanks. Done. SIG_MODE by itself works.
Created attachment 21644 [details] acpi-lflag-disable.patch Can you please test this patch and post kernel log? Thanks.
Created attachment 21738 [details] dmesg with acpi-lflag-disable.patch (In reply to comment #50) > Created an attachment (id=21644) [details] > acpi-lflag-disable.patch > > Can you please test this patch and post kernel log? Thanks. Done. Rescan workaround still needed, attaching log, TOC: # 0: startup # 1: load ata_piix # 2: normal suspend/resume # 3: yank drive # 4: rescan # 5: suspend/resume # 6: reinsert drive (Sorry about the delay on this one, busy with schoolwork this week and the next.)
Why is LFLAG_DISABLED being ignored? I'm missing something. I'll prep another patch soon. Please standby a bit. Thanks.
OK, I think I've found the root cause of this bug! I decided to investigate piix_set_piomode, since we found earlier (comment 42) that avoiding calls to that function successfully avoided the crashes. In that function, I found two very suspicious lines: > if (ata_pio_need_iordy(adev)) > control |= 2; /* IE enable */ This turns on the IORDY bit. But the spec tells us that this bit *must* be off when the drive is removed, or we get hardlocks (comment 45). Since this function seems to be run very early in the resume process, this looks bad to me. Commenting out those lines KILLS this bug! No more crashes, perfect hotplug! This neatly explains why my earlier attempts to force IORDY off didn't work - this function overrode my fiddling. Hope that helps!
Created attachment 21836 [details] no-iordy-during-reset.patch
Thanks a lot for tracking it down. You're a much better at debugging than I'm. :-) Can you please try the above attached patch?
(In reply to comment #55) > Thanks a lot for tracking it down. You're a much better at debugging than > I'm. :-) Aw, thanks! I really must give credit to Constantine Gavrilov, though, his reporting really helped me out (comment 37). > Can you please try the above attached patch? Done. It works. Grand! Random FYI, there are some other hotplug-related bugs that I haven't yet reported that remain unresolved with that patch (e.g. drive not getting recognized when inserted while suspended). I haven't looked into them yet because they're relatively minor, but I think I will pursue them now that this crasher's under control. I just wanted to mention this since those bugs *might* affect the way we resolve this one.
Okay, I'll push the patch upstream. Hmm... libata does probing reset after resume so the device should get detected and it works with sata devices just fine. Yeap, getting that fixed would be great. Thanks.
This bug seems to be fixed in 2.6.31-rc1.
Fixed by 0d9e6659a1bde3733cfd0072adbb3514b579e383