Kernel Bug Tracker – Bug 11178
Secondary hard drive fails during both hibernation and resume.
Last modified: 2008-12-19 19:31:10 UTC
Latest working kernel version: 2.6.26
Earliest failing kernel version: 2.6.27-rc1
Distribution: Ubuntu Hardy (8.04)
Hardware Environment: Intel C2D. Two hard drives.
Software Environment: 64 bit kernel & userspace.
During hibernation I get errors on the second hard drive, including
ata3 revalidation failed errno=-5
Hibernation continues anyway (the swap partition is on the first hard drive). I think these messages don't show up in dmesg afterwards because they happen too late in the hibernation process.
I get much the same errors on resume as well, which do show up in dmesg.
Steps to reproduce:
Boot in "rescue mode" (didn't try init=/bin/bash) and use "echo disk > /sys/power/state"
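For anyone comparing logs, here is a minimal sketch for checking a saved dmesg for the failure quoted above. The helper name and the idea of saving dmesg to a file first are assumptions for illustration; it only matches the text "revalidation failed" as quoted in this report.

```shell
# Hypothetical helper: count "revalidation failed" lines in a saved dmesg.
# Usage: count_revalidation_failures FILE
count_revalidation_failures() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps the helper usable under "set -e".
    grep -c 'revalidation failed' "$1" || true
}
```

After a resume this could be used as "dmesg > after-resume.txt" followed by "count_revalidation_failures after-resume.txt"; a non-zero count suggests the drive hit the error described here.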
Created attachment 17027 [details]
Dmesg including errors during resume
Created attachment 17028 [details]
Output of lspci
lspci may help identify my chipset.
Created attachment 17029 [details]
dmesg showing errors after suspend to ram
The same errors also occur on resume from STR. Here's a dmesg of that. I booted with init=/bin/bash this time, so this log may be less noisy. I used s2ram --force --acpi_sleep=3.
(My machine can't be whitelisted by s2ram because its DMI identification is uselessly generic.)
Aha, a lucky guess.
The commit responsible is 24920c8a6358bf5532f1336b990b1c0fe2b599ee.
("AHCI: speed up resume").
Sorry for slandering AHCI; that's not it. I tried to test using s2ram again and I must have got the options wrong. "speed up resume" is not the problem.
(In reply to comment #5)
> Sorry for slandering AHCI; that's not it. I tried to test using s2ram again
> and I must have got the options wrong. "speed up resume" is not the problem.
I assume this means the problem occurs, but is not related to the "AHCI: speed up resume" commit.
Well, nothing obvious comes to mind and I'm unable to reproduce this.
The problem does happen, but that commit is not the cause. This is the right (or rather, wrong :-) commit:
ce6fce4295ba727b36fdc73040e444bd1aae64cd is first bad commit
Author: Matthew Wilcox <firstname.lastname@example.org>
Date: Fri Jul 25 15:42:58 2008 -0600
PCI MSI: Don't disable MSIs if the mask bit isn't supported
David Vrabel has a device which generates an interrupt storm on the INTx
pin if we disable MSI interrupts altogether. Masking interrupts is only
a performance optimisation, so we can ignore the request to mask the
Signed-off-by: Matthew Wilcox <email@example.com>
Signed-off-by: Jesse Barnes <firstname.lastname@example.org>
Since the above commit makes a difference, my hardware must trip the "mask bit
isn't supported" case.
$ grep MSI /proc/interrupts
222: 225 0 PCI-MSI-edge ahci
This MSI interrupt stuff is a mystery to me. I guess something breaks the
assertion that masking interrupts was only a performance optimisation...
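For reference, whether a device advertises per-vector masking is visible in its MSI capability: in the Message Control word, bit 8 (0x0100, PCI_MSI_FLAGS_MASKBIT in the kernel headers) is the mask-bit flag. Below is a rough sketch for decoding such a value. The helper name is made up, and how the Message Control word is obtained (for example from a config space dump) is left as an assumption.

```shell
# Hypothetical helper: decode an MSI Message Control word given as a number
# (hex with a 0x prefix is fine). Prints "maskable" or "not maskable".
msi_mask_support() {
    flags=$(( $1 ))
    # Bit 8 (0x100) is the per-vector masking capable flag.
    if [ $(( flags & 0x100 )) -ne 0 ]; then
        echo "maskable"
    else
        echo "not maskable"
    fi
}
```

A device that prints "not maskable" would be one that takes the "mask bit isn't supported" path discussed in the commit message.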
Oops. I forgot that there is something different about the second hard drive.
The second one is actually a PATA drive (the other one is SATA). Sorry, that
was a stupid omission.
So this is almost certainly a bad interaction between ahci and ata_piix.
Sorry, another mistake.
The drive that fails is not the "second" drive, the PATA one. The failure is in the original _SATA_ drive. The original drive became "sdb" after I added the PATA one.
# PATA devices - sda, sr0
$ ls /sys/bus/pci/drivers/ata_piix/*/host*/target*/*:*/
block:sda device_blocked generic ioerr_cnt model queue_type scsi_device:0:0:0:0 scsi_level timeout vendor
bus driver iocounterbits iorequest_cnt power rescan scsi_disk:0:0:0:0 state type
delete evt_media_change iodone_cnt modalias queue_depth rev scsi_generic:sg0 subsystem uevent
block:sr0 device_blocked generic ioerr_cnt model queue_type scsi_device:0:0:1:0 state type
bus driver iocounterbits iorequest_cnt power rescan scsi_generic:sg1 subsystem uevent
delete evt_media_change iodone_cnt modalias queue_depth rev scsi_level timeout vendor
# SATA devices - sdb
$ ls /sys/bus/pci/drivers/ahci/*/host*/target*/*:*/
block:sdb device_blocked generic ioerr_cnt model queue_type scsi_device:2:0:0:0 scsi_level timeout vendor
bus driver iocounterbits iorequest_cnt power rescan scsi_disk:2:0:0:0 state type
delete evt_media_change iodone_cnt modalias queue_depth rev scsi_generic:sg2 subsystem uevent
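The two listings above can be condensed into a small helper that prints only the block device names for a given driver. This is a sketch, not a robust tool: the function name and the optional sysfs root argument are assumptions, and it relies on the "block:NAME" entries shown in the listings, which may be laid out differently on other kernel versions.

```shell
# Hypothetical helper: list block devices bound to a PCI driver, using the
# "block:NAME" entries shown in the listings above.
# Usage: driver_disks DRIVER [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys)
driver_disks() {
    root=${2:-/sys}
    for entry in "$root"/bus/pci/drivers/"$1"/*/host*/target*/*:*/block:*; do
        # An unmatched glob stays literal, so skip entries that do not exist.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        # Strip everything up to and including "block:" to leave the name.
        echo "${entry##*block:}"
    done
}
```

On the machine in this report, "driver_disks ahci" would be expected to print sdb, and "driver_disks ata_piix" to print sda and sr0.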
Created attachment 17043 [details]
Matthew's original commit missed a second piece of logic that does the same thing on resume. The attached patch fixes this on my computer.
Hopefully Pavel's boot-time suspend self-test will get this sort of thing noticed earlier :-).
Handled-By : Alan Jenkins <email@example.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=17043&action=view
I'm not sure what is meant by Handled-By. I'm still hoping Matthew will accept ownership of this bug (and my proposed patch).
I tried correcting the Product/Component to Drivers/PCI, but Bugzilla said I had to change the Version to 2.5 first :-(.
Prod me if there's anything more I can do.
Handled-By means that you have identified the problem and proposed a patch to fix it.
*** Bug 11232 has been marked as a duplicate of this bug. ***
I hit a similar problem on my IBM T61: it reports an Input/output error after resuming from suspend to RAM.
The "proposed fix" patch works for me, so I hope it can be merged upstream.
Solves it for me, too.
I couldn't resume on my system either; this fixed it. Surprised the fix didn't find its way into RC2 ...
The fix looks good, I just want to get willy's ack before merging it. I'll ask Linus to pull it asap after that...
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=abad2ec98f2ef357d62026cbc3989dabf33f2435
*** Bug 11283 has been marked as a duplicate of this bug. ***
*** Bug 11214 has been marked as a duplicate of this bug. ***
I think I've hit this same bug on an Acer Aspire 4350. Both the Ubuntu 2.6.27-9-generic and the 2.6.28-rc8 kernel.org kernels I've built have this patch in them, yet I still see the problem.
I will post some attachments with the particulars of my machine.
I find that if I boot either of these kernels with pci=nomsi, then the problem does not occur.
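When comparing runs with and without pci=nomsi, it can help to record which mode the kernel actually booted in. The sketch below is one way to do that; the helper name and its optional file arguments are assumptions, and it only recognises the plain "pci=nomsi" form, not a comma-separated list of pci= options.

```shell
# Hypothetical helper: report whether MSI was disabled on the kernel command
# line and how many /proc/interrupts lines still mention MSI.
# Usage: msi_status [CMDLINE_FILE] [INTERRUPTS_FILE]
msi_status() {
    cmdline=${1:-/proc/cmdline}
    interrupts=${2:-/proc/interrupts}
    # Split the command line into one option per line and look for an
    # exact "pci=nomsi" word.
    if tr ' ' '\n' < "$cmdline" | grep -qx 'pci=nomsi'; then
        echo "pci=nomsi: yes"
    else
        echo "pci=nomsi: no"
    fi
    # Same check as "grep MSI /proc/interrupts" earlier in this report.
    echo "MSI interrupt lines: $(grep -c 'MSI' "$interrupts" || true)"
}
```

Running "msi_status" with no arguments reads the live /proc files; passing saved copies makes it possible to compare two boots side by side.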
Created attachment 19394 [details]
dmesg from suspend/resume without pci=nomsi
This is the dmesg after a resume where I have not booted the kernel with pci=nomsi and the disk is not working.
Created attachment 19395 [details]
lspci from acer aspire 4350
Created attachment 19396 [details]
dmesg from suspend/resume *with* pci=nomsi
Successful (albeit slow) resume when using pci=nomsi.