Summary: Secondary hard drive fails during both hibernation and resume.
Product: Drivers
Reporter: Alan Jenkins (alan-jenkins)
Severity: normal
CC: alex.shi, brian, bunk, djtm, jbarnes, m.s.tsirkin, matthew, rjw, tj
Bug Depends on:
Bug Blocks: 7216, 11167
Dmesg including errors during resume
Output of lspci
dmesg showing errors after suspend to ram
dmesg from suspend/resume without pci=nomsi
lspci from acer aspire 4350
dmesg from suspend/resume *with* pci=nomsi
Description Alan Jenkins 2008-07-30 04:53:56 UTC
Comment 1 Alan Jenkins 2008-07-30 04:55:13 UTC
Created attachment 17027 [details] Dmesg including errors during resume
Comment 2 Alan Jenkins 2008-07-30 04:56:46 UTC
Created attachment 17028 [details] Output of lspci
lspci may help identify my chipset.
Comment 3 Alan Jenkins 2008-07-30 05:16:36 UTC
Created attachment 17029 [details] dmesg showing errors after suspend to ram
The same errors also occur on resume from STR. Here's a dmesg of that. I booted with init=/bin/bash this time, so this log may be less noisy. I used s2ram --force --acpi_sleep=3. (My machine can't be whitelisted by s2ram because its DMI identification is uselessly generic.)
Comment 4 Alan Jenkins 2008-07-30 06:28:54 UTC
Aha, a lucky guess. The commit responsible is 24920c8a6358bf5532f1336b990b1c0fe2b599ee ("AHCI: speed up resume").
Comment 5 Alan Jenkins 2008-07-30 07:19:11 UTC
Sorry for slandering AHCI; that's not it. I tried to test using s2ram again and I must have got the options wrong. "speed up resume" is not the problem.
Comment 6 Rafael J. Wysocki 2008-07-30 08:08:08 UTC
(In reply to comment #5)
> Sorry for slandering AHCI; that's not it. I tried to test using s2ram again
> and I must have got the options wrong. "speed up resume" is not the problem.

I assume this means the problem occurs, but is not related to the "AHCI: speed up resume" commit.

Well, nothing obvious comes to mind and I'm unable to reproduce this.
Comment 7 Alan Jenkins 2008-07-30 12:40:24 UTC
The problem happens, but not in that commit. This is the right (or rather, wrong :-) commit:

ce6fce4295ba727b36fdc73040e444bd1aae64cd is first bad commit
commit ce6fce4295ba727b36fdc73040e444bd1aae64cd
Author: Matthew Wilcox <firstname.lastname@example.org>
Date:   Fri Jul 25 15:42:58 2008 -0600

    PCI MSI: Don't disable MSIs if the mask bit isn't supported

    David Vrabel has a device which generates an interrupt storm on the INTx
    pin if we disable MSI interrupts altogether. Masking interrupts is only
    a performance optimisation, so we can ignore the request to mask the
    interrupt.

    Signed-off-by: Matthew Wilcox <email@example.com>
    Signed-off-by: Jesse Barnes <firstname.lastname@example.org>
Comment 8 Alan Jenkins 2008-07-30 14:07:18 UTC
Since the above commit makes a difference, my hardware must trip the "mask bit isn't supported" case.

$ grep MSI /proc/interrupts
222:        225          0   PCI-MSI-edge      ahci
$

This MSI interrupt stuff is a mystery to me. I guess something breaks the assertion that masking interrupts was only a performance optimisation...

Oops. I forgot that there is something different about the second hard drive. The second one is actually a PATA drive (the other one is SATA). Sorry, that was a stupid omission. So this is almost certainly a bad interaction between ahci and ata_piix.
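For anyone repeating the check above on their own machine, here is a minimal Python sketch that pulls the MSI-delivered interrupts out of /proc/interrupts-style text. It assumes the layout shown in the comment (IRQ label, one counter per CPU, a type column such as PCI-MSI-edge, then the handler name); newer kernels print the type differently, so the handler text may need adjusting there. The function name `msi_interrupts` and the IO-APIC line in the sample are illustrative, not taken from the reporter's machine.

```python
def msi_interrupts(interrupts_text):
    """Return (irq, total_count, handler) for lines whose type column mentions MSI."""
    results = []
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Skip the CPU header row and anything without an "NNN:" label.
        if len(fields) < 3 or not fields[0].endswith(":"):
            continue
        irq = fields[0].rstrip(":")
        # The per-CPU counters are the all-digit fields right after the label.
        i = 1
        total = 0
        while i < len(fields) and fields[i].isdigit():
            total += int(fields[i])
            i += 1
        rest = fields[i:]
        if not rest or "MSI" not in rest[0]:
            continue
        results.append((irq, total, " ".join(rest[1:])))
    return results


# Sample text: the MSI line is the one quoted in the comment above; the
# header row and the IO-APIC line are made up for contrast.
SAMPLE = """\
           CPU0       CPU1
 14:        100          0   IO-APIC-edge      ata_piix
222:        225          0   PCI-MSI-edge      ahci
"""

print(msi_interrupts(SAMPLE))
```

On a live system the same function can be fed the contents of /proc/interrupts; a driver that appears here is using MSI rather than a legacy INTx line.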
Comment 9 Alan Jenkins 2008-07-31 03:15:08 UTC
Sorry, another mistake. The drive that fails is not the "second" drive, the PATA one. The failure is in the original _SATA_ drive. The original drive became "sdb" after I added the PATA one.

# PATA devices - sda, sr0
$ ls /sys/bus/pci/drivers/ata_piix/*/host*/target*/*:*/
/sys/bus/pci/drivers/ata_piix/0000:00:1f.1/host0/target0:0:0/0:0:0:0/:
block:sda  device_blocked    generic        ioerr_cnt      model        queue_type  scsi_device:0:0:0:0  scsi_level  timeout  vendor
bus        driver            iocounterbits  iorequest_cnt  power        rescan      scsi_disk:0:0:0:0    state       type
delete     evt_media_change  iodone_cnt     modalias       queue_depth  rev         scsi_generic:sg0     subsystem   uevent

/sys/bus/pci/drivers/ata_piix/0000:00:1f.1/host0/target0:0:1/0:0:1:0/:
block:sr0  device_blocked    generic        ioerr_cnt      model        queue_type  scsi_device:0:0:1:0  state       type
bus        driver            iocounterbits  iorequest_cnt  power        rescan      scsi_generic:sg1     subsystem   uevent
delete     evt_media_change  iodone_cnt     modalias       queue_depth  rev         scsi_level           timeout     vendor

# SATA devices - sdb
$ ls /sys/bus/pci/drivers/ahci/*/host*/target*/*:*/
block:sdb  device_blocked    generic        ioerr_cnt      model        queue_type  scsi_device:2:0:0:0  scsi_level  timeout  vendor
bus        driver            iocounterbits  iorequest_cnt  power        rescan      scsi_disk:2:0:0:0    state       type
delete     evt_media_change  iodone_cnt     modalias       queue_depth  rev         scsi_generic:sg2     subsystem   uevent
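The same question (which PCI driver sits behind a given disk) can be asked in the other direction, starting from the block device name. The following is a rough Python sketch under an assumption that does not come from this report: that /sys/block/<name> is a symlink into the /sys/devices tree and that the owning PCI device directory carries a `driver` symlink pointing into /sys/bus/pci/drivers. The function name and the `sys_root` parameter are hypothetical; the parameter exists only so the lookup can be tried against a copy of the tree.

```python
import os


def pci_driver_for_block_device(name, sys_root="/sys"):
    """Return the PCI driver name behind block device `name`, or None if not found."""
    sys_root = os.path.realpath(sys_root)
    drivers_dir = os.path.join(sys_root, "bus", "pci", "drivers")
    # Resolve the block device to its real location in the device tree.
    node = os.path.realpath(os.path.join(sys_root, "block", name))
    # Walk up towards the root; the SCSI device has its own "driver" link,
    # so only accept a link that lands directly in bus/pci/drivers.
    while node.startswith(sys_root + os.sep):
        link = os.path.join(node, "driver")
        if os.path.islink(link):
            target = os.path.realpath(link)
            if os.path.dirname(target) == drivers_dir:
                return os.path.basename(target)
        node = os.path.dirname(node)
    return None


# Prints the driver name on a system matching the assumed layout, else None.
print(pci_driver_for_block_device("sdb"))
```

On the reporter's machine, going by the listings above, one would expect this to give ahci for sdb and ata_piix for sda, provided the sysfs layout matches the assumption.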
Comment 10 Alan Jenkins 2008-07-31 10:43:08 UTC
Created attachment 17043 [details] Proposed fix
Gotcha. Matthew's original commit missed a second piece of logic that does the same thing on resume. The attached patch fixes this on my computer. Hopefully Pavel's boot-time suspend self-test will get this sort of thing noticed earlier :-).
Comment 11 Rafael J. Wysocki 2008-07-31 11:50:27 UTC
Handled-By : Alan Jenkins <email@example.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=17043&action=view
Comment 12 Alan Jenkins 2008-08-01 06:39:55 UTC
I'm not sure what is meant by Handled-By. I'm still hoping Matthew will accept ownership of this bug (and my proposed patch). I tried correcting the Product/Component to Drivers/PCI, but Bugzilla said I had to change the Version to 2.5 first :-(. Prod me if there's anything more I can do.
Comment 13 Rafael J. Wysocki 2008-08-01 13:22:44 UTC
Handled-By means that you have identified the problem and proposed a patch to fix it.
Comment 14 Rafael J. Wysocki 2008-08-02 13:11:36 UTC
*** Bug 11232 has been marked as a duplicate of this bug. ***
Comment 15 Alex Shi 2008-08-05 22:48:10 UTC
I see a similar problem on my IBM T61. It reports an Input/output error after resuming from suspend to RAM.
Comment 16 Alex Shi 2008-08-05 23:10:50 UTC
The "proposed fix" patch works for me. I hope it can be merged upstream.
Comment 17 Dionisus Torimens 2008-08-05 23:48:10 UTC
Solves it for me, too.
Comment 18 David Brownell 2008-08-06 02:18:08 UTC
I couldn't resume on my system either; this fixed it. Surprised the fix didn't find its way into RC2 ...
Comment 19 Jesse Barnes 2008-08-06 09:15:12 UTC
The fix looks good, I just want to get willy's ack before merging it. I'll ask Linus to pull it asap after that... Thanks, Jesse
Comment 20 Rafael J. Wysocki 2008-08-11 12:59:39 UTC
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=abad2ec98f2ef357d62026cbc3989dabf33f2435
Comment 21 Rafael J. Wysocki 2008-08-11 16:09:13 UTC
*** Bug 11283 has been marked as a duplicate of this bug. ***
Comment 22 Rafael J. Wysocki 2008-08-16 12:36:20 UTC
*** Bug 11214 has been marked as a duplicate of this bug. ***
Comment 23 Brian J. Murrell 2008-12-19 19:27:29 UTC
I think I've got this same bug on an Acer Aspire 4350. Both the Ubuntu 2.6.27-9-generic and the 2.6.28-rc8 kernel.org kernels I've built have this patch in them, yet I still see the problem. I will post some attachments with the particulars of my machine. I find that if I boot either of these kernels with pci=nomsi, then the problem does not occur.
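When comparing dmesg logs taken with and without the workaround, it helps to confirm which boot actually had MSI disabled. A small Python sketch, assuming the kernel command line is available as text (for example from /proc/cmdline) and that `pci=` takes a comma-separated list of options; the helper names and the example command lines are illustrative, not taken from the reporter's machine.

```python
def pci_options(cmdline):
    """Collect the comma-separated values of every pci= parameter on a command line."""
    options = []
    for param in cmdline.split():
        if param.startswith("pci="):
            options.extend(o for o in param[len("pci="):].split(",") if o)
    return options


def msi_disabled(cmdline):
    """True if the command line asks the PCI core not to use MSI (pci=nomsi)."""
    return "nomsi" in pci_options(cmdline)


# Hypothetical command lines for illustration.
print(msi_disabled("root=/dev/sda1 ro pci=nomsi quiet"))
print(msi_disabled("root=/dev/sda1 ro quiet"))
```

This only reports what was requested at boot; whether a given driver ended up on MSI is still best checked in /proc/interrupts.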
Comment 24 Brian J. Murrell 2008-12-19 19:29:28 UTC
Created attachment 19394 [details] dmesg from suspend/resume without pci=nomsi
This is the dmesg after a resume where I have not booted the kernel with pci=nomsi and the disk is not working.
Comment 25 Brian J. Murrell 2008-12-19 19:30:19 UTC
Created attachment 19395 [details] lspci from acer aspire 4350
lspci from Acer Aspire 4350
Comment 26 Brian J. Murrell 2008-12-19 19:31:10 UTC
Created attachment 19396 [details] dmesg from suspend/resume *with* pci=nomsi
Successful (albeit slow) resume when using pci=nomsi.