Bug 11178

Summary: Secondary hard drive fails during both hibernation and resume.
Product: Drivers
Reporter: Alan Jenkins (alan-jenkins)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: CLOSED CODE_FIX
Severity: normal
CC: alex.shi, brian, bunk, djtm, jbarnes, m.s.tsirkin, matthew, rjw, tj
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.27-rc1
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 7216, 11167
Attachments: Dmesg including errors during resume
Output of lspci
dmesg showing errors after suspend to ram
Proposed fix
dmesg from suspend/resume without pci=nomsi
lspci from acer aspire 4350
dmesg from suspend/resume *with* pci=nomsi

Description Alan Jenkins 2008-07-30 04:53:56 UTC
Latest working kernel version: 2.6.26
Earliest failing kernel version: 2.6.27-rc1
Distribution: Ubuntu Hardy (8.04)
Hardware Environment: Intel C2D.  Two hard drives.
Software Environment: 64 bit kernel & userspace.
Problem Description:

During hibernation I get errors on the second hard drive, including

ata3 revalidation failed errno=-5

Hibernation continues anyway (the swap partition is on the first hard drive).  I think these messages don't show up in dmesg afterwards because they happen too late in the hibernation process.

I get much the same errors on resume as well, which do show up in dmesg.

Steps to reproduce:
Boot in "rescue mode" (didn't try init=/bin/bash) and use "echo disk > /sys/power/state"
Comment 1 Alan Jenkins 2008-07-30 04:55:13 UTC
Created attachment 17027 [details]
Dmesg including errors during resume
Comment 2 Alan Jenkins 2008-07-30 04:56:46 UTC
Created attachment 17028 [details]
Output of lspci

lspci may help identify my chipset.
Comment 3 Alan Jenkins 2008-07-30 05:16:36 UTC
Created attachment 17029 [details]
dmesg showing errors after suspend to ram

The same errors also occur on resume from STR.  Here's a dmesg of that.  I booted with init=/bin/bash this time, so this log may be less noisy.  I used s2ram --force --acpi_sleep=3.

(My machine can't be whitelisted by s2ram because its DMI identification is uselessly generic).
Comment 4 Alan Jenkins 2008-07-30 06:28:54 UTC
Aha, a lucky guess.

The commit responsible is 24920c8a6358bf5532f1336b990b1c0fe2b599ee.
("AHCI: speed up resume").
Comment 5 Alan Jenkins 2008-07-30 07:19:11 UTC
Sorry for slandering AHCI; that's not it.  I tried to test using s2ram again and I must have got the options wrong.  "speed up resume" is not the problem.
Comment 6 Rafael J. Wysocki 2008-07-30 08:08:08 UTC
(In reply to comment #5)
> Sorry for slandering AHCI; that's not it.  I tried to test using s2ram again
> and I must have got the options wrong.  "speed up resume" is not the problem.

I assume this means the problem occurs, but is not related to the "AHCI: speed up resume" commit.

Well, nothing obvious comes to mind and I'm unable to reproduce this.
Comment 7 Alan Jenkins 2008-07-30 12:40:24 UTC
The problem happens, but not in that commit.  This is the right (or rather, wrong :-) commit:

ce6fce4295ba727b36fdc73040e444bd1aae64cd is first bad commit
commit ce6fce4295ba727b36fdc73040e444bd1aae64cd
Author: Matthew Wilcox <matthew@wil.cx>
Date:   Fri Jul 25 15:42:58 2008 -0600

    PCI MSI: Don't disable MSIs if the mask bit isn't supported

    David Vrabel has a device which generates an interrupt storm on the INTx
    pin if we disable MSI interrupts altogether.  Masking interrupts is only
    a performance optimisation, so we can ignore the request to mask the
    interrupt.

    Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
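
The "mask bit" in the commit message above is the per-vector masking capability that a device advertises in its 16-bit MSI Message Control register. The following is a minimal sketch of decoding that register, assuming the raw value has already been read from the device's MSI capability (for example from a config-space dump). The bit values follow the PCI_MSI_FLAGS_* constants in the kernel's pci_regs.h; the function name decode_msi_control is hypothetical and is not part of the kernel or of any tool mentioned in this report.

```python
# Bits of the 16-bit MSI Message Control register. The values mirror the
# PCI_MSI_FLAGS_* constants in Linux's pci_regs.h.
MSI_FLAGS_ENABLE = 0x0001   # MSI delivery is enabled
MSI_FLAGS_64BIT = 0x0080    # 64-bit message address supported
MSI_FLAGS_MASKBIT = 0x0100  # per-vector masking supported


def decode_msi_control(control: int) -> dict:
    """Decode the flags relevant to this bug (hypothetical helper)."""
    return {
        "enabled": bool(control & MSI_FLAGS_ENABLE),
        "addr64": bool(control & MSI_FLAGS_64BIT),
        "maskable": bool(control & MSI_FLAGS_MASKBIT),
    }
```

A device for which "maskable" is False is one that would take the "mask bit isn't supported" path touched by this commit.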
Comment 8 Alan Jenkins 2008-07-30 14:07:18 UTC
Since the above commit makes a difference, my hardware must trip the "mask bit
isn't supported" case.

$ grep MSI /proc/interrupts
222:        225          0   PCI-MSI-edge      ahci
$
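
The same check can be done programmatically. This is a minimal sketch that scans /proc/interrupts text for PCI-MSI lines, assuming the column layout shown above (IRQ number, per-CPU counts, chip type, handler name). The function name msi_interrupts is hypothetical, and taking the last field as the handler is a simplification that does not cover shared IRQs with several handlers.

```python
def msi_interrupts(text: str) -> list:
    """Return (irq, total_count, handler) for every PCI-MSI line."""
    rows = []
    for line in text.splitlines():
        irq, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip the CPU header line and anything that is not an MSI interrupt.
        if not sep or not any(f.startswith("PCI-MSI") for f in fields):
            continue
        # The leading numeric fields are the per-CPU interrupt counts.
        counts = []
        for field in fields:
            if not field.isdigit():
                break
            counts.append(int(field))
        rows.append((irq.strip(), sum(counts), fields[-1]))
    return rows


# The line quoted in this comment:
line = "222:        225          0   PCI-MSI-edge      ahci"
print(msi_interrupts(line))  # [('222', 225, 'ahci')]
```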

This MSI interrupt stuff is a mystery to me.  I guess something breaks the
assertion that masking interrupts was only a performance optimisation...

Oops.  I forgot that there is something different about the second hard drive. 
The second one is actually a PATA drive (the other one is SATA).  Sorry, that
was a stupid omission.

So this is almost certainly a bad interaction between ahci and ata_piix.
Comment 9 Alan Jenkins 2008-07-31 03:15:08 UTC
Sorry, another mistake.

The drive that fails is not the "second" drive, the PATA one.  The failure is in the original _SATA_ drive.  The original drive became "sdb" after I added the PATA one.

# PATA devices - sda, sr0
$ ls /sys/bus/pci/drivers/ata_piix/*/host*/target*/*:*/
/sys/bus/pci/drivers/ata_piix/0000:00:1f.1/host0/target0:0:0/0:0:0:0/:
block:sda  device_blocked    generic        ioerr_cnt      model        queue_type  scsi_device:0:0:0:0  scsi_level  timeout  vendor
bus        driver            iocounterbits  iorequest_cnt  power        rescan      scsi_disk:0:0:0:0    state       type
delete     evt_media_change  iodone_cnt     modalias       queue_depth  rev         scsi_generic:sg0     subsystem   uevent

/sys/bus/pci/drivers/ata_piix/0000:00:1f.1/host0/target0:0:1/0:0:1:0/:
block:sr0  device_blocked    generic        ioerr_cnt      model        queue_type  scsi_device:0:0:1:0  state      type
bus        driver            iocounterbits  iorequest_cnt  power        rescan      scsi_generic:sg1     subsystem  uevent
delete     evt_media_change  iodone_cnt     modalias       queue_depth  rev         scsi_level           timeout    vendor

# SATA devices - sdb
$ ls /sys/bus/pci/drivers/ahci/*/host*/target*/*:*/
block:sdb  device_blocked    generic        ioerr_cnt      model        queue_type  scsi_device:2:0:0:0  scsi_level  timeout  vendor
bus        driver            iocounterbits  iorequest_cnt  power        rescan      scsi_disk:2:0:0:0    state       type
delete     evt_media_change  iodone_cnt     modalias       queue_depth  rev         scsi_generic:sg2     subsystem   uevent
Comment 10 Alan Jenkins 2008-07-31 10:43:08 UTC
Created attachment 17043 [details]
Proposed fix

Gotcha.

Matthew's original commit missed a second piece of logic that does the same thing on resume.  The attached patch fixes this on my computer.

Hopefully Pavel's boot-time suspend self-test will get this sort of thing noticed earlier :-).
Comment 11 Rafael J. Wysocki 2008-07-31 11:50:27 UTC
Handled-By : Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=17043&action=view
Comment 12 Alan Jenkins 2008-08-01 06:39:55 UTC
I'm not sure what is meant by Handled-By.  I'm still hoping Matthew will accept ownership of this bug (and my proposed patch).

I tried correcting the Product/Component to Drivers/PCI, but Bugzilla said I had to change the Version to 2.5 first :-(.

Prod me if there's anything more I can do.
Comment 13 Rafael J. Wysocki 2008-08-01 13:22:44 UTC
Handled-By means that you have identified the problem and proposed a patch to fix it.
Comment 14 Rafael J. Wysocki 2008-08-02 13:11:36 UTC
*** Bug 11232 has been marked as a duplicate of this bug. ***
Comment 15 Alex Shi 2008-08-05 22:48:10 UTC
I see a similar problem on my IBM T61.
It reports an Input/output error after resuming from suspend to RAM.
Comment 16 Alex Shi 2008-08-05 23:10:50 UTC
The "Proposed fix" patch works for me, so I hope it can be merged upstream.
Comment 17 Dionisus Torimens 2008-08-05 23:48:10 UTC
Solves it for me, too.
Comment 18 David Brownell 2008-08-06 02:18:08 UTC
I couldn't resume on my system either; this fixed it.  Surprised the fix didn't find its way into RC2 ...
Comment 19 Jesse Barnes 2008-08-06 09:15:12 UTC
The fix looks good; I just want to get willy's ack before merging it.  I'll ask Linus to pull it asap after that...

Thanks,
Jesse
Comment 21 Rafael J. Wysocki 2008-08-11 16:09:13 UTC
*** Bug 11283 has been marked as a duplicate of this bug. ***
Comment 22 Rafael J. Wysocki 2008-08-16 12:36:20 UTC
*** Bug 11214 has been marked as a duplicate of this bug. ***
Comment 23 Brian J. Murrell 2008-12-19 19:27:29 UTC
I think I've got this same bug on an Acer Aspire 4350.  Both the Ubuntu 2.6.27-9-generic kernel and the 2.6.28-rc8 kernel.org kernel I've built have this patch in them, yet I still see this problem.

I will post some attachments with the particulars of my machine.

I find that if I boot either of these kernels with pci=nomsi, then the problem does not occur.
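
When comparing logs like the ones attached below, it helps to confirm which boots really had the workaround active. This is a minimal sketch that checks a kernel command line string (for example the contents of /proc/cmdline) for a given pci= option. It relies on pci= taking a comma-separated option list; the function name has_pci_option is hypothetical.

```python
def has_pci_option(cmdline: str, option: str = "nomsi") -> bool:
    """Return True if `option` appears in a pci=... kernel parameter."""
    for param in cmdline.split():
        key, sep, value = param.partition("=")
        # pci= accepts several comma-separated options, e.g. pci=noacpi,nomsi
        if key == "pci" and sep and option in value.split(","):
            return True
    return False
```

For a running system, the string could be read with open("/proc/cmdline").read() and passed to the function.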
Comment 24 Brian J. Murrell 2008-12-19 19:29:28 UTC
Created attachment 19394 [details]
dmesg from suspend/resume without pci=nomsi

This is the dmesg after a resume where I have not booted the kernel with pci=nomsi and the disk is not working.
Comment 25 Brian J. Murrell 2008-12-19 19:30:19 UTC
Created attachment 19395 [details]
lspci from acer aspire 4350

lspci from Acer Aspire 4350
Comment 26 Brian J. Murrell 2008-12-19 19:31:10 UTC
Created attachment 19396 [details]
dmesg from suspend/resume *with* pci=nomsi

Successful (albeit slow) resume when using pci=nomsi.