Bug 14442

Summary: resume after hibernate: /dev/sdb drops and returns as /dev/sde
Product: Drivers Reporter: Duncan (1i5t5.duncan)
Component: Video(DRI - non Intel)Assignee: drivers_video-dri
Status: CLOSED UNREPRODUCIBLE    
Severity: normal CC: florian, rjw, tj, yaa.bta
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32-rc3-52-g0eca52a Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216, 14230    
Attachments: dmesg incl a successful resume (fr init s) and a bad resume (fr normal)
kernel config
dmesg from 2.6.31, hibernate and resume from X.
new log, good, bad with three disks dropping, 2.6.32-rc8-11-ga8a8a66
bisect log
partial log from 2.6.33-rc7-25-g08c4f1b

Description Duncan 2009-10-20 01:52:10 UTC
kernel 2.6.31 works.  The first kernel I tried in the 2.6.32 series, 2.6.32-rc3-52-g0eca52a (linus kernel) has the bug, and it is still there as of 2.6.32-rc5-11-g2fdc246.

I'm running 4-disk md/mdp RAID, md0 is md RAID-1 for /boot, md_d1 is mdp RAID-6 for the main system including rootfs, an emergency rootfs backup snapshotted periodically when the system is working well, and a larger volume setup for LVM2 management, md_d2 is mdp RAID-0 for speed for stuff like the local gentoo tree, ccache and kernel git, which are "backed up to the net" or simply cache in the first place.  md_d1 is kernel command-line assembled in ordered to boot to the rootfs thereon.  The other two RAIDs are assembled from user mode by mdadm.

According to the logs, returning from hibernate the kernel sees /dev/sdb with a different ID and thus fails to reactivate it in that slot, returning it as /dev/sde instead.  Of course, this plays havoc with my RAID, the RAID-0 isn't usable, and the RAID-6 needs hdb/hde re-added with a 2-ish hour recovery.  That 2-ish hour recovery is of course the problem if I try to bisect this thing.

The hardware:  Now aging Tyan s2885 dual Opteron 290 board, currently 6 gig RAM (recently memtest checked, it was 8 gig but one stick was bad, I've not replaced it yet so only 6 gig now).  The drives are all Seagate SATA 300 gig, model ST3300831AS, FwRev 3.03, accessed using the mobo SATA_SIL hardware and kernel driver (built-in).

Critically, only /dev/sdb seems to be unrecognized by the kernel upon resume and be kicked to /dev/sde.  From the logs, it appears there's a model number and/or and sector mismatch for 2 or 3 of the drives, but three of them are apparently sorted out while /dev/sdb fails to sort out and ends up kicked, then reappearing as /dev/sde.  Here's the relevant portion of dmesg, from just as it brings the 4th CPU core online thru the first md_d2 (RAID-0) error kicking it into read-only:

CPU3 is up                                                                                                                           
pata_amd 0000:00:07.1: restoring config space at offset 0x1 (was 0x2000001, writing 0x2000005)                                       
sata_sil 0000:01:0b.0: restoring config space at offset 0x1 (was 0x2b00103, writing 0x2b00107)                                       
amd756_smbus 0000:00:07.3: boot interrupts on device [1022:746b] already disabled                                                    
Intel ICH 0000:00:07.5: PCI INT B -> GSI 17 (level, low) -> IRQ 17                                                                   
usb usb2: root hub lost power or was reset                                                                                           
usb usb3: root hub lost power or was reset                                                                                           
sata_sil 0000:01:0b.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17                                                                    
tg3: eth0: Link is down.                                                                                                             
ata6.00: ACPI cmd ef/03:44:00:00:00:a0 (SET FEATURES) filtered out                                                                   
ata6.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out                                                                   
ata6.00: configured for UDMA/66                                                                                                      
ata5.00: ACPI cmd ef/03:44:00:00:00:a0 (SET FEATURES) filtered out                                                                   
ata5.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out                                                                   
ata5.00: configured for UDMA/66                                                                                                      
usb usb4: root hub lost power or was reset                                                                                           
usb usb5: root hub lost power or was reset                                                                                           
usb usb1: root hub lost power or was reset                                                                                           
radeonfb (0000:05:00.0): resuming from state: 1...                                                                                   
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                                               
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                                               
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                                               
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                                               
ata1.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)                                                                  
ata1.00: revalidation failed (errno=-5)                                                                                              
ata4.00: n_sectors mismatch 586072368 != 458752                                                                                      
ata4.00: revalidation failed (errno=-19)                                                                                             
ata4.00: limiting speed to UDMA/100:PIO3                                                                                             
ata2.00: model number mismatch 'ST3300831AS' != 'ST3300S                             �'                                              
ata2.00: revalidation failed (errno=-19)                                                                                             
ata2.00: limiting speed to UDMA/100:PIO3                                                                                             
ata3.00: configured for UDMA/100                                                                                                     
ata3.00: configured for UDMA/100                                                                                                     
ata3: EH complete                                                                                                                    
sd 0:0:0:0: [sda] Starting disk                                                                                                      
tg3: eth0: Link is up at 100 Mbps, full duplex.                                                                                      
tg3: eth0: Flow control is on for TX and on for RX.                                                                                  
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                                               
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                                               
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                                               
ata2.00: n_sectors mismatch 586072368 != 268435455                                                                                   
ata2.00: revalidation failed (errno=-19)                                                                                             
ata2.00: disabled                                                                                                                    
ata1.00: model number mismatch 'ST3300831AS' != '/'                                                                                  
ata1.00: revalidation failed (errno=-19)                                                                                             
ata1.00: limiting speed to UDMA/100:PIO3                                                                                             
ata4.00: configured for UDMA/100                                                                                                     
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                                               
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)                                                                               
ata2.00: ATA-7: ST3300831AS, 3.03, max UDMA/133                                                                                      
ata2.00: 586072368 sectors, multi 16: LBA48 NCQ (depth 0/32)                                                                         
ata2.00: configured for UDMA/100                                                                                                     
ata1.00: configured for UDMA/100                                                                                                     
ata2.00: detaching (SCSI 1:0:0:0)                                                                                                    
sd 1:0:0:0: [sdb] Synchronizing SCSI cache                                                                                           
sd 1:0:0:0: [sdb] Stopping disk                                                                                                      
scsi 1:0:0:0: Direct-Access     ATA      ST3300831AS      3.03 PQ: 0 ANSI: 5                                                         
scsi 1:0:0:0: parent target1:0:0 should not be sleeping                                                                              
sd 1:0:0:0: [sde] 586072368 512-byte logical blocks: (300 GB/279 GiB)                                                                
sd 1:0:0:0: [sde] Write Protect is off                                                                                               
sd 1:0:0:0: [sde] Mode Sense: 00 3a 00 00                                                                                            
sd 1:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA                                              
 sde:                                                                                                                                
usb 3-1: reset low speed USB device using ohci_hcd and address 2                                                                     
 sde1 sde2 sde3 sde4 < sde5 >                                                                                                        
sd 1:0:0:0: [sde] Attached SCSI disk                                                                                                 
sd 2:0:0:0: [sdc] Starting disk                                                                                                      
sd 3:0:0:0: [sdd] Starting disk                                                                                                      
Restarting tasks ... done.                                                                                                           
Adding 4000176k swap on /dev/sda3.  Priority:100 extents:1 across:4000176k                                                           
Adding 4000176k swap on /dev/sdc3.  Priority:100 extents:1 across:4000176k                                                           
Adding 4000176k swap on /dev/sdd3.  Priority:100 extents:1 across:4000176k                                                           
raid5: Disk failure on sdb2, disabling device.                                                                                       
raid5: Operation continuing on 3 devices.                                                                                            
RAID5 conf printout:                                                                                                                 
 --- rd:4 wd:3                                                                                                                       
 disk 0, o:1, dev:sda2                                                                                                               
 disk 1, o:0, dev:sdb2                                                                                                               
 disk 2, o:1, dev:sdc2                                                                                                               
 disk 3, o:1, dev:sdd2                                                                                                               
RAID5 conf printout:                                                                                                                 
 --- rd:4 wd:3                                                                                                                       
 disk 0, o:1, dev:sda2                                                                                                               
 disk 2, o:1, dev:sdc2                                                                                                               
 disk 3, o:1, dev:sdd2                                                                                                               
agpgart-amd64 0000:04:00.0: AGP 3.0 bridge                                                                                           
agpgart-amd64 0000:04:00.0: putting AGP V3 device into 8x mode                                                                       
radeonfb 0000:05:00.0: putting AGP V3 device into 8x mode                                                                            
REISERFS error (device md_d2p1): vs-13070 reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [4293628 6795656 0x0 SD]                                                                                                                           
REISERFS (device md_d2p1): Remounting filesystem read-only                                                                           

I'll attach my kernel config and a full dmesg later, I'm headed into work momentarily and don't have time right now...
Comment 1 Duncan 2009-10-20 14:54:35 UTC
Created attachment 23477 [details]
dmesg incl a successful resume (fr init s) and a bad resume (fr normal)

Here's dmesg.

Around line 863 a first hibernate, from single user mode, begins.  Given the problems I had with it with earlier git kernels this cycle, I decided to try a suspend from single user mode with as much shut down as possible.  The RAID-6 was still up, as the rootfs is on it, but I believe I had the rootfs mounted read-only.

That hibernate resumes just fine, so I hope it's fixed, and dare to go back to normal operation.

Unfortunately, during a resume from a second hibernate from normal mode, hibernate starting around line 1002, the system kicks sdb and reloads it as sde.  The excerpt I posted above is lines 1059-1148.  I had just started the RAID-6 recovery, adding sde2 that should have been sdb2, when I took the dmesg.
Comment 2 Duncan 2009-10-20 15:01:39 UTC
Created attachment 23478 [details]
kernel config
Comment 3 Duncan 2009-10-30 15:08:51 UTC
Confirming it's still there as of 2.6.32-rc5-338-g2e2ec95.

In fact, I rebooted to the above kernel, had the system startup fine, but didn't boot into X (I don't have it start by default, here), and tested the hibernate.  It worked.

So I logged in and started X/KDE and went to work.  A few hours later I tested again, this time from inside X.  Now, my hibernate script does not stop X, but /does/ switch to an empty text VT so X isn't active at least, and I can see the text output of hibernate script as it stops NTP and writes the time to the hardware clock, before hibernating, and starts them back up upon resume, before returning to X.

So it flipped to the assigned VT and started the hibernate sequence... then almost immediately started switching stuff back on.  Apparently something running in X wouldn't let it hibernate.  I've not figured out what except that it might be related to a running konsole, but that's not unusual, when I'm in X.  Normally when it does that I either leave it running or quit whatever and try again.  Sort of like crashes in MS for some people I guess, I've just taken it as a fact of life that sometimes I can't hibernate until I quit X or at least konsole.  (Maybe after KMS...).

But, this time it triggered the bug!  This is the FIRST time I've seen it trigger WITHOUT a full hibernate, shutdown and restart.  It couldn't hibernate, but did get as far as freezing tasks and shutting down all but the boot CPU/Core.  Upon resume since it couldn't hibernate, the bug triggered.  The log reveals the same pattern of revalidation failure and ata model number miss-match, etc, but for different drives, and while it was /dev/sdb that dropped out last time, it was /dev/sdd that dropped out this time.

So:  We now know that it does NOT require the full hibernate sequence, only the freeze/thaw.  Upon thaw, model numbers don't match, revalidation fails, and a device drops out.

We also know that it doesn't happen all the time.  The last couple times, it has been the SECOND hibernate attempt, the first was fine, but I'm not sure whether that's chance or not.  I THINK it was triggering on the first attempt before, but can't say for sure.  So it might be irregular, or might only happen on attempts after the first one.  Naturally, with my data at stake and a 2+ hour RAID recovery every time it triggers, I'm hesitant to test it too much...

Finally, we know that it's not always the same drive that drops out, thus reducing the likelihood of hardware failure issues and probably eliminating any possibility of an interplay between the choice of suspend device and the device that goes down.

Meanwhile, I can confirm that the disk comes back as /dev/sde, and that removing the various /dev/sdX partitions from the RAIDs and re-adding them as /dev/sde partitions works fine... of course with the 2+ hour re-sync on the RAID-6... but the RAID-1 (which is normally active but without anything mounted, so no writes to require re-syncing) and RAID-0 come back without issues, using /dev/sde.

It can also be noted that the new /dev/sde retains its old ataX.00 and
sd Y:0:0:0 numeric identities (where Y=X-1, the ata number apparently being 1-based while the sd number is apparently 0-based), even while the sd-letter changes to e from one of a-d, but I suspect that's as expected for a revalidation failure such as this.
Comment 4 Tejun Heo 2009-10-30 15:57:17 UTC
Can you please attach working hibernation / resume logs from 2.6.31?  It looks like your hard drives fail initial probing attempts after resuming from hibernation and succeeding or failing resume is simply dependent on whether the hard drive successfully recovers while libata exception handler is retrying.  It could just be that recent kernels are causing problems because they're faster at resuming.

Thanks.
Comment 5 Duncan 2009-10-31 18:19:19 UTC
Created attachment 23610 [details]
dmesg from 2.6.31, hibernate and resume from X.

Here's the hibernate/resume dmesg from 2.6.31, as asked.  It's from within X, but as mentioned, my hibernate script switches to a text VC, leaving X running.

I've upgraded to xorg-server 1.7.1 since the previous log, so this one may have slightly different display logging related to that, but the sata disk logging should be the same.  Oh, and I've lost the RAID-0 apparently for good, due to the drives dropping out on it with this bug.  No big deal, I only had my Gentoo local repo, kernel git, and ccache on there, nothing I valuable as I knew the RAID-0 risks.  But that's going to be a slight difference in the dmesg as well.
Comment 6 Tejun Heo 2009-11-01 09:21:41 UTC
Hmmm.... no revalidation failure at all.  I can't think of anything from libata which can cause this.  Can you please try to bisect the problem and find out which commit makes those revalidation failures appear (resume failure itself is a side effect and its exact success or failure is most likely insignificant).

Thanks.
Comment 7 Duncan 2009-11-01 14:31:38 UTC
The problem with bisecting is that my rootfs is on mdraid -- on those same disks one of which drops out when the bug is triggered.  It's a two-plus hour RAID re-sync every time that happens, plus, I already appear to have lost the RAID-0 due to the problem.  (I was able to recover the RAID-0 a few times, as the first bad access would force it to read-only, but apparently there was a write to one of the remaining spindles before it was detected this last time, and I couldn't figure out how to bring it back online.  No big issue as I knew the risks of non-redundant striping and had only redownloadables and local ccache on it, but still...

I can try bisecting at least some of the way, to narrow it down, perhaps, but I need to double-check backups and that they're actually bootable to full operational status, etc, before I start deliberately triggering repeated rootfs RAID-fails, which is what bisecting means in this case.  I'll try to get to it, but I guess you can understand why I'm a bit more reluctant than usual...

Meanwhile, on those verify failures, where do the two values (the model number being at least part of the value) it's comparing come from?  Is it possible one of them is being accessed before it's written due to a race condition between cores or sockets?  Is one of them the value from pre-hibernate and one read "live" on resume, and either the pre-hibernate value isn't restored properly, or the new value is being read from hardware that wasn't reset to query-ready, so they don't compare equal?  It's obvious from the logs that one of the two values is getting seriously corrupted, but where are they coming from?  (I don't do C so can't really look at the source myself to see.)  That's why I put it as PM to begin with, as I thought it was corruption of the existing state over the hibernate, or at least freeze/thaw, since we now know it doesn't take a full hibernate.
Comment 8 Tejun Heo 2009-11-02 04:30:10 UTC
Hello, Duncan.

Probably the easiest way to bisect would be

* Disable raid-0 array.

* Take out one of the drives from root raid-1 array.

And then, you can bisect with the degraded root array and unassembled r0 drives.  That way you still have two working copies of the rootfs and you can rebuild either way after bisection is done and the r1 array wouldn't be affected in any way.

And the corruption is coming from the device.  After resuming, libata queries the drive to make sure the same drive is still there but your drives are replying with garbage for first few tries.  The query itself is single threaded and uses PIO so there shouldn't be any race there.  I have no idea what could be causing it at the moment, so the request for bisection.

Thanks.
Comment 9 Duncan 2009-11-02 08:35:12 UTC
As the first bit of the initial report says, rootfs is raid-6.  If it was raid-1, the bisect would indeed be much simpler, but it's raid-6.  The only thing on raid-1 is /boot, because grub can't boot off of raid-6.

But I have a nice new 1 TB external/USB (Fry's had them on sale for $80, better than pricewatch.com!) sitting beside me ATM.  I don't even know if this beast boots off of USB, but I'm about to find out.  If it does, I'll soon have a USB drive to boot off of if necessary.  If not, I can boot off the raid-1 /boot to rootfs on the USB drive once I get everything copied over.  Either way, I should soon be able to bisect this properly, without risking loss of the RAID-6 data should something go wrong when it's down a disk.
Comment 10 Duncan 2009-11-23 23:48:33 UTC
OK, after some time back on 2.6.31.5, away from kernel testing while I rearranged md/raids (no more raid-6 now, all raid-1 save for the single raid-0 with rebuildable/redownloadable only data) and took care to ensure I had sufficient backups, I'm back.

I'll be switching back to current linus/mainline HEAD momentarily, and trying to recreate the suspend issue.  If it still occurs, I'll bisect, with the raid-0 disabled as suggested (if possible, I've noted it doesn't seem to always trigger, a complication for the bisect).  The raid-1s are all backed up, so no worries there.
Comment 11 Duncan 2009-11-24 01:51:22 UTC
OK, it's still there as of 2.6.32-rc8-11-ga8a8a66.

This time, it didn't trigger on the first or second hibernate, from text mode (no X running), but did on the third, the first time from X.  So it's NOT the second suspend that does it, but it MIGHT be some interaction with X.

Also FWIW, my setup no longer uses LVM, and as I mentioned above, the RAID-6 personality, so md-raid456-personality and device-manager are now configured OUT of the kernel (tho md-raid0 and md-raid1 personalities are still configured), but the bug still occurs.  Not that I expected those to matter, but just keeping the bug updated on my changed kernel config.

More tests coming...  I'm going to try to confirm X as a trigger before I try to bisect, as a solid trigger definitely will help with the bisect.
Comment 12 Duncan 2009-11-24 03:14:27 UTC
Unfortunately, doesn't happen /every/ time I hibernate from X, but it does seem reasonably reliable.  Maybe it's the /second/ hibernate or more, but only from X.  Whatever.  That's getting iffy.  But it should be reliable enough a trigger to bisect with, which is what counts.

BTW, I managed to get it to drop THREE of the four drives this time.  That's the first time I've had more than one drop out at once.
Comment 13 Tejun Heo 2009-11-24 05:48:52 UTC
Can you please attach the logs?  Thanks.
Comment 14 Duncan 2009-11-24 20:31:23 UTC
Created attachment 23923 [details]
new log, good, bad with three disks dropping, 2.6.32-rc8-11-ga8a8a66

New logs as requested.
Comment 15 Duncan 2009-11-26 11:55:01 UTC
Progress!

I've been busy bisecting, but it has been slow going as I ran into a series of bisect-goods, which took longer to confirm as I don't have a firm test-case, plus several bisect-skips due to compile failure (yeah, I know, but I thought I'd try...).

But I just ran into a bisect-bad after the long run of bisect-goods, and my remaining just dropped from 1400+ to 16! (32ish inclusive, I guess.)

They're all sched:

5a9b86f647a56862cdc0a1362bfb015ae921af7f
5158f4e4428c6b8d52796b3b460e95796123a114
3b6408942206f940dd538e980e9904e48f4b64f8

5158f4e4428c6b8d52796b3b460e95796123a114
3b6408942206f940dd538e980e9904e48f4b64f8
7c423e98856df9b941223a7e7845b2502ad84b00

51e0304ce6e55a6e59658558916b4f74da085ff0
59abf02644c45f1591e1374ee7bb45dc757fcb88
e69b0f1b41c0e57bb1e29100b5810a5914efcb45

63859d4fe4c97b737e7adbfe60acb1c2b2e668cb
a7558e01056f5191ff2ecff53b075dcb9e484188
7d47872146398dbede13223299fe1cb368ebc781

0763a660a84220cc3900fd32abdd7ad109e2278d
8e6598af3f35629c37249a610cf13e73f70db279
47fe38fcff0517e67d395c039d2e26d2de688a60

5cbc19a983141729d716be17197028434127b376
a8303aaf2b2f74714db6d204ab4fcb810942664e
d6a59aa3a2b1ca8411884c833a313b33b5f76e20

b8a543ea5a5896830a9969bacfd047f9d15940b2
0ec9fab3d186d9cbb00c0f694d4a260d07c198d9
6bd7821f905a8d6c471f0d6675f5cb7ea448d791

78e7ed53c9f42f04f9401ada6f7047db60781676
d7c33c4930f569caf6b2ece597432853c4151a45
83f54960c11a14942ab00b54c51e91906b9d8235

ae154be1f34a674e6cbb43ccf6e442f56acd7a70
c88d5910890ad35af283344417891344604f0438
e9c8431185d6c406887190519f6dbdd112641686

5f3edc1b1ead6d9bd45a85c551f44eff8fe76b9f
aaee1203ca52b9db799433c33c9bffc33cdf8909
f5f08f39ee4c5fd0a757d25d9e04d696676b3df7

e26af0e8b2c9916f1e8a12ddaf52057bbe8ff600
3cb63d527f76e25dbccce4f577f21aecfa2abac7
e6b1b2c9c0461c4e0971ed905ce3cda6512ee82a

b78bb868c54bebbf8d8786a3f8320700d6d2b864

Given I'm running NUMA and I've always thought that garbage model number looked like it might be a race condition between CPUs, I'm guessing at the following, or one intimately tied to it.
59abf02644c45f1591e1374ee7bb45dc757fcb88

Hopefully I'll confirm soon enough now.
Comment 16 Duncan 2009-11-26 16:35:36 UTC
Dupped some commits there, but I think it's invalid anyway.  I screwed up the bisect (a couple ways I think) and will need to start over.

The first time thru, I did about three bisect-skips, due to failure to compile.  I'm going to go until I hit them, then either configure that out of the kernel if it's not vital for the test, do a bisect-reset and start over without that component, or ask for specific instructions on what commit to apply and how to apply it without screwing the bisect, thereby making it much more straightforward, this time.
Comment 17 Tejun Heo 2009-11-27 06:10:02 UTC
Thanks for trying to bisect the problem.  Bisecting problems without firm test case is very difficult.  I wish I had more specific ideas but unfortunately I don't have any at this point.  :-(

Please let us know what you find out.
Comment 18 Duncan 2009-11-29 15:05:31 UTC
This doesn't make a lot of sense unless there's some bleedover that shouldn't be happening, because I don't have staging enabled at all, so perhaps I wasn't patient enough to trigger the issue and did a bisect good that was actually bad, but here's what bisect says:

b4362c9c1fe8bec487e275f92fdf57c585ac236a is the first bad commit
commit b4362c9c1fe8bec487e275f92fdf57c585ac236a
Author: Greg Kroah-Hartman <gregkh@suse.de>
Date:   Thu Jul 16 11:50:41 2009 -0700

    Staging: hv: osd: remove Sleep wrapper

    Use the "real" udelay call instead of a wrapper function.

    Cc: Hank Janssen <hjanssen@microsoft.com>
    Cc: Haiyang Zhang <haiyangz@microsoft.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

:040000 040000 48fe002f7771c6de21832dbd6d6cb9b7aed39932 ee5566cf0395d6c81a2bc2ac96eff2e8d8b43062 M      drivers

I guess the way to test it is to revert that commit on a later tested bad version, and see if that fixes it.  I'll post the bisect log, and try the revert later today.  (I'm about to go out, now.)

This one's certainly been a heisenbug here!  I don't like it!
Comment 19 Duncan 2009-11-29 21:05:35 UTC
Created attachment 23972 [details]
bisect log

Here's the bisect log.  It's sure different than the first one I tried, so at least one of them's screwed up.  The bads are definitely bad, v2.6.31 and v2.6.31.5 are definitely good, but all I know for sure on the other goods is that every hibernate I tried on them, typically three, doing a bunch of stuff in between to try and trigger it on the next hibernate, didn't resume with the model number mismatch.

Now the ultimate test, revert the above commit on one of the previously bad builds, and see if I can still trigger a model number mismatch with this commit reverted.
Comment 20 Duncan 2009-12-07 02:54:26 UTC
Status update:

1: I couldn't do a revert on the commit above to test (error about too many files, way more advanced git than my skills), so I started yet /another/ git bisect, this time using the known bad ones from the second bisect, and the known good points of v2.6.31 and v2.6.31.5, but /without/ the other bisect-goods, since I obviously wasn't patient enough waiting for the bug and ended up in the wrong place.

2: Meanwhile, I'm becoming increasingly convinced that the bug has something to do with X, as I don't believe I've /ever/ seen it trigger when X wasn't running in the background, altho I /do/ switch to the Radeon framebuffer console (so X is running, but in the background) in the script before I hibernate.

3: As more evidence to back that up:  I /had/ been running an old Radeon 92xx, X Radeon driver (r200), AGP, but finally went out and bought the hd4650 (r600 driver, rv730 chip) and installed it the other day.  It didn't like the kernel radeonfb and fbcon driver, and booted to vgacon.  Needless to say I don't like vgacon's resolutions (vga=ask wouldn't give me anything but 80xY choices), and can't really switch to KMS only available in 2.6.32 when I'm bisecting a bug in 2.6.32-rc1-land.  So I recompiled with vesafb.

Also, the new card doesn't want to run 3D in X, which was what I expected since I'm running release radeon drivers and it's not really in them yet, and even if it was, I found an xorg bug that said kernel 2.6.32 was needed.  But in ordered to get X to work at all (it starts but I get only a black screen and a mouse pointer otherwise) with the new card, I had to hard-disable OpenGL in xorg.conf.

The clincher to all that is that now I can't seem to trigger the bug at all (with the caveat that maybe it simply hasn't triggered yet, but I've done all sorts of stuff to try to trigger it, and haven't yet), with the new hardware and framebuffer/fbcon or vgacon and X settings, EVEN ON A FORMERLY TESTED BAD KERNEL!

Thus, the bug appears to be inter-related with one of the following: 
a) radeonfb, which I was running but which won't run on the new hardware at all, likely as it interacts with the X radeon driver.
b) Something related to OpenGL on X in the Radeon driver (X or kernel), since I had to disable OpenGL on the new card until I get the appropriate git X drivers and kernel 2.6.32
c) The xorg radeon r200 driver or the hardware itself, since it's changed too.

And, I still think it seems to be a race condition, possible related to the dual-dual-core Opterons I'm running, thus explaining why it's so hard to pin down a firm test-case for the "good" side.

At this point, I still have the old card and could put it in and test some more (including testing the old card with openGL disabled entirely, and with framebuffer disabled entirely, on a previously tested bad kernel), but I've already spent a lot of time on this, and provided it doesn't trigger again as I let it run a few more days or retrigger when I get the git X components lined up and enable OpenGL, and providing now that the kernel is released, we don't get more reports to help pin it down, I'll probably close it as WORKSFORME or whatever, in a few days, since I haven't yet seen it on the new Radeon card.
Comment 21 Tejun Heo 2009-12-07 03:06:29 UTC
Thanks for the extensive testing.  The bug is quite unusual and likely very low level.  I don't think it's software race condition.  Something is going very wrong at much lower layer (like PCI bus or host controller).  It could definitely be something related to video resume is throwing a wrench into the system.  Please keep an eye on the system with the new hardware and share the result.

Thanks.
Comment 22 Duncan 2009-12-12 01:34:23 UTC
Unfortunately, it returned.  With kernel 2.6.32, KMS, and the git xf86-video-ati driver (which will become 6.13 at some point, no released opengl support yet) on the radeon hd4650, the bug returned!

While the nature of the bug makes it uncertain, that's even more evidence that it's X based.  Note that while I had OpenGL active on the old card, it wouldn't support the full resolution I was running, so kwin was set to composite-only mode (one of the reasons for the graphics upgrade).

So it seems to be a mandelbug masquerading as a heisenbug, perhaps involving either the dri/drm stuff or related to fbcon, with both the radeonfb and radeondrifb (or whatever the kms version is called), or maybe even both!

With the new card it's even more difficult to bisect that combination, since radeonfb doesn't seem to work on the new card, at least in 2.6.32, and kms for it was only a 2.6.32 addition.  However, bisecting just was NOT working, anyway.

However, there's still hope. Unlike a lot of hibernation related bugs, I actually come back from hibernation with this one!  That means, if someone can tell me what to configure for tracing it, or provide a custom patch to 2.6.32 to log extra info, I can turn in the dmesg logs /with/ that extra info.

Otherwise, I'll simply have to be very careful to shutdown the RAID-0 and all the RAID-1s I can before hibernating, from now on, and hope for the best... if I hibernate at all.  And now that the kernel is out, I can hope someone else with an enough different setup to narrow down the possibilities runs into it and reports, too.

Meanwhile... Maybe I can setup a script to read back into cache my usual working set, the reason I was avoiding rebooting before, and that and the fact that I have faster booting now due to parallel access, will obviate the need for hibernate...  It'd certainly be simpler and more reliable.
Comment 23 Tejun Heo 2009-12-12 02:32:15 UTC
Rafael, who should we bug for video related suspend/resume issues?  I have no idea where to dig at this point.  Thanks.
Comment 24 Rafael J. Wysocki 2009-12-12 17:24:12 UTC
Drivers/DRM/non-Intel in this particular case.
Comment 25 Duncan 2010-02-15 10:15:10 UTC
Created attachment 25043 [details]
partial log from 2.6.33-rc7-25-g08c4f1b

> The following bug entry is on the current list
> of known regressions introduced between 2.6.31
> and 2.6.32.  Please verify if it still should
> be listed and let the tracking team know (either way).

Unfortunately, it's still here, as of 2.6.33-rc7-25-g08c4f1b.

It should be noted that I had **NOT** started X when this one triggered (I had just fresh-booted, hadn't run much besides system services a bash login, sudo, and a couple file edits), but I'm running KMS now, so it could still be related to whatever graphics drivers were only activated with X before, but are now activated at boot with KMS.  With KMS and all drivers built-in (no module loading active, now), DRM activates at the KMS framebuffer console, right?  The log seems to indicate that. (Re)Confirming the graphics card as a Radeon hd4650, AGP, dual 1920x1200 LCDs attached via DVI.  Still running the git xf86-video-ati driver, altho I hadn't started X so /it/ shouldn't have activated. Again, it's a dual Opteron 290 (so 2x dual-core), still six gigs of DDR 1 registered RAM.
Comment 26 Tejun Heo 2010-02-16 07:45:07 UTC
Any chance you can try a video card from the other vendor?

Summary of the bug for new comers.

* After resuming from STR, disks attached via sata_sil temporarily return bogus identification data causing libata to detach the disks and re-attach them.  During retrials, the disks succeed to return valid identification data and successfully get reprobed.

* The ATA identification data is read directly by the CPU via mmio region.  ie. No DMA is involved.  The weird thing is that CPU can talk to the controller well enough to successfully execute reset protocol and a few commands but block data transfer is failing miserably for brief period.  The data transfer corruption is partial, in one case, 'ST3300831AS' showed up as 'ST831AS                             <80>^P'.

So, something between the CPU and the controller isn't working correctly *temporarily* (few tens of seconds period).  Too aggressive async wakeup might cause behaviors like this, I suppose.

Thanks.
Comment 27 Oleksandr Yermolenko 2010-02-23 14:15:02 UTC
I have similar problem with reappearing of HDD with different /dev/sdX name after hibernate.
Interesting details:
1) NVIDIA graphics card running with proprietary drivers (190.53) instead of Radeon.
2) only one of 3 HDDs reappears with different name
3) system logs have "n_sectors mismatch" message.
4) experienced this problem on kernels 2.6.31.12 and 2.6.32.7 on Debian squeeze amd64. Kernels are from kernel.org, not from repositories.
5) This drive is Seagate ST3500320AS with SD1A firmware, sometimes it changes name from /dev/sdb to /dev/sdd after resume  The frequency of the event is relatively low (about one event for 5-10 hibernate/resume cycles). Other HDDs (SAMSUNG HD321KJ (/dev/sda) and WDC WD15EARS-00Z5B1 (/dev/sdc)) never change their /dev/sdX names after resume.

From /var/log/messages:
...
sd 2:0:0:0: [sda] Starting disk
sd 3:0:0:0: [sdb] Starting disk
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata4.00: n_sectors mismatch 976773168 != 976771055
ata4.00: disabled
sd 3:0:0:0: [sdb] START_STOP FAILED
sd 3:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
sd 5:0:0:0: [sdc] Starting disk
usb 4-1: reset low speed USB device using uhci_hcd and address 2
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata4.00: HPA detected: current 976771055, native 976773168
ata4.00: ATA-8: ST3500320AS, SD1A, max UDMA/133
ata4.00: 976771055 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata4.00: configured for UDMA/133
ata4.00: detaching (SCSI 3:0:0:0)
usb 4-2: reset low speed USB device using uhci_hcd and address 3
usb 5-1: reset low speed USB device using uhci_hcd and address 2
Restarting tasks ... done.
sd 3:0:0:0: [sdb] Synchronizing SCSI cache
sd 3:0:0:0: [sdb] Stopping disk
scsi 3:0:0:0: Direct-Access     ATA      ST3500320AS      SD1A PQ: 0 ANSI: 5
sd 3:0:0:0: Attached scsi generic sg1 type 0
sd 3:0:0:0: [sdd] 976771055 512-byte logical blocks: (500 GB/465 GiB)
sd 3:0:0:0: [sdd] Write Protect is off
sd 3:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdd: sdd1 sdd2 < sdd5 sdd6 >
sd 3:0:0:0: [sdd] Attached SCSI disk
...

Note that numbers in "n_sectors mismatch" message:
  ata4.00: n_sectors mismatch 976773168 != 976771055
are the same as in message "HPA detected" message:
  ata4.00: HPA detected: current 976771055, native 976773168

What does it mean? Is this something interesting?

Thanks.
Comment 28 Tejun Heo 2010-02-24 01:50:12 UTC
Oleksandr, can you please open a separate bug report, attach full dmesg and assign it to me.  You're seeing a different problem which is supposed to be fixed in recent kernels.  Your BIOS is forgetting to re-lock HPA on resume and libata is supposed to work around that but failed to.

Thanks.
Comment 29 Oleksandr Yermolenko 2010-02-25 15:27:16 UTC
(In reply to comment #28)
> Oleksandr, can you please open a separate bug report, attach full dmesg and
> assign it to me.  You're seeing a different problem which is supposed to be
> fixed in recent kernels.  Your BIOS is forgetting to re-lock HPA on resume
> and
> libata is supposed to work around that but failed to.
> 
> Thanks.

Tejun, I have opened separate bug report (# 15396), but assigned it to drivers_other@kernel-bugs.osdl.org by mistake. How can I fix this?

Thanks.
Comment 30 Florian Mickler 2010-11-04 19:35:25 UTC
Duncan, any news on this? Does this still happen as of 2.6.36?
Comment 31 Duncan 2010-11-05 02:11:06 UTC
I'm not sure as I quit even trying hibernate for awhile.  (I lost and had to rebuild a RAID-0 due to this; it was mostly local cache of net stuff like my git kernel, thus nothing unrecoverable, but it's still a hassle when you lose one, incentive to quit doing whatever triggered it.)

But I expect to have some time by early next week to update the kernel again  (I'm a few commits shy of final 2.6.36 ATM) and test both this and bug #19002 (a DRI/Radeon issue from early in 2.6.36 AND 2.6.35 stable series).

Meanwhile, I've been watching disk behavior since this bug report, and wonder if the faster/parallel bootstrap sequence is just too fast for the drives, now, and they're simply not quiesced (they detect, but apparently not to the point they detect as the same drives as before) yet.  I'm thinking something like the rootdelay= option, but different.

But best test and see if it's still an issue, before trying to solve a problem I'm not sure is still there.  I'll try to get updated results by Tuesday or so.
Comment 32 Duncan 2010-11-10 13:26:44 UTC
Well, I have results, but they're not good.

Since I last tried suspend on this machine, suspend has apparently broken in general.  As of 2.6.37-rc1-170-g-whatever, I can't even get it to reliably suspend now, even straight after boot, without starting X or even logging in (I have gpm setup with a triple-click hibernate action, so I can trigger it without being logged in.)

I /did/ get it to suspend once, directly after boot, but then went to try it again from within X, and not only would it not work from there, but (after a reboot), it wouldn't work again from a KMS-text console, either, even with a fresh boot without even logging in.  It seems to jam up in the freeze phase and never initialize the write-to disk -- nor successfully back out of the attempt and come back up without suspending.  It just sits there with the monitors asleep, no disk activity or anything.

So I can't really test the status of the resume, 'cause I can't even get it to (reliably) suspend properly.

I guess that means closed/NEEDINFO is where it's probably going to stay, as I've pretty much given up on getting hibernate working on this thing again, at this point.  It was nice while I had it, and it's /very/ nice on the netbook (where both suspend2ram and hiberante have always worked GREAT), but I'm tired of needlessly fighting with it on this one.
Comment 33 Florian Mickler 2011-01-16 13:15:19 UTC
Thx for you time. Generally, you don't need to wade into the early -rc kernels. They are normally quite unstable. It is sufficient if you test it on the stable kernels. Did you give it a try on 2.6.36.y? (or the final 2.6.37?)
Comment 34 Duncan 2011-02-23 08:45:14 UTC
AFAICT, Resolved with 2.6.37 release, so updating as such.

I finally got some time to test with a stable kernel, and as of 2.6.37, as best I can tell (this one bedeviled me to the point I'm reluctant to say it's for sure fixed now, without a YEAR of it not coming back!), it's fixed.

The kernel comes back up and from CLI (there's major X corruption preventing use of it after restore, but that'd be a different bug), I can dmesg, which clearly shows (1) an initial failure to read the drive's ID info properly, as before, but now, (2) instead of kicking it out to sde (with sd[a-d] normal) or whatever as it did before, it apparently retries and comes up with the proper ID, thus returning the drive to the proper sd[a-d] letter and its partitions to the proper place in their assigned md/RAIDs.

On a slightly different note, a big THANKS! to whoever introduced and continues maintaining magic-sysreq.  I had a hardware issue for awhile and at first thought it was yet another pre-release kernel bug.  A bit of circuit-trace paint later (expensive keyboard) and it's working fine again, but as often the case, I didn't realize how much I took that feature for granted until it failed.

And of course, thanks for getting the device-id detect and if necessary retry working, too. =:^)

Now to see how 2.6.38 is coming along.  Maybe the radeon graphics restore-state-on-resume works better now.  We'll see.
Comment 35 Florian Mickler 2011-02-23 10:02:40 UTC
Where is the patch that fixes the issue? 

You can choose between 3 alternatives here:

- Resolved|PatchAlreadyAvailable means there is a patch that is not yet merged in Linus tree,  this always needs a line "Patch: url_to_patch" in order for Rafaels scripts to work nicely. (Not strictly  needed here, as he doesn't produce reports for old kernels ... but.. anyway)

- Closed|CodeFix means, a commit was merged into linus tree that has fixed the issue (always give the sha1 id of that commit 

- Closed|Unreproducible means, it is not reproducible on current linus tree but it's not clear what fixed it. (i.e. you can not provide a sha1 of the fix)


I guess, the last one is what's applicable here?
Comment 36 Duncan 2011-02-23 10:30:57 UTC
Yes, thanks for the clarification.