Bug 13501

Summary: Resuming from Suspend To RAM (S3) causes firewire disk to be re-detected.
Product: Drivers Reporter: Hakan Bayindir (hbayindir)
Component: IEEE1394 Assignee: drivers_ieee1394
Status: RESOLVED OBSOLETE    
Severity: normal CC: alan, hbayindir, stefanr
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.26, 2.6.29, 2.6.30-RC8 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg output taken from 2.6.29 after a suspend and resume cycle
dmesg output taken from 2.6.30-rc8 after a suspend and resume cycle
lspci -vvv output taken from 2.6.29
dmesg output of test case: resume with unmounted disk, no workarounds
dmesg output of test case: resume with unmounted disk, w/workaround 0x20
dmesg output of test case: resume with mounted disk, w/workaround 0x20
dmesg output of test case: resume with unmounted disk, w/workaround 0x20 & debug level 7

Description Hakan Bayindir 2009-06-10 18:51:37 UTC
First, the hardware:

M/B: MSI 7345 (P35-Platinum with Intel P35 & ICH-9R, BIOS version 1.9)
CPU: Intel Core2Quad 6600
F/W Controller: VIA Technologies, Inc. VT6306 Fire II IEEE 1394 OHCI Link Layer Controller (driven by firewire_ohci, O/B to the mainboard).
F/W Disk: Maxtor One Touch 4 Plus 1TB


What happens:
After resume, the system re-detects my firewire disk as a new device, gives it a new /dev/sdX entry, and kills the old mountpoint. Writing to that mountpoint generates errors about I/O to a dead device. Unmounting, unplugging, replugging, and mounting are required to solve the problem. The problem is always reproducible.

How to reproduce:
- Suspend the system by any method (S2RAM, kpowersave in KDE 3.5, the leave menu in KDE 4; it doesn't matter which).
- The system suspends normally.
- Some time passes (as expected).
- Wake the system; it comes back as expected.
- The system detects the firewire disk as a new device, gives it a different /dev/ entry, and drops the old mountpoint, so it appears dead. Any operation on the old mountpoint generates kernel messages about a dead I/O device.
- To fix the situation, unmount the dead mountpoint, unplug the disk, re-plug it, and mount everything back (sometimes I need to run fsck to verify the huge 1TB disk).

What should happen:
The system should treat the disk as an existing device, and the old mountpoint should keep working as if nothing happened.

Notes:
- Was not happening before but I cannot remember the exact version. should be 24 or 25.

- The .tar.gz files contain dmesg output for 2.6.29 and 2.6.30-rc8, and lspci -vvv output for 2.6.29.

- If you resume the system before the external disk powers off (~10 seconds), the bug doesn't occur.
Comment 1 Hakan Bayindir 2009-06-10 18:52:36 UTC
Created attachment 21840 [details]
dmesg output taken from 2.6.29 after a suspend and resume cycle
Comment 2 Hakan Bayindir 2009-06-10 18:53:02 UTC
Created attachment 21841 [details]
dmesg output taken from 2.6.30-rc8 after a suspend and resume cycle
Comment 3 Hakan Bayindir 2009-06-10 18:53:33 UTC
Created attachment 21842 [details]
lspci -vvv output taken from 2.6.29
Comment 4 Stefan Richter 2009-06-13 15:44:35 UTC
> - Was not happening before but I cannot remember the exact version.
> should be 24 or 25.

Do you remember whether you used the old drivers (ohci1394 + ieee1394 + sbp2) or the new drivers (firewire-ohci + -core + -sbp2) at the time when it still worked?

Thanks for the logs.  What's going on is that the firewire drivers are able to re-attach the disk as the very same SCSI device as before suspend.  But when the "Start Stop Unit" command to turn on the spindle motor is sent, it somehow fails and the kernel drops the existing SCSI device.  I'm not yet sure why these latter two things happen.

Please try the following:
# echo 0x20 > /sys/module/firewire_sbp2/parameters/workarounds
Then unmount the disk and unplug the disk.
Plug it back in.  (You don't need to mount it.)
Suspend and resume.
Check whether the SCSI device number still changed after resume, and if so, whether it was with the same pattern of messages as in your attached logs.
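
To check for a changed SCSI device number without reading through dmesg, the comparison could be scripted roughly as below. This is a minimal sketch, assuming the disk shows up as an sd* entry under /sys/block; the helper names list_scsi_disks and new_disks are hypothetical and not part of any firewire tool.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the names of the sd* block devices currently known to the kernel,
# one per line. Assumes the firewire disk appears as /sys/block/sdX.
list_scsi_disks() {
    for d in /sys/block/sd*; do
        # If nothing matches, the glob stays literal and this test skips it.
        [ -e "$d" ] && basename "$d"
    done
    return 0
}

# Print every name that is in the second list but not in the first.
# $1: device names before suspend, $2: device names after resume
# (space- or newline-separated).
new_disks() {
    for name in $2; do
        case " $(echo $1) " in
            *" $name "*) ;;          # already present before suspend
            *) echo "$name" ;;       # appeared only after resume
        esac
    done
}
```

Typical use would be to run before=$(list_scsi_disks) ahead of the suspend, after=$(list_scsi_disks) once the system is back, and then new_disks "$before" "$after". An empty result means no new sdX name appeared, i.e. the disk kept its device node.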

Furthermore, if the 0x20 workaround ( = enable extra parameters in the SCSI Start Stop Unit command) does not work, please generate a debug log:
Attach the disk.  (You don't need to mount it.)
# echo 7 > /sys/module/firewire_ohci/parameters/debug
Suspend and resume.
Attach the resulting dmesg here.
You can disable the debug logging again after that with:
# echo 0 > /sys/module/firewire_ohci/parameters/debug
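
The enable/capture/disable sequence above could also be wrapped so that debug logging is always switched off again afterwards. This is a rough sketch rather than an official procedure: the function name run_with_debug and the log file name in the usage note are hypothetical, and writing to the parameter file requires root.

```shell
#!/bin/sh
# Sketch only: run_with_debug is a hypothetical helper.

# Write a debug level to a module parameter file, run a command, then write 0
# back (the "disable" value given above), even if the command fails.
# $1: parameter file, $2: debug level, remaining arguments: command to run.
run_with_debug() {
    param=$1
    level=$2
    shift 2
    echo "$level" > "$param" || return 1
    status=0
    "$@" || status=$?
    echo 0 > "$param"
    return "$status"
}
```

For example: run_with_debug /sys/module/firewire_ohci/parameters/debug 7 sh -c 'echo "Suspend and resume, then press Enter"; read _; dmesg > resume-debug.log'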
Comment 5 Hakan Bayindir 2009-06-21 19:59:52 UTC
Hi,

Sorry for the late reply. I ran the tests and found that the workaround didn't work either. I executed the following scenarios:
- Disk unmounted, no workarounds enabled.
- Disk unmounted, workaround 0x20 enabled.
- Disk mounted, workaround 0x20 enabled.
- Disk mounted, workaround 0x20 enabled with debug level 7.

I also noted some strange behaviour of the disk which is not software related but comes from the design of the disk. When the disk is plugged in, it spins up briefly and then stops (self test and boot?), then spins up again. It has behaved this way since day one. This suggests to me that the disk takes a while to become ready and that the kernel is too impatient in waiting for its command replies.

I also tend to think that this was not happening before because firewire drives used to be scanned last, and that behaviour was deterministic. Now firewire devices are probed first on boot and resume, so on resume the probing occurs before the disk is ready (this brings another problem, a changing resume device if the disk is not present, but that is another matter).
Comment 6 Hakan Bayindir 2009-06-21 20:01:39 UTC
Created attachment 22036 [details]
dmesg output of test case: resume with unmounted disk, no workarounds
Comment 7 Hakan Bayindir 2009-06-21 20:02:26 UTC
Created attachment 22037 [details]
dmesg output of test case: resume with unmounted disk, w/workaround 0x20
Comment 8 Hakan Bayindir 2009-06-21 20:03:03 UTC
Created attachment 22038 [details]
dmesg output of test case: resume with mounted disk, w/workaround 0x20
Comment 9 Hakan Bayindir 2009-06-21 20:04:10 UTC
Created attachment 22039 [details]
dmesg output of test case: resume with unmounted disk, w/workaround 0x20 & debug level 7
Comment 10 Hakan Bayindir 2009-06-21 20:07:00 UTC
Correction:

I ran the last test with the disk unmounted, so the last case is
- Disk unmounted, workaround 0x20 enabled with debug level 7.
Comment 11 Stefan Richter 2009-10-06 14:57:15 UTC
A similar report arrived at linux1394-user:
http://marc.info/?t=125481515600002

Hardware: RaidSonic Icy Dock MB-559UEA-1S

Kernel: 2.6.31
(worked with some 2.6.31-rc, but failed with 2.6.31-rc when tested again)