Bug 30912

Summary: Suspend/Resume with rootfs on USB, causes filesystem corruptions and kernel oops on mount attempt
Product: Drivers
Component: USB
Reporter: Eugene San (eugenesan)
Assignee: Greg Kroah-Hartman (greg)
Status: ASSIGNED
Severity: blocking
Priority: P1
CC: aaron.lu, alan, dmonakhov, drivers_other, eugenesan, fs_vfs, funtoos, io_other, mhieu.trinh, power-management_other
Hardware: All
OS: Linux
Kernel Version: 3.12
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  suspend resume cycle with debug
  reconstruction script
  /var/log/messages with Ext4-fs errors on resume

Description Eugene San 2011-03-11 07:28:01 UTC
Created attachment 50602 [details]
suspend resume cycle with debug

Suspending a system whose root filesystem is on a USB device causes FS
corruption, which crashes the kernel on the next mount attempt and makes the
system unbootable.

Affects (at least) RedHat6.0+, Fedora13+ and Ubuntu10+.

Reconstruction (Native):
---------------
a) Install RedHat/Fedora/Ubuntu on USB drive.
b) Boot into user session
c) Suspend
d) Resume
e) Repeat c and d (up to 4 times) until the kernel reports filesystem corruption.
f) Reset the system -> observe that the system cannot boot and the kernel panics
due to the corrupted FS.
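
A minimal sketch of how steps c) to e) could be automated. This is not the attached reconstruction script; the function name `first_failing_cycle` and its parameters are hypothetical, and the error markers are simply substrings of the kernel messages quoted in this report. The suspend action and the log reader are passed in as callables so the loop itself makes no assumption about the distribution.

```python
from typing import Callable, Optional

# Substrings taken from the kernel messages quoted in this report.
ERROR_MARKERS = (
    "EXT4-fs error",
    "Remounting filesystem read-only",
    "Aborting journal",
)


def first_failing_cycle(
    suspend: Callable[[], None],
    read_log: Callable[[], str],
    max_cycles: int = 4,
) -> Optional[int]:
    """Run suspend/resume cycles and return the 1-based cycle after which
    an error marker first appears in the log, or None if none appears."""
    for cycle in range(1, max_cycles + 1):
        suspend()  # assumed to block until the machine has resumed
        log = read_log()
        if any(marker in log for marker in ERROR_MARKERS):
            return cycle
    return None
```

On a real system, `suspend` might wrap `subprocess.run(["pm-suspend"], check=True)` and `read_log` might return the output of `dmesg`. Both are assumptions about the test machine, not part of the report.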

Reconstruction (Forced):
---------------
a) Install any RedHat/Fedora/Ubuntu on USB drive.
b) Boot into user session
c) Suspend
d) Hard reset the system (before performing resume) -> observe that the system
cannot boot and the kernel panics due to the corrupted FS.

This report actually describes two severe issues:
--------------------------------------------
a) FS corruption during suspend/resume
b) Kernel panic when trying to mount affected FS (linux<2.6.38).

Facts:
------
*) Happens on systems with rootfs on USB.
*) When the system returns from suspend, there is a ~25% chance that the rootfs on
USB is remounted read-only, due to a brief inability to read from the USB device
(a device settle delay is required) and/or corruption caused by a commit that did not finish before entering suspend.
*) The corruption is silent and the FS is marked as clean, which is what causes the kernel oops
when it tries to mount the corrupted FS.
*) An attempt to mount an affected EXT4 filesystem (on Linux<2.6.38) will result in a crash.
*) The corruption can usually be fixed with fsck on another Linux system, with auto-mount deactivated.
*) Tested Kernels: Mainline (Ubuntu Natty mainline Daily), Ubuntu 2.6.3x, Fedora 2.6.3x and Redhat6 2.6.3x.
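
A minimal sketch for checking, after a resume, whether the rootfs has been remounted read-only. It assumes the standard /proc/mounts line layout (device, mount point, filesystem type, options, dump, pass). The function name `mount_is_read_only` is hypothetical.

```python
from typing import Optional


def mount_is_read_only(mounts_text: str, mount_point: str = "/") -> Optional[bool]:
    """Return True or False for the last entry mounted at mount_point,
    or None if there is no such entry.

    mounts_text is expected in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        options = fields[3].split(",")
        result = "ro" in options  # a later entry overrides an earlier one
    return result
```

Typical use would be `mount_is_read_only(open("/proc/mounts").read())` right after resume.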
Comment 1 Eugene San 2011-03-11 07:30:18 UTC
Created attachment 50612 [details]
reconstruction script
Comment 2 devsk 2013-12-06 06:43:05 UTC
This still happens with latest kernel (3.12.3).

What's happening is that USB PERSIST is not working properly. I have CONFIG_USB_DEFAULT_PERSIST=y in the .config.
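
A minimal sketch for checking the per-device persist setting at runtime, assuming the kernel exposes it as `power/persist` under `/sys/bus/usb/devices/<device>` (the attribute is not present for hubs). The function name `usb_persist_enabled` is hypothetical, and the device name (for example `2-1`) has to be taken from the kernel log of the machine in question.

```python
from pathlib import Path
from typing import Optional


def usb_persist_enabled(
    device: str, sysfs_root: str = "/sys/bus/usb/devices"
) -> Optional[bool]:
    """Return True if power/persist reads "1", False for any other value,
    and None if the attribute does not exist for this device."""
    path = Path(sysfs_root) / device / "power" / "persist"
    try:
        return path.read_text().strip() == "1"
    except FileNotFoundError:
        return None
```

A result of True only says the feature is switched on for the device. It does not show that the device actually survived the suspend without a disconnect.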

Despite that, after resume the device comes up in a different slot. Diff of the lsusb before and after:

# diff lsusb.{before,after}
1275c1275
< Bus 002 Device 002: ID 0781:5580 SanDisk Corp. SDCZ80 Flash Drive
---
> Bus 002 Device 003: ID 0781:5580 SanDisk Corp. SDCZ80 Flash Drive
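
The same comparison can be sketched in a few lines of Python, assuming the lsusb summary-line layout shown in the diff above ("Bus NNN Device NNN: ID vvvv:pppp ..."). The names `parse_lsusb` and `renumbered` are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple

# Matches lsusb summary lines such as:
# Bus 002 Device 002: ID 0781:5580 SanDisk Corp. SDCZ80 Flash Drive
LSUSB_LINE = re.compile(
    r"^Bus (\d{3}) Device (\d{3}): ID ([0-9a-f]{4}:[0-9a-f]{4})", re.IGNORECASE
)


def parse_lsusb(text: str) -> Dict[Tuple[str, str], Set[str]]:
    """Map (bus, vendor:product ID) to the set of device numbers seen."""
    devices: Dict[Tuple[str, str], Set[str]] = {}
    for line in text.splitlines():
        match = LSUSB_LINE.match(line)
        if match:
            bus, dev, usb_id = match.groups()
            devices.setdefault((bus, usb_id), set()).add(dev)
    return devices


def renumbered(before: str, after: str) -> List[Tuple[str, str, Set[str], Set[str]]]:
    """List (bus, id, old numbers, new numbers) for devices present in both
    snapshots whose device numbers differ."""
    old, new = parse_lsusb(before), parse_lsusb(after)
    changes = []
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            changes.append((key[0], key[1], old[key], new[key]))
    return changes
```

Lines that are not summary lines (for example the descriptor dump of `lsusb -v`) are ignored by the parser.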

Looks like the USB device is not there after resume. A comparison of the lsscsi before and immediately after resume:

# diff lsscsi.{before,after}
3d2
< [6:0:0:0]    disk    SanDisk  Extreme          0001  /dev/sdb

From the dmesg, we can see that the USB device is disconnected and then reconnected after resume. Since the old slot has not been freed yet, the device gets the next device number.

> [  636.015458] PM: resume of devices complete after 2345.620 msecs
> [  636.015733] PM: Finishing wakeup.
> [  636.015752] ACPI: \_SB_.PCI0: Bus check notify on _handle_hotplug_event_root
> [  636.015734] Restarting tasks ... done.
> [  636.026432] usb 2-1: USB disconnect, device number 2 <=================================== disconnect
> [  636.029393] sd 6:0:0:0: [sdb] Unhandled error code
> [  636.029399] sd 6:0:0:0: [sdb]
> [  636.029402] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [  636.029406] sd 6:0:0:0: [sdb] CDB:
> [  636.029408] Write(10): 2a 00 06 9b 09 28 00 00 08 00
> [  636.029421] end_request: I/O error, dev sdb, sector 110823720
> [  636.029439] EXT4-fs (sdb3): Delayed block allocation failed for inode 1747860 at logical offset 0 with max blocks 1 with error 5
> [  636.029445] EXT4-fs (sdb3): This should not happen!! Data will be lost
>
> [  636.029513] EXT4-fs warning (device sdb3): ext4_end_bio:316: I/O error writing to inode 1746743 (offset 1347584 size 12288 starting block 14646750)
> [  636.029516] Buffer I/O error on device sdb3, logical block 11325658
> [  636.029519] Buffer I/O error on device sdb3, logical block 11325659
> [  636.029520] Buffer I/O error on device sdb3, logical block 11325660
> [  636.029522] Buffer I/O error on device sdb3, logical block 11325661
> [  636.029589] Aborting journal on device sdb3-8.
> [  636.029603] JBD2: Error -5 detected when updating journal superblock for sdb3-8.
> [  636.029614] journal commit I/O error
> [  636.030904] EXT4-fs error (device sdb3): ext4_journal_check_start:56: Detected aborted journal
> [  636.030907] EXT4-fs (sdb3): Remounting filesystem read-only
> [  636.030909] EXT4-fs (sdb3): previous I/O error to superblock detected
> [  636.036804] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [  636.238696] usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd
> [  636.251121] usb-storage 2-1:1.0: USB Mass Storage device detected
> [  636.251178] scsi7 : usb-storage 2-1:1.0
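
A minimal sketch for spotting this pattern in a kernel log automatically: a "USB disconnect" on a port followed by a "new ... USB device number" message on the same port. The regular expressions are modelled on the two message formats quoted above, and the name `find_reenumerations` is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Modelled on the messages quoted in this report, e.g.
#   usb 2-1: USB disconnect, device number 2
#   usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd
DISCONNECT = re.compile(r"usb (\S+): USB disconnect, device number (\d+)")
NEW_DEVICE = re.compile(r"usb (\S+): new .*?USB device number (\d+)")


def find_reenumerations(log: str) -> List[Tuple[str, int, int]]:
    """Return (port, old device number, new device number) for every port
    that logs a disconnect followed by a new-device message."""
    pending: Dict[str, int] = {}  # port -> device number at disconnect
    found = []
    for line in log.splitlines():
        match = DISCONNECT.search(line)
        if match:
            pending[match.group(1)] = int(match.group(2))
            continue
        match = NEW_DEVICE.search(line)
        if match and match.group(1) in pending:
            port = match.group(1)
            found.append((port, pending.pop(port), int(match.group(2))))
    return found
```

The sketch only pairs messages by port name. It does not check that the reconnected device is the same physical device, which would need the vendor and product IDs as well.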
Comment 3 devsk 2013-12-06 06:54:38 UTC
> [  636.026432] usb 2-1: USB disconnect, device number 2

Disconnect after resume.

> [  636.238696] usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd

That's where the same device has come up with device number 3.
Comment 4 devsk 2013-12-07 06:47:33 UTC
Looks like I found what's happening. It's the exact same thing as was debugged in http://marc.info/?l=linux-usb&m=137714769606183&w=2
Comment 6 mhtrinh 2014-01-18 08:09:37 UTC
Created attachment 122461 [details]
/var/log/messages with Ext4-fs errors on resume

+1
Not sure whether it is a USB renumbering problem or not, but I have a similar configuration and symptoms:
I have an HP mini 311 with an integrated SD card reader (ums_realtek and usb_storage modules).
My rootfs is on an SD card, on /dev/sdb1.
My system: OpenSuse 13.1, Linux hpmini 3.11.6-4-desktop #1 SMP PREEMPT Wed Oct 30 18:04:56 UTC 2013 (e6d4a27) i686 i686 i386 GNU/Linux

When I run:
echo "============ SUSPEND NOW ================ " >> /var/log/messages ; pm-suspend ;

The system suspends normally, but when it resumes, it freezes randomly after 5-20 s.
Sometimes it has time to save the kernel error messages to /var/log/messages (which I attached).

It looks like it fails to read from or write to the rootfs on the SD card on resume.

What is weird is that I didn't have this problem with OpenSuse 12.2, kernel 3.4.63.
Comment 7 devsk 2014-04-09 17:06:31 UTC
Does anybody have any input on this issue?
Comment 8 devsk 2014-04-21 21:55:18 UTC
Finally, updated to 3.14.1. And same old same old. /dev/sda becomes /dev/sdc after resume and all hell breaks loose.
Comment 9 Greg Kroah-Hartman 2014-05-04 15:08:32 UTC
For USB issues, please post to linux-usb@vger.kernel.org, we don't do
bugzilla.
Comment 10 devsk 2014-05-04 15:25:35 UTC
> we don't do bugzilla.

But you can read, right?

I am a developer myself. If this was the code written/maintained by me, I would take pride in fixing my code, no matter where the damn bug came from. I will ask for questions, search for clues whether it was mailing list or bugzilla or jira or google groups.

Honestly, after 3 years of the bug being ignored, I don't care about this bug anymore. So, I am not registering with some high frequency mailing list to provide you pretty much all the information I/we have already provided above.

You may close this bug now for all I care.
Comment 11 Greg Kroah-Hartman 2014-05-04 15:36:44 UTC
On Sun, May 04, 2014 at 03:25:35PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=30912
> 
> --- Comment #10 from devsk <kernel-bugs.dev1world@spamgourmet.com> ---
> > we don't do bugzilla.
> 
> But you can read, right?

Through email, yes.

But I don't scale, which is why the USB developers don't use bugzilla.
Send to the mailing list and all USB developers get told about the
issue, not just one little person like me.

> I am a developer myself. If this was the code written/maintained by me, I would
> take pride in fixing my code, no matter where the damn bug came from. I will
> ask for questions, search for clues whether it was mailing list or bugzilla or
> jira or google groups.

I am not the only developer of this code, and I maintain a few million
lines of code in the kernel tree.  Doing it all by myself is an
impossible task, so the community works together, which is how Linux
works.

> Honestly, after 3 years of the bug being ignored, I don't care about this bug
> anymore. So, I am not registering with some high frequency mailing list to
> provide you pretty much all the information I/we have already provided above.

You don't have to register with anything, just send an email, in
non-html form, and it will go through.