Created attachment 50602 [details]
suspend resume cycle with debug

Suspending a system whose root filesystem is on a USB drive causes filesystem corruption, which crashes the kernel on the next mount attempt and leaves the system unbootable.
Affects (at least) RedHat 6.0+, Fedora 13+ and Ubuntu 10+.

Reconstruction (Native):
------------------------
a) Install RedHat/Fedora/Ubuntu on a USB drive.
b) Boot into a user session.
c) Suspend.
d) Resume.
e) Repeat c and d (up to 4 times) until the kernel reports filesystem corruption.
f) Reset the system.
-> Observe that the system cannot boot and the kernel panics due to the corrupted FS.

Reconstruction (Forced):
------------------------
a) Install any RedHat/Fedora/Ubuntu on a USB drive.
b) Boot into a user session.
c) Suspend.
d) Hard reset the system (before performing resume).
-> Observe that the system cannot boot and the kernel panics due to the corrupted FS.

The report actually describes two severe issues:
------------------------------------------------
a) FS corruption during suspend/resume.
b) Kernel panic when trying to mount the affected FS (Linux < 2.6.38).

Facts:
------
*) Happens on systems with rootfs on USB.
*) When the system returns from suspend, there is a ~25% chance that the rootfs on USB is remounted read-only. This is due to a brief inability to read from the USB device (a device settle delay is required) and/or corruption caused by a commit that did not finish before entering suspend.
*) The corruption is silent and the FS is still marked as clean, which is what causes the kernel oops when it tries to mount the corrupted FS.
*) An attempt to mount the affected EXT4 FS (on Linux < 2.6.38) results in a crash.
*) The corruption can usually be fixed from another Linux system with fsck, with auto-mount deactivated.
*) Tested kernels: Mainline (Ubuntu Natty mainline daily), Ubuntu 2.6.3x, Fedora 2.6.3x and RedHat 6 2.6.3x.
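For anyone reproducing this, one quick way to see whether the rootfs was remounted read-only after a resume is to look at the mount options of "/" in /proc/mounts. The following is only a minimal sketch and is not the attached reconstruction script; the function names are made up for illustration, and it assumes the usual /proc/mounts layout (device, mount point, fs type, options, ...).

```python
def root_mount_options(mounts_text):
    """Return the option list of the last mount entry for '/', or None.

    Assumes the usual /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # The last entry for '/' wins, since later mounts shadow earlier ones.
        if len(fields) >= 4 and fields[1] == "/":
            options = fields[3].split(",")
    return options


def root_is_read_only(mounts_text):
    """True if the root filesystem entry carries the 'ro' option."""
    options = root_mount_options(mounts_text)
    return options is not None and "ro" in options
```

On a live system the input would be the contents of /proc/mounts, read once before suspend and once after resume, for example `root_is_read_only(open("/proc/mounts").read())`.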
Created attachment 50612 [details] reconstruction script
This still happens with the latest kernel (3.12.3).

What's happening is that USB PERSIST is not working properly. I have CONFIG_USB_DEFAULT_PERSIST=y in the .config. Despite that, after resume the device comes up in a different slot. Diff of lsusb before and after:

# diff lsusb.{before,after}
1275c1275
< Bus 002 Device 002: ID 0781:5580 SanDisk Corp. SDCZ80 Flash Drive
---
> Bus 002 Device 003: ID 0781:5580 SanDisk Corp. SDCZ80 Flash Drive

Looks like the USB device is not there after resume. A comparison of lsscsi before and immediately after resume:

# diff lsscsi.{before,after}
3d2
< [6:0:0:0] disk SanDisk Extreme 0001 /dev/sdb

From the dmesg, we can see that the USB device is disconnected after resume and then reconnected. Since the old slot has not been freed yet, it gets the next device number.

> [ 636.015458] PM: resume of devices complete after 2345.620 msecs
> [ 636.015733] PM: Finishing wakeup.
> [ 636.015752] ACPI: \_SB_.PCI0: Bus check notify on _handle_hotplug_event_root
> [ 636.015734] Restarting tasks ... done.
> [ 636.026432] usb 2-1: USB disconnect, device number 2 <=================================== disconnect
> [ 636.029393] sd 6:0:0:0: [sdb] Unhandled error code
> [ 636.029399] sd 6:0:0:0: [sdb]
> [ 636.029402] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [ 636.029406] sd 6:0:0:0: [sdb] CDB:
> [ 636.029408] Write(10): 2a 00 06 9b 09 28 00 00 08 00
> [ 636.029421] end_request: I/O error, dev sdb, sector 110823720
> [ 636.029439] EXT4-fs (sdb3): Delayed block allocation failed for inode 1747860 at logical offset 0 with max blocks 1 with error 5
> [ 636.029445] EXT4-fs (sdb3): This should not happen!! Data will be lost
>
> [ 636.029513] EXT4-fs warning (device sdb3): ext4_end_bio:316: I/O error writing to inode 1746743 (offset 1347584 size 12288 starting block 14646750)
> [ 636.029516] Buffer I/O error on device sdb3, logical block 11325658
> [ 636.029519] Buffer I/O error on device sdb3, logical block 11325659
> [ 636.029520] Buffer I/O error on device sdb3, logical block 11325660
> [ 636.029522] Buffer I/O error on device sdb3, logical block 11325661
> [ 636.029589] Aborting journal on device sdb3-8.
> [ 636.029603] JBD2: Error -5 detected when updating journal superblock for sdb3-8.
> [ 636.029614] journal commit I/O error
> [ 636.030904] EXT4-fs error (device sdb3): ext4_journal_check_start:56: Detected aborted journal
> [ 636.030907] EXT4-fs (sdb3): Remounting filesystem read-only
> [ 636.030909] EXT4-fs (sdb3): previous I/O error to superblock detected
> [ 636.036804] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [ 636.238696] usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd
> [ 636.251121] usb-storage 2-1:1.0: USB Mass Storage device detected
> [ 636.251178] scsi7 : usb-storage 2-1:1.0
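The lsusb comparison above can also be done programmatically, which helps when the full listing is long. This is a rough sketch, assuming lsusb summary lines of the form shown in the diff ("Bus NNN Device NNN: ID vvvv:pppp ..."); the function names are hypothetical, and two identical devices on the same bus cannot be told apart this way.

```python
import re

# Matches lsusb summary lines such as:
# "Bus 002 Device 002: ID 0781:5580 SanDisk Corp. SDCZ80 Flash Drive"
LSUSB_LINE = re.compile(
    r"^Bus (\d+) Device (\d+): ID ([0-9a-fA-F]{4}:[0-9a-fA-F]{4})"
)


def parse_lsusb(text):
    """Map (bus, vendor:product) to the set of device numbers seen."""
    devices = {}
    for line in text.splitlines():
        match = LSUSB_LINE.match(line)
        if match:
            bus, dev, usb_id = match.groups()
            devices.setdefault((bus, usb_id), set()).add(int(dev))
    return devices


def renumbered(before_text, after_text):
    """Return devices present in both snapshots whose device number changed."""
    before = parse_lsusb(before_text)
    after = parse_lsusb(after_text)
    changes = {}
    for key in before.keys() & after.keys():
        if before[key] != after[key]:
            changes[key] = (sorted(before[key]), sorted(after[key]))
    return changes
```

Feeding it the two lines from the diff above reports the SanDisk drive on bus 002 moving from device number 2 to 3.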
> [ 636.026432] usb 2-1: USB disconnect, device number 2

Disconnect after resume.

> [ 636.238696] usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd

That's where the same device has come up with device number 3.
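The same disconnect/reconnect pattern can be picked out of a saved dmesg automatically. The sketch below is an illustration only: it assumes the two message formats quoted above ("USB disconnect, device number N" and "new ... USB device number N"), and the function name is made up.

```python
import re

# Message formats taken from the dmesg excerpt quoted in this report.
DISCONNECT = re.compile(r"usb (\S+): USB disconnect, device number (\d+)")
CONNECT = re.compile(r"usb (\S+): new .*?USB device number (\d+)")


def find_reenumerations(log_text):
    """Return (port, old_number, new_number) for each disconnect that is
    later followed by a new device appearing on the same port."""
    pending = {}  # port -> device number seen at disconnect
    events = []
    for line in log_text.splitlines():
        match = DISCONNECT.search(line)
        if match:
            pending[match.group(1)] = int(match.group(2))
            continue
        match = CONNECT.search(line)
        if match and match.group(1) in pending:
            port = match.group(1)
            events.append((port, pending.pop(port), int(match.group(2))))
    return events
```

Run against the log above, this yields a single event for port 2-1 going from device number 2 to 3.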
Looks like I found what's happening. It's the exact same thing as was debugged in http://marc.info/?l=linux-usb&m=137714769606183&w=2
https://bugzilla.kernel.org/show_bug.cgi?id=30912#
Created attachment 122461 [details]
/var/log/messages with Ext4-fs errors on resume

+1

Not sure whether it is a USB re-numbering problem or not, but I have a similar configuration and symptoms:

I have an HP mini 311 with an integrated SD card reader (ums_realtek and usb_storage modules).
My rootfs is on an SD card, on /dev/sdb1.
My system: OpenSuse 13.1, Linux hpmini 3.11.6-4-desktop #1 SMP PREEMPT Wed Oct 30 18:04:56 UTC 2013 (e6d4a27) i686 i686 i386 GNU/Linux

When I run:

echo "============ SUSPEND NOW ================ " >> /var/log/messages ; pm-suspend ;

the system suspends normally, but when it resumes, it freezes randomly after 5-20 s. Sometimes it has time to save the kernel error messages to /var/log/messages (which I attached).
It looks like it fails to read/write the rootfs on the SD card on resume.

What's weird is that I didn't have this problem with OpenSuse 12.2, kernel 3.4.63.
Anybody got any inputs on this issue?
Finally, updated to 3.14.1. And same old same old. /dev/sda becomes /dev/sdc after resume and all hell breaks loose.
For USB issues, please post to linux-usb@vger.kernel.org, we don't do bugzilla.
> we don't do bugzilla.

But you can read, right?

I am a developer myself. If this was the code written/maintained by me, I would take pride in fixing my code, no matter where the damn bug came from. I will ask for questions, search for clues whether it was mailing list or bugzilla or jira or google groups.

Honestly, after 3 years of the bug being ignored, I don't care about this bug anymore. So, I am not registering with some high frequency mailing list to provide you pretty much all the information I/we have already provided above.

You may close this bug now for all I care.
On Sun, May 04, 2014 at 03:25:35PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=30912
>
> --- Comment #10 from devsk <kernel-bugs.dev1world@spamgourmet.com> ---
> > we don't do bugzilla.
>
> But you can read, right?

Through email, yes. But I don't scale, which is why the USB developers don't use bugzilla. Send to the mailing list and all USB developers get told about the issue, not just one little person like me.

> I am a developer myself. If this was the code written/maintained by me, I would
> take pride in fixing my code, no matter where the damn bug came from. I will
> ask for questions, search for clues whether it was mailing list or bugzilla or
> jira or google groups.

I am not the only developer of this code, and I maintain a few million lines of code in the kernel tree. Doing it all by myself is an impossible task, so the community works together, which is how Linux works.

> Honestly, after 3 years of the bug being ignored, I don't care about this bug
> anymore. So, I am not registering with some high frequency mailing list to
> provide you pretty much all the information I/we have already provided above.

You don't have to register with anything, just send an email, in non-html form, and it will go through.