Am now using kernel 3.1 without problems. echo disk > /sys/power/state works fine. It does NOT work in any of the 3.2.x series. Attached is my syslog from the latest experiment with 3.2.4. Using Slackware 13.37 on a Lenovo ThinkPad T510.
Created attachment 72283 [details] tail of syslog shows errors
Forgot to include this. The failure ONLY occurs when a USB 2.0 flash drive is in a socket. The flash drive filesystems are NOT mounted. Without a flash drive, hibernation is ok. But, once again, nothing like this happens in 3.1 kernel.
Follow the instructions in the kernel source file Documentation/usb/usbmon.txt to collect a usbmon trace showing what happens when the flash drive is first plugged in. Attach the trace to this bug report; it should reveal the reason for the problem.
Created attachment 72313 [details] 3.2.5 config with debugfs enabled
debugfs fs NOT created. No dir /sys/kernel/debug. Cannot do the tests, of course. Please advise.
Your config file attached to comment #4 contains this line: # CONFIG_DEBUG_FS is not set Fix that and you should be all set.
Created attachment 72326 [details] Output from 0u after flash drive is inserted
Created attachment 72327 [details] Output from 0u when inserted flash drive is removed
Created attachment 72328 [details] /var/log/debug after boot with flash drive present before start
Created attachment 72329 [details] /var/log/dmesg after boot with drive present
Alan Thanks for suggestion on config. It now shows debugging info. I collected 0u in 2 different scenarios, described above, & attached debug & dmesg logs just after boot into a console using a 3.2.5 kernel. I do not know how to read errors in ou files. Hopefully, you do. As an aside, I tried Fedora 16 (has a 3.2.x kernel) last week. It, too, did not hibernate. Richard PS Are you the maintainer of the hibernate module?
Created attachment 72330 [details] /var/log/dmesg after boot with FD
Oddly, the files all look quite normal. There's no sign that any process is hung while trying to communicate with the flash drive and no errors in the usbmon traces. A few seconds after you plug in the drive, does Alt-SysRq-W show anything interesting? Note: You should type this in a VT console, not a graphical window. Also, the output will be more meaningful if you enable CONFIG_FRAME_POINTER in your kernel. (Instead of sending the contents of /var/log/whatever, in the future it will be better if you send the output from the "dmesg" command. In fact, a dmesg dump following a hibernation failure with CONFIG_FRAME_POINTER enabled would be a good thing to see.) No, I'm not the maintainer of the hibernation code (which is part of the Power Management subsystem). But I am one of the USB maintainers.
Created attachment 72331 [details] dmesg file with FRAME_POINTER set after failed s2disk (hibernation) Alan Can you decipher this traceback? Richard
> does Alt-SysRq-W show anything interesting? Nope
Created attachment 72343 [details] Patch to log SCSI scanning and INQUIRY commands Yes, indeed I can decipher it. It says that a process named usb-stor-scan was running and could not be frozen. That process should show up when you run "ps -Awf" or something similar, after plugging in the drive -- even if you don't try to hibernate. It also should show up in the Alt-SysRq-W output. The process couldn't be frozen because it was waiting for a SCSI command to be executed. Although it doesn't say explicitly, it sure looks like the command in question is an INQUIRY. But the usbmon trace showed the INQUIRY command was executed normally! I don't know what's going wrong. Maybe this patch will help. Let's see what it adds to the dmesg output when you plug in the flash drive.
Hold on a minute! I just noticed that the problem occurred when you were trying to resume from hibernation, not when you were trying to hibernate! That makes a big difference. This means that the usb-stor-scan thread couldn't be frozen because it was waiting for some other thread to carry out the INQUIRY command, and that other thread had already been frozen. This explains everything -- except that I don't know what the other thread is. Let me look into this further.
Question: On your system, what shows up in /sys/module/scsi_mod/parameters/scan ?
Created attachment 72354 [details] Check for freezing in SCSI scanning threads This isn't exactly the correct solution, but it's worth trying to see if it helps at all.
Created attachment 72355 [details] dmesg after resume with your patch
cat /sys/module/scsi_mod/parameters/scan produces "sync" on both 3.1 & 3.2.5 Re: Your patch on scsi_scan.c Good news: It worked! (in a console) Not so good news: screen flashed "garbage" as it hibernated. v 3.1 does not. Now why is this suddenly a problem when it all functioned correctly in 3.1 & the preceeding 2.6.x series? Have you been able to replicate this on any of your hardware? Why am I the only Linux user to date that observes this? Usually IBM/Lenovo ThinkPad machines are among the most Linux friendly. Attached is the dmesg output after resume; looks pretty good to me. What next?
Created attachment 72368 [details] Diagnostics for SCSI scanning and the freezer Okay, so we're on the right track. I don't know about garbage flashing on the screen; that may be totally unrelated. I can't answer why the problem appeared in 3.2 until I know exactly what the problem is. That's what the attached patch is for. You should remove the previous patch, so that the error occurs again, and apply this one so that we can get more information about what's happening. And while you're at it, please enable CONFIG_PRINTK_TIME so that the dmesg log contains timestamps. I have not been able to reproduce your problem on my system (although I did uncover a couple of other problems with hibernation while trying). And then today an update to systemd prevented my test system from booting! So I have to solve that before I can run more tests over here. Still, the output you get from this diagnostic patch will help.
Created attachment 72369 [details] dmesg + PRINTK_TIME + latest patch
Attached above is dmesg using your second patch only after resume from hibernation. Of course, it failed. Hopefully, you'll garner enough info for the next phase. I did set CONFIG_PRINTK_TIME as is obvious. In my experience, flashes of garbage in suspend indicate a flurry of error messages to stdout. As an aside, I've looked at & played with systemd (via Fedora 16). I can not see the point. An awful lot of non-intuitive complexity for little benefit, at least on my machine. I still use the old BSD/ SysVinit scheme (rc.S, rc.M, ...). The only time 'hog' is dhcpcd, so I just launch it early in the background (... &). I'm up & running in a few seconds. Good enough for me. And really simple to maintain. What is your principal hardware & Linux distro?
Created attachment 72377 [details] Revised diagnostics for SCSI scanning and the freezer Unfortunately that log didn't contain the information I was looking for, i.e., the errors during resume from hibernation. It was overwhelmed by the normal activity of your primary disk drive. Here's a revised version of the patch that attempts to print out only the information of interest. I regularly use a couple of different systems, an HP desktop and an ASUS laptop. They're both running Fedora 16 32-bit.
Created attachment 72378 [details] dmesg - cold boot with patch 3
Created attachment 72379 [details] dmesg - after resume from hibernation with patch 3
Applied patch 3 (no patch 1 or 2). dmesg output from before & after hibernation are above. Any clues there? Is it in any way useful to run diff on the relevant files in 3.1 & 3.2.x? (When you installed Fedora 16, did you use GUI installer interface, or Anaconda kickstart cfg file, or?)
Created attachment 72381 [details] Freezer test while scanning This helped a little. Apparently the initial INQUIRY command doesn't complete properly. The freezer's timeout is 20 seconds and the command's timeout is a little more, so the freezer gives up first and aborts the hibernation attempt. The question is why does the INQUIRY fail in this way. I have no idea -- it worked perfectly during the cold boot. The relevant parts of the two logs are here: Cold boot: [ 3.912853] scsi 6:0:0:0: sdev req 12 [ 3.914546] scsi 6:0:0:0: sdev cmd 12 [ 3.916550] usb-storage 2-1.2:1.0: usb cmd 12 [ 3.918240] scsi 6:0:0:0: Direct-Access Patriot Memory PMAP PQ: 0 ANSI: 0 CCS Here the INQUIRY succeeded in less than 2 milliseconds. Hibernation resume: [ 3.541509] scsi 6:0:0:0: sdev req 12 [ 3.541571] scsi 6:0:0:0: sdev cmd 12 [ 3.541621] usb-storage 2-1.2:1.0: usb cmd 12 [ 23.498301] [ 23.498326] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0): [ 23.498408] usb-stor-scan D f461dcb0 0 51 2 0x00800004 Here the INQUIRY didn't succeed at all. But then later on, after a reset, it did work normally. Here's another test patch for you to try. This is a little different from the others. Boot with the flash drive unplugged, and start by doing "echo freezer >/sys/power/pm_test". Then type "echo disk >/sys/power/state" but don't press the Return key; just leave the cursor sitting at the end of the line. Then plug in the flash drive. The new patch will cause a 10-second delay before scanning begins, so after plugging it in, wait about five seconds and then hit the Return key. The question is whether the same sort of error will occur. (And by the way, for testing it's a good idea always to leave both CONFIG_PRINTK_TIME and CONFIG_FRAME_POINTER enabled.) (My Fedora 16 is an upgrade from Fedora 14, which in turn was an upgrade from Fedora 12, which in turn ...)
Created attachment 72383 [details] debug - patch 4
Created attachment 72384 [details] dmesg - patch 4
Created attachment 72385 [details] syslog - patch 4
Followed your instructions for patch 4. Waited 10 sec after FD insert then hit CR on "echo disk > /sys/power state". After many sec, got "write error: device or resource busy". I've attached above the subsequent logs (debug, dmesg, & syslog). How can device (sdb?) be busy? It's not even mounted. FRAME_POINTER is enabled.
You're supposed to hit Return after about 5 seconds, not 10 seconds. If you run the test from a VT console, you should see the "10-second test delay..." message appear when you plug in the flash drive; hit Return less than 10 seconds after it shows up. I can't tell what these attachments are. Some of them look like files from /var/log. Forget about those files, they aren't very useful for kernel debugging. All I care about is the output from the dmesg command. The file attached to comment #31 _is_ dmesg output, but it looks like the output from before you ran the test rather than after. It doesn't contain any important information. One thing you can do to help shrink the dmesg output is, before you start a test, to run "dmesg -C". That will clear all the old entries out of the kernel's log buffer so that afterwards you'll get only the messages that were generated during the test, as opposed to everything that has happened since boot-up.
Created attachment 72391 [details] dmesg.txt using patch 4 Repeated the test. dmesg.txt shows output. dmesg -c used to clear kernel log before the test. "10 second test delay..." was NOT echoed to screen after inserting FD. Waited 5 sec or so before doing the s2disk command. System hangs for about 20 sec, then I get "write error: device or resource busy" onthe VT console. Next I grabbed the dmesg stuff. (Nice small file now.) I repeated the test several times & always saw the same result.
Okay, that's great! It did just what I hoped -- the bug showed up, and you didn't have to go through a full hibernation cycle to do it. This will make further testing a lot easier. The reason for the message not being echoed to the screen is probably just that the console's log level was set too low for it to show up. You can increase the log level by typing Alt-SysRq-7 or doing "echo 7 >/proc/sys/kernel/printk". For the next test, I'd like you to do the same thing as here, but this time collect a usbmon trace while the test is running. Instructions for usbmon are in Documentation/usb/usbmon.txt. The 0u file will be fine; start the "cat" process before plugging in the flash drive.
Created attachment 72403 [details] 0u.txt from usbmon test inserting FD Here is 0u.txt while inserting FD. NB: This kernel still has the 10 sec delay patch in it. Also attached is dmesg.txt & script I use to run the test.
Created attachment 72404 [details] dmesg.txt output
Created attachment 72405 [details] bash script to run this test
You didn't do the "echo freezer >/sys/power/pm_test" and "echo disk >/sys/power/state" steps. This should be exactly the same as the test in comment #35 (including the 10-second delay) except for the addition of the usbmon tracing.
Command sequence Cold boot, no FD In VT1, start usbmon script (patch4test2) In VT2, run 10s delay script (patch4test1) Insert FD, wait 5s, CR Wait ... Kill usbmon in VT1 Get output (0u.txt dmesg.txt), attached.
Created attachment 72406 [details] 0u.txt
Created attachment 72407 [details] dmesg2.txt
Created attachment 72408 [details] patch4test1 script
Created attachment 72409 [details] patch4test2 script
Reran your tests using only one bash script, patch4test. Attached is 0u.txt & dmesg.txt. Looks pretty much as before.
Created attachment 72411 [details] 0u.txt
Created attachment 72412 [details] dmesg.txt
Created attachment 72413 [details] patch4test bash script (usbmon + 10s delay)
The usbmon trace in comment #42 holds the key. These two lines: f45ce240 539815062 S Ci:1:004:0 s a1 fe 0000 0000 0001 1 < f45ce240 539815267 C Ci:1:004:0 -2 0 That -2 status code was unexpected. Now I think I know the cause of the problem. Let me do some testing over here; I'll get back to you later on.
Okay, I think it's all set. While working on this, I found two other suspend-related bugs, so I'm going to attach three patches. The first two are fixes for those other bugs, and the third will contain the usb-storage fix (along with the 10-second delay so you can test it easily). These patches are for the 3.2.x kernel version. The usb-storage fix, in particular, will have to be changed for 3.3. But that's okay; I can take care of it now that I know what the problem is. For this, just do the same test as before. Attach the dmesg and usbmon output so I can make sure they are okay. If everything works right, I'll take the 10-second delay out of the third patch so you can try it with real hibernation.
Created attachment 72425 [details] Freeze the block-device polling workqueue
Created attachment 72426 [details] Implement the prepare stage for SCSI suspend
Created attachment 72427 [details] usb-storage scanning thread freezing fix and test Note: The previous usb-storage patch should be removed before applying this one.
Created attachment 72428 [details] 0u.txt
Created attachment 72429 [details] dmesg.txt
I did a fresh rebuild from scratch using ONLY the last 3 patches you posted. Output from the test is above. It looks good! No obvious errors (no -2 in 0u.txt & dmesg says all was frozen). So what's your assessment? Ready for a full hibernation test? In next round, please be specific on which patches to apply. I'm accumulating a large collection. Could you look & see what was damaged in the evolution from 3.1.x -> 3.2.x? Did someone accidentally delete something? Was an algorithm altered? Or..? Thanks.
Created attachment 72430 [details] Fix freezing of usb-storage scanning thread Here you go. This patch replaces the previous one, so it's meant to go along with the second- and third-to-last patches from before (i.e., the block-device polling and SCSI prepare-stage patches). Go ahead and try it with a full hibernation. What changed between 3.1 and 3.2 is the way the usb-storage scanning thread got frozen. You can see the actual change here: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=f02fe890ece7d695a5744b20 I okay'ed that change, not realizing that enabling signals to be delivered to the scanning thread would mess up its I/O delays. In addition, your flash drive has a bug of its own -- it refuses to carry out a SCSI command while waiting for a control transfer to complete. Normally this wouldn't matter, but since the signal delivery caused the scanning thread to cancel its initial control transfer rather than completing it (as shown by the -2 in the usbmon trace), the flash drive just hung. That in turn hung up the scanning thread long enough to abort the freezer and ruin the resume from hibernation. Anyway, my patch now does things the right way. It does not allow signals to be delivered, and it makes the scanning thread freezable at the place where it matters, that is, just before the I/O begins instead of at the start of a potentially long delay.
It worked in a console (some flashing text not readable). But, it completely failed in X. (Hibernated & powered down. Crashed during resume while extracting the swap image, I think.) Just rebooted the machine. I did this with the config for debugging mode.
That's bad. On the other hand, if it got as far as trying to read in the memory image from swap then it's doing better than before -- it means the original bug has been fixed. The new problem is likely to be something completely different. Is there any way you can collect any information about the crash? For example, can you boot with a network or serial console? Even if you can't, it might help to boot with "no_console_suspend" and "loglevel=8" on the grub command line. Also, can you verify that hibernation is reliable when you don't start it from within X? For example, if X is running but you switch back to a VT, does it work? And even if X isn't running, does hibernation occasionally fail?
So weird. Now it works. Running mc in an xterm & FireFox, I did s2disk & it resumed properly a few times. It also functioned when I switched from a running X to a VT & hibernated. I do not have access to network boot nor a serial console. I use lilo, not grub(2). So I added append="loglevel=8 no_consle_suspend" to the per image section & reran lilo. I think we need more testing of these mods. I'll try building 3.2.6 with your 3 patches & NO debugging set. (As I normally build & use a kernel.) Stay tuned.
Except for extraneous flashes on the screen (broken diagonal lines) during hibernation AND resume, 3.2.6 with your 3 new patches functions, sans debugging in config, as it should. s2ram is ok, too. Others should check this out & verify reliability. When might the patches be committed to a stable release? 3.3.0? Thanks for your dedicated effort in resolving this issue.
Created attachment 72453 [details] Convert the scanning routine from a kthread to a delayed-work It turns out that last patch still has potential problems (it could hang if you unplug the flash drive at the wrong moment just when starting to hibernate). Here's a different solution which should always work. Please try it out. The second patch has already been accepted and should appear in a 3.2 stable release soon. The first should be accepted before long. I'll wait to submit this third patch until you have tested it.
latest patch revision has errors. please advise. below is output from patch. patching file block/genhd.c patching file include/linux/workqueue.h patching file kernel/workqueue.c 1 patching file drivers/scsi/scsi_pm.c patching file drivers/scsi/scsi_priv.h 2 patching file drivers/usb/storage/usb.h patching file drivers/usb/storage/usb.c Hunk #2 FAILED at 827. Hunk #3 succeeded at 919 (offset -1 lines). Hunk #4 succeeded at 950 (offset -1 lines). Hunk #5 succeeded at 990 (offset -1 lines). 1 out of 5 hunks FAILED -- saving rejects to file drivers/usb/storage/usb.c.rej 3
I'm lost! Last patch has an error (? It's serious, no kernel is compiled.). In any case, can you possibly post ONE single patch file that contains all 3 patches in correct sequence, after having fixed latest snafu? Thanks.
Created attachment 72457 [details] Convert the scanning routine from a kthread to a delayed-work, for 3.2 kernel My fault, sorry. I forgot that you are using a 3.2 kernel, whereas my development and testing uses 3.3. The newly attached patch is the same as before but back-ported to 3.2, so it should apply with no problem. I don't like to distribute combined patches. It makes changes more difficult; if something is wrong with only one patch, tracking it down is that much harder. However, if you want, you can easily combine the three patches into a single file yourself simply by concatenating them.
It finally works, 3.2.5. Hibernates & resumes as expected. Will 3.3.0 stable release have all these patches incorporated?
Probably it will. Ask again a few weeks from now.
closing, as this is fixed upstream.
A patch referencing this bug report has been merged in Linux v3.3-rc5: commit bb94a406682770a35305daaa241ccdb7cab399de Author: Alan Stern <stern@rowland.harvard.edu> Date: Tue Feb 21 13:16:32 2012 -0500 usb-storage: fix freezing of the scanning thread