|Summary:||Suspend doesn't work when SD card is inserted|
|Product:||Power Management||Reporter:||Rafael J. Wysocki (rjw)|
|Component:||Hibernation/Suspend||Assignee:||Rafael J. Wysocki (rjw)|
|Severity:||normal||CC:||dbrownell, pierre-bugzilla, stern, xhejtman, zdenek.kabelac|
|Bug Depends on:|
INFO trace after resume
Suspend with inserted SD card
Trace without inserted SD card
Debug patch to try
error trace with inserted SD
error trace with inserted SD after first suspend
Patch to fix unbalanced mutex unlocking in dpm_suspend()
Experimental patch to fix the problem
screenshot of a dead machine
Patch to fix the (first) problem
screenshot of lock with more upper lines
trace shows some inhibit errors
Make MMC workqueue freezable
Description Rafael J. Wysocki 2008-02-17 13:14:03 UTC
Comment 1 Rafael J. Wysocki 2008-02-17 13:17:11 UTC
Problem exposed by: commit 775b64d2b6ca37697de925f70799c710aab5849a Author: Rafael J. Wysocki <email@example.com> Date: Sat Jan 12 20:40:46 2008 +0100 PM: Acquire device locks on suspend
Comment 2 Alan Stern 2008-02-17 19:56:57 UTC
Can you attach the dmesg log showing what happens when you try to suspend and it fails?
Comment 3 Zdenek Kabelac 2008-02-18 04:15:00 UTC
Hmmm I'll try in the evening harder to get any log - except I don't know how do I get the log, when I do not have serial console available - but maybe I'll be able to take same screen snaps - the machine stayes locked with empty text screen while the sleep led of T61 stays flashing - so it's not possible to get any dmesg - I had to turn off/on computer at this moment. Anyway during successful resumes without SD card I've got couple of traces which might be also worth to check so I'm attaching them - they are from various stages of kernels during bisect games - so maybe some of them do not apply to the latest git tree. I'll attach them
Comment 4 Zdenek Kabelac 2008-02-18 04:18:03 UTC
Created attachment 14883 [details] INFO trace after resume Actually it looks like the traces are roughly similar and show the same problem all the time - so just one log is attached generated with the latest kernel
Comment 5 Rafael J. Wysocki 2008-02-18 04:34:04 UTC
The issue shown in the log is most likely unrelated to the problem that you're experiencing and should be reported separately. To debug the SD problem, please compile the kernel with CONFIG_DEBUG_INFO set. Then, with this kernel installed and with the SD card inserted, please switch to the text console, login as root and do # echo 8 > /proc/sys/kernel/printk # echo devices > /sys/power/pm_test # echo mem > /sys/power/state This will probably hang, but it should dump a stack trace before. We'll need a copy of this trace (eg. if you have a digital camera, you can take a picture of the screen and attach it).
Comment 6 Zdenek Kabelac 2008-02-18 06:17:18 UTC
Created attachment 14884 [details] Suspend with inserted SD card Ok I've created two trace with kernel compiled with PM debug. Once I've run PM without inserted SD and once with it (this trace was takend after echo device - no futher loggin possible) The last visible text was on console was: PM: Entering mem sleep ACPI: Preparing to enter system sleep state S3 Suspending console(s) In the full log (without SD) its visible, that suspend continues with: drm card0: suspend - somewhat unrelated to MMC I guess - unless MMC damages some structures somewhere. BTW - am I correct when I assume that PM debugging actually immediately resumes and turns the screen into a green color (as this happened to me) ?
Comment 7 Zdenek Kabelac 2008-02-18 06:19:46 UTC
Created attachment 14885 [details] Trace without inserted SD card Here is full trace with resume when no SD card is inserted into the card reader.
Comment 8 Rafael J. Wysocki 2008-02-18 07:23:49 UTC
[Well, I'm sorry I should have told you before.] Please boot the kernel with the "no_console_suspend" command line argument. Please also compile it with CONFIG_PM_VERBOSE unset, to reduce the noise. Can you produce a suspend trace with SD inserted from the kernel booted with "no_console_suspend", please?
Comment 9 Zdenek Kabelac 2008-02-18 07:56:03 UTC
With no_console_suspend & SD inserted the log on screen ends just two lines afterward: drm_sysfs_suspend ACPI: PCI interrupt for device 0000:00:02.0 disabled Here no futher output is visible - thought I'd would have guessed that maybe something still happens as the led diod stops flashing a bit later (unsure - but looks like this) FYI:00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c) (prog-if 00 [VGA controller]) Zdenek
Comment 10 Alan Stern 2008-02-18 08:05:33 UTC
Apart from whatever problem we find with the SD driver, it looks like there's going to be another problem. In drivers/mmc/core/core.c:mmc_suspend_host(), if host->bus_ops->resume isn't set then the call to host->bus_ops->remove(host) is likely to end up reaching device_del().
Comment 11 Rafael J. Wysocki 2008-02-18 15:54:31 UTC
Created attachment 14889 [details] Debug patch to try That's bad. Apparently, someone attempted to avoid defining .resume() for some drivers by reusing the mmc_stop_host() code. I'm not sure what's the right fix for that yet. In the meantime, however, we can check if this is what happens in the present case. Zdenek, can you please if the attached patch has any effect? If the issue pointed out by Alan is triggered, it will just fail the suspend without hanging.
Comment 12 Zdenek Kabelac 2008-02-19 02:01:59 UTC
Created attachment 14892 [details] error trace with inserted SD I've got two results - I've also cut the begging part (as it could be seen in my prev post) - so just the suspend stage (using pm-suspend - to get proper console back when it succeeds) This trace shows one kind of mutex bug - system was started with inserted SD card.
Comment 13 Zdenek Kabelac 2008-02-19 02:04:44 UTC
Created attachment 14893 [details] error trace with inserted SD after first suspend Here is another trace - among the already shown INFO trace (where should I actually post this - to lkml or create here another bug for this - in which section?) - you could see what happens when I insert SD card and try second suspend which fails - again - this kernel is with your attech debug patch.
Comment 14 Rafael J. Wysocki 2008-02-19 15:23:38 UTC
Thanks a lot for the debugging work! First, the patch triggers, which means that the problem discovered by Alan is troubling us. [Alan, do you have an idea how to fix that cleanly?] Second, it seems we have a bug in the suspend error path in drivers/base/power/main.c .
Comment 15 Rafael J. Wysocki 2008-02-19 15:32:25 UTC
Created attachment 14907 [details] Patch to fix unbalanced mutex unlocking in dpm_suspend() Well, the second issue was easy to fix. :-)
Comment 16 Rafael J. Wysocki 2008-02-20 17:11:10 UTC
Created attachment 14921 [details] Experimental patch to fix the problem Please check if the attached patch helps. I haven't had the time to test it myself, so please be careful.
Comment 17 Zdenek Kabelac 2008-02-21 02:41:53 UTC
Created attachment 14926 [details] surviving suspend Great we are moving forward :) - with this patch it is better - yet still not rock stable. Suspend looks to work good when I leave the card as it as - in the slot or outside. Problem appears when I add/remove SD during suspend - in this attachment you can see the the traditional INFO report and around 47sec extra add message bactrace. In my next post I'll attach png of the locked machine.
Comment 18 Zdenek Kabelac 2008-02-21 02:46:28 UTC
Created attachment 14927 [details] screenshot of a dead machine Here comes the picture of locked machine - hope this trace is enough. Happens during SD card insert/removal during suspend.
Comment 19 Zdenek Kabelac 2008-02-21 03:08:43 UTC
Just for curiosity I've checked device_pm_remove - and I'm unsure why the dev->sem is taken twice - but is it normal to let the user know, kernel will deadlock ? I would expect maybe failure during suspend - but IMHO causing deadlock and even knowing about it is weird...
Comment 20 Rafael J. Wysocki 2008-02-21 07:46:20 UTC
(In reply to comment #18) > Created an attachment (id=14927) [details] > screenshot of a dead machine > > Here comes the picture of locked machine - hope this trace is enough. > Happens during SD card insert/removal during suspend. This seems to be a trace from device_pm_remove(). Did the machine lock up hard after that?
Comment 21 Zdenek Kabelac 2008-02-21 08:24:51 UTC
Yes - this is the screen snapshot of deadlocked machine - I could try later in home this with different resolution to get more line on display for snapshot - but maybe you are able to see the reason of deadlock without it?
Comment 22 Rafael J. Wysocki 2008-02-21 08:42:54 UTC
(In reply to comment #21) > Yes - this is the screen snapshot of deadlocked machine - I could try later > in > home this with different resolution to get more line on display for snapshot > - Yes, please. > but maybe you are able to see the reason of deadlock without it? No, I'm not.
Comment 23 Rafael J. Wysocki 2008-02-21 08:46:57 UTC
Pierre, do you use a separate workqueue for handling the removal of a card by the user?
Comment 24 Rafael J. Wysocki 2008-02-21 14:52:28 UTC
Created attachment 14941 [details] Patch to fix the (first) problem Attached is a simplified version of the patch from Comment #16. Zdenek, please give it a try if you can.
Comment 25 Zdenek Kabelac 2008-02-21 14:58:21 UTC
Created attachment 14942 [details] screenshot of lock with more upper lines I've been wrong in my previuos post - the machine was not completely dead - my magic-sysrq was not properly enabled at that time - thus I've thought the machine was dead. The machine stops - but MagicSysrq works and also I could scroll buffer a bit - so no better resolution (because with vga=0x36d after resume I get broken screen) -but you can see the lines above - mainly 'Suspicious device_pm_remove during suspend' the call trace - when I hit ShowPC it looks like CPU spends its time acpi_idle_enter_bm - but machine will not suspend (and sleeping led stays flashing)
Comment 26 Zdenek Kabelac 2008-02-21 15:03:59 UTC
Created attachment 14943 [details] trace shows some inhibit errors This trace shows some inhibit problem from MMC0: Unsure how it's related with suspend ?? And as I can see you have posted new patch for test - I'll check it.
Comment 27 Zdenek Kabelac 2008-02-21 15:27:03 UTC
Tested new patch - works good - I could suspend - and as expected device_pm_remove suspend stop stays there - but otherwise it sleeps :)
Comment 28 Alan Stern 2008-02-22 12:52:05 UTC
Evidently there's a workqueue involved somewhere, but I don't know where. Also there may be a problem with your patch; subsystems may assume that after device_unregister() has returned, the driver's release() method has run -- which won't be the case. This could mean your patch won't be usable in general, and we'll just have to fix the individual drivers. We really need Pierre's help with this... I don't understand how the MMC subsystem is intended to work.
Comment 29 Lukas Hejtmanek 2008-02-22 16:29:01 UTC
(In reply to comment #1) > Problem exposed by: > > commit 775b64d2b6ca37697de925f70799c710aab5849a > Author: Rafael J. Wysocki <firstname.lastname@example.org> > Date: Sat Jan 12 20:40:46 2008 +0100 > > PM: Acquire device locks on suspend > Rafael, this commit probably triggers this bug: http://bugzilla.kernel.org/show_bug.cgi?id=9874 see my last comment there.
Comment 30 Zdenek Kabelac 2008-02-23 03:40:46 UTC
Created attachment 14957 [details] smp_processor_id deadlock And I've got another deadlock - if that makes any help in this case. I've got camera nearby :)
Comment 31 Zdenek Kabelac 2008-02-23 03:43:25 UTC
Created attachment 14958 [details] config file Just to make clean which options I'm using for kernel compilation - I'm using preemptible kernel which might be the source of some problems ?
Comment 32 David Brownell 2008-02-23 12:51:31 UTC
So the "unbalanced mutex unlock" patch is in mainline but not the "fix the first problem" patch (which looks to have a locking bug, "suspending task" should be assigned with the lock held) ... which I'm guessing is experimental. My first reaction was that this seems similar to the MMC/SD problems I mentioned on 3-Feb in http://marc.info/?l=linux-kernel&m=120204467820656&w=2 It certainly had the same resolution: take SD card out before suspending. I finally had a look at this, and my first conclusion is that the only code with a chance of being correct is the MMC_UNSAFE_RESUME code ... which isn't actually unsafe (!!) except maybe when the host controller driver (a) can't detect card removal during suspend, and (b) is broken in that it doesn't handle that sanely. Although, yet to test this here, the BUG_ON no-card is obviously the wrong premise... I could believe that part of why it's unsafe is that it doesn't work right! It turns out that *ALL* the MMC controllers I have access to work sanely, and will detect such removals. (Actually it's done through the card detect GPIO, which is normally configured as a wakeup-capable IRQ.) So the default behavior is pretty broken. Zdenek, does enabling that MMC_(NOT_REALLY)_UNSAFE_RESUME option behave at all on your system? I'm going to try current GIT on a previously-working system, and see how it behaves. There does seem to be some recent unrelated breakage though ... as in, for reasons unkown to me the system wouldn't come out of suspend and the framebuffer never suspended. (And in fact, a console cursor kept blinking, so there seems to have been some problem in the suspend sequence.)
Comment 33 Rafael J. Wysocki 2008-02-23 12:57:07 UTC
(In reply to comment #32) > So the "unbalanced mutex unlock" patch is in mainline but not the "fix the > first problem" patch (which looks to have a locking bug, "suspending task" > should be assigned with the lock held) Well, Alan convinced me that it could be done without the lock. Never mind, though, the patch isn't going anywhere.
Comment 34 Rafael J. Wysocki 2008-02-23 13:01:03 UTC
Zdenek, can you please test with the patch from: http://bugzilla.kernel.org/attachment.cgi?id=14961&action=view applied?
Comment 35 David Brownell 2008-02-23 19:23:38 UTC
Created attachment 14962 [details] mostly-works-for-me OK, the appended patch works on my system against 2.6.25-rc2-git, once I configure the (not-really) "unsafe" resume thing ... with card inserted during suspend, or not (the "not" was previously wrongly flagged as a bug). (Hardware is, for the record, an at91sam9263ek development board.) See patch comments for the unresolved case, and some other issues. I also ended up developing a patch like  because of some oddness I observed.  http://avr32linux.org/archives/kernel/2008-February/000772.html
Comment 36 Rafael J. Wysocki 2008-02-24 12:33:37 UTC
Patch from Comment #34 was incomplete. Please use this one instead: http://bugzilla.kernel.org/attachment.cgi?id=14964&action=view
Comment 37 Rafael J. Wysocki 2008-02-24 15:10:29 UTC
Patch from Comment #36 was still incomplete. The updated version is at: http://bugzilla.kernel.org/attachment.cgi?id=14965&action=view
Comment 38 Rafael J. Wysocki 2008-02-24 17:14:34 UTC
Patch : http://marc.info/?l=linux-acpi&m=120389632114090&w=2
Comment 39 Pierre Ossman 2008-02-25 04:07:28 UTC
(In reply to comment #23) > Pierre, do you use a separate workqueue for handling the removal of a card by > the user? > Not in this path, no. A real eject does go via a (mmc specific) workqueue though. (In reply to comment #32) > It turns out that *ALL* the MMC controllers I have access to work sanely, and > will detect such removals. (Actually it's done through the card detect GPIO, > which is normally configured as a wakeup-capable IRQ.) So the default > behavior > is pretty broken. It is unsafe because for the general case we cannot know that a removal (and reinsertion) cannot go by undetected. If you have a fix for that, then by all means submit a patch (or make the filesystems suspend aware). But as it stands, the default is not changing. For the record, the MMC layer suspend behaviour used to work just fine (both with and without the UNSAFE option). So preferably the ability to remove devices during suspend should be restored. The card device is a child of the controller, so it should not yet be suspended at the point we try to remove it.
Comment 40 Alan Stern 2008-02-25 07:03:26 UTC
Created attachment 14979 [details] Make MMC workqueue freezable Never mind all my earlier patches. The problem of synchronization between the workqueue and the suspend method should be solved by making the workqueue freezable and converting the flush operations to specific cancellations. I haven't even compile-tested this patch, but the idea is sound. Zdenek, please try this patch together with Rafael's latest from comment #38. (Dave, you could try the same thing.)
Comment 41 Zdenek Kabelac 2008-02-25 10:22:39 UTC
ok - I've tested now the patch from comment #38 - it works as previous patches - still softlock with device_pm_remove I'll now check it with patch #40 - I assume together ? Also the kernel after the commit 85b80ebfa4384b8ea30cc1af9617db30319a9ccd is something 'unreal' - why such thing gets in the git tree when there is already a lot of other broken things ??? - I would probably remove any -rc from such kernel....
Comment 42 Zdenek Kabelac 2008-02-25 11:41:59 UTC
Actually where do I get function create_freezable_workqueue for the MMC workque patch
Comment 43 Rafael J. Wysocki 2008-02-25 11:54:37 UTC
It should be create_freezeable_workqueue().
Comment 44 Zdenek Kabelac 2008-02-25 13:10:14 UTC
I've good news :) I've been unable to freeze my box with SD card inserting removing (at least not as easily as before) when I combine #38 + #40 + $43 So this was nice - however now I'm currious how to bisect current merge in -rc3 - this seem to be hard task for bisect - resume will not resume at all. Also in my test commit 85b80ebfa4384 I'm getting some other back traces probably unrelated to suspend changes and possibly fixed in the merged tree - however the merged tree doesn't work with suspend.
Comment 45 Rafael J. Wysocki 2008-02-25 13:32:12 UTC
(In reply to comment #44) > I've good news :) I've been unable to freeze my box with SD card inserting > removing (at least not as easily as before) when I combine #38 + #40 + $43 > > So this was nice - however now I'm currious how to bisect current merge > in -rc3 - this seem to be hard task for bisect - resume will not resume > at all. Please try reverting: git bisect points at this commit: commit 559bbe6cbd0d8c68d40076a5f7dc98e3bf5864b2. power_state: get rid of write-only variable in SATA
Comment 46 Lukas Hejtmanek 2008-02-25 14:04:43 UTC
(In reply to comment #45) > Please try reverting: > > git bisect points at this commit: > commit 559bbe6cbd0d8c68d40076a5f7dc98e3bf5864b2. > power_state: get rid of write-only variable in SATA seems to be OK now.
Comment 47 Zdenek Kabelac 2008-02-26 12:31:11 UTC
I could also confirm that latest kernel suspend with this reverted commit - and few others- but going to make some more reports about bad locking
Comment 48 Rafael J. Wysocki 2008-03-05 14:59:09 UTC
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a8d37a37380e2b1500592d40b7ec384dbebe7a0