Subject : Suspend doesn't work when SD card is inserted
Submitter : "Zdenek Kabelac" <email@example.com>
Date : 2008-02-17 12:00
References : http://lkml.org/lkml/2008/2/17/81
Handled-By : Rafael J. Wysocki <firstname.lastname@example.org>
This entry is being used for tracking a regression from 2.6.24. Please don't
close it until the problem is fixed in the mainline.
Problem exposed by:
Author: Rafael J. Wysocki <email@example.com>
Date: Sat Jan 12 20:40:46 2008 +0100
PM: Acquire device locks on suspend
Can you attach the dmesg log showing what happens when you try to suspend and it fails?
Hmmm I'll try in the evening harder to get any log - except I don't know how do I get the log, when I do not have serial console available - but maybe I'll be able to take same screen snaps - the machine stayes locked with empty text screen while the sleep led of T61 stays flashing - so it's not possible to get any dmesg - I had to turn off/on computer at this moment.
Anyway during successful resumes without SD card I've got couple of traces which might be also worth to check so I'm attaching them - they are from various stages of kernels during bisect games - so maybe some of them do not apply to the latest git tree.
I'll attach them
Created attachment 14883 [details]
INFO trace after resume
Actually it looks like the traces are roughly similar and show the same problem all the time - so just one log is attached generated with the latest kernel
The issue shown in the log is most likely unrelated to the problem that you're experiencing and should be reported separately.
To debug the SD problem, please compile the kernel with CONFIG_DEBUG_INFO set. Then, with this kernel installed and with the SD card inserted, please switch to the text console, login as root and do
# echo 8 > /proc/sys/kernel/printk
# echo devices > /sys/power/pm_test
# echo mem > /sys/power/state
This will probably hang, but it should dump a stack trace before. We'll need a copy of this trace (eg. if you have a digital camera, you can take a picture of the screen and attach it).
Created attachment 14884 [details]
Suspend with inserted SD card
Ok I've created two trace with kernel compiled with PM debug.
Once I've run PM without inserted SD and once with it (this trace was takend after echo device - no futher loggin possible)
The last visible text was on console was:
PM: Entering mem sleep
ACPI: Preparing to enter system sleep state S3
In the full log (without SD) its visible, that suspend continues with:
drm card0: suspend
- somewhat unrelated to MMC I guess - unless MMC damages some structures somewhere.
BTW - am I correct when I assume that PM debugging actually immediately resumes and turns the screen into a green color (as this happened to me) ?
Created attachment 14885 [details]
Trace without inserted SD card
Here is full trace with resume when no SD card is inserted into the card reader.
[Well, I'm sorry I should have told you before.] Please boot the kernel with the "no_console_suspend" command line argument. Please also compile it with CONFIG_PM_VERBOSE unset, to reduce the noise.
Can you produce a suspend trace with SD inserted from the kernel booted with "no_console_suspend", please?
With no_console_suspend & SD inserted the log on screen ends just two lines afterward:
ACPI: PCI interrupt for device 0000:00:02.0 disabled
Here no futher output is visible - thought I'd would have guessed that maybe something still happens as the led diod stops flashing a bit later (unsure - but looks like this)
FYI:00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c) (prog-if 00 [VGA controller])
Apart from whatever problem we find with the SD driver, it looks like there's
going to be another problem. In drivers/mmc/core/core.c:mmc_suspend_host(), if
host->bus_ops->resume isn't set then the call to host->bus_ops->remove(host) is
likely to end up reaching device_del().
Created attachment 14889 [details]
Debug patch to try
That's bad. Apparently, someone attempted to avoid defining .resume() for some drivers by reusing the mmc_stop_host() code.
I'm not sure what's the right fix for that yet. In the meantime, however, we can check if this is what happens in the present case.
Zdenek, can you please if the attached patch has any effect? If the issue pointed out by Alan is triggered, it will just fail the suspend without hanging.
Created attachment 14892 [details]
error trace with inserted SD
I've got two results - I've also cut the begging part (as it could be seen in my prev post) - so just the suspend stage (using pm-suspend - to get proper console back when it succeeds)
This trace shows one kind of mutex bug - system was started with inserted SD card.
Created attachment 14893 [details]
error trace with inserted SD after first suspend
Here is another trace - among the already shown INFO trace (where should I actually post this - to lkml or create here another bug for this - in which section?) - you could see what happens when I insert SD card and try second suspend which fails - again - this kernel is with your attech debug patch.
Thanks a lot for the debugging work!
First, the patch triggers, which means that the problem discovered by Alan is troubling us. [Alan, do you have an idea how to fix that cleanly?]
Second, it seems we have a bug in the suspend error path in drivers/base/power/main.c .
Created attachment 14907 [details]
Patch to fix unbalanced mutex unlocking in dpm_suspend()
Well, the second issue was easy to fix. :-)
Created attachment 14921 [details]
Experimental patch to fix the problem
Please check if the attached patch helps. I haven't had the time to test it myself, so please be careful.
Created attachment 14926 [details]
Great we are moving forward :) - with this patch it is better - yet still not rock stable. Suspend looks to work good when I leave the card as it as - in the slot or outside.
Problem appears when I add/remove SD during suspend - in this attachment you can see the the traditional INFO report and around 47sec extra add message bactrace.
In my next post I'll attach png of the locked machine.
Created attachment 14927 [details]
screenshot of a dead machine
Here comes the picture of locked machine - hope this trace is enough.
Happens during SD card insert/removal during suspend.
Just for curiosity I've checked device_pm_remove - and I'm unsure why the dev->sem is taken twice - but is it normal to let the user know, kernel will deadlock ?
I would expect maybe failure during suspend - but IMHO causing deadlock and even knowing about it is weird...
(In reply to comment #18)
> Created an attachment (id=14927) [details]
> screenshot of a dead machine
> Here comes the picture of locked machine - hope this trace is enough.
> Happens during SD card insert/removal during suspend.
This seems to be a trace from device_pm_remove(). Did the machine lock up hard after that?
Yes - this is the screen snapshot of deadlocked machine - I could try later in home this with different resolution to get more line on display for snapshot - but maybe you are able to see the reason of deadlock without it?
(In reply to comment #21)
> Yes - this is the screen snapshot of deadlocked machine - I could try later
> home this with different resolution to get more line on display for snapshot
> but maybe you are able to see the reason of deadlock without it?
No, I'm not.
Pierre, do you use a separate workqueue for handling the removal of a card by the user?
Created attachment 14941 [details]
Patch to fix the (first) problem
Attached is a simplified version of the patch from Comment #16.
Zdenek, please give it a try if you can.
Created attachment 14942 [details]
screenshot of lock with more upper lines
I've been wrong in my previuos post - the machine was not completely dead - my magic-sysrq was not properly enabled at that time - thus I've thought the machine was dead. The machine stops - but MagicSysrq works and also I could scroll buffer a bit - so no better resolution (because with vga=0x36d after resume I get broken screen) -but you can see the lines above - mainly 'Suspicious device_pm_remove during suspend' the call trace - when I hit ShowPC it looks like CPU spends its time acpi_idle_enter_bm - but machine will not suspend (and sleeping led stays flashing)
Created attachment 14943 [details]
trace shows some inhibit errors
This trace shows some inhibit problem from MMC0:
Unsure how it's related with suspend ??
And as I can see you have posted new patch for test - I'll check it.
Tested new patch - works good - I could suspend - and as expected device_pm_remove suspend stop stays there - but otherwise it sleeps :)
Evidently there's a workqueue involved somewhere, but I don't know where. Also there may be a problem with your patch; subsystems may assume that after device_unregister() has returned, the driver's release() method has run -- which won't be the case. This could mean your patch won't be usable in general, and we'll just have to fix the individual drivers.
We really need Pierre's help with this... I don't understand how the MMC subsystem is intended to work.
(In reply to comment #1)
> Problem exposed by:
> commit 775b64d2b6ca37697de925f70799c710aab5849a
> Author: Rafael J. Wysocki <firstname.lastname@example.org>
> Date: Sat Jan 12 20:40:46 2008 +0100
> PM: Acquire device locks on suspend
Rafael, this commit probably triggers this bug:
see my last comment there.
Created attachment 14957 [details]
And I've got another deadlock - if that makes any help in this case.
I've got camera nearby :)
Created attachment 14958 [details]
Just to make clean which options I'm using for kernel compilation - I'm using preemptible kernel which might be the source of some problems ?
So the "unbalanced mutex unlock" patch is in mainline but not the "fix the first problem" patch (which looks to have a locking bug, "suspending task" should be assigned with the lock held) ... which I'm guessing is experimental.
My first reaction was that this seems similar to the MMC/SD problems I mentioned on 3-Feb in
It certainly had the same resolution: take SD card out before suspending.
I finally had a look at this, and my first conclusion is that the only code with a chance of being correct is the MMC_UNSAFE_RESUME code ... which isn't actually unsafe (!!) except maybe when the host controller driver (a) can't detect card removal during suspend, and (b) is broken in that it doesn't handle that sanely. Although, yet to test this here, the BUG_ON no-card is obviously the wrong premise... I could believe that part of why it's unsafe is that it doesn't work right!
It turns out that *ALL* the MMC controllers I have access to work sanely, and will detect such removals. (Actually it's done through the card detect GPIO, which is normally configured as a wakeup-capable IRQ.) So the default behavior is pretty broken.
Zdenek, does enabling that MMC_(NOT_REALLY)_UNSAFE_RESUME option behave at all on your system?
I'm going to try current GIT on a previously-working system, and see how it behaves. There does seem to be some recent unrelated breakage though ... as in, for reasons unkown to me the system wouldn't come out of suspend and the framebuffer never suspended. (And in fact, a console cursor kept blinking, so there seems to have been some problem in the suspend sequence.)
(In reply to comment #32)
> So the "unbalanced mutex unlock" patch is in mainline but not the "fix the
> first problem" patch (which looks to have a locking bug, "suspending task"
> should be assigned with the lock held)
Well, Alan convinced me that it could be done without the lock. Never mind, though, the patch isn't going anywhere.
Zdenek, can you please test with the patch from:
Created attachment 14962 [details]
OK, the appended patch works on my system against 2.6.25-rc2-git, once I configure the (not-really) "unsafe" resume thing ... with card inserted during suspend, or not (the "not" was previously wrongly flagged as a bug). (Hardware is, for the record, an at91sam9263ek development board.)
See patch comments for the unresolved case, and some other issues. I also ended up developing a patch like  because of some oddness I observed.
Patch from Comment #34 was incomplete. Please use this one instead:
Patch from Comment #36 was still incomplete. The updated version is at:
Patch : http://marc.info/?l=linux-acpi&m=120389632114090&w=2
(In reply to comment #23)
> Pierre, do you use a separate workqueue for handling the removal of a card by
> the user?
Not in this path, no. A real eject does go via a (mmc specific) workqueue though.
(In reply to comment #32)
> It turns out that *ALL* the MMC controllers I have access to work sanely, and
> will detect such removals. (Actually it's done through the card detect GPIO,
> which is normally configured as a wakeup-capable IRQ.) So the default
> is pretty broken.
It is unsafe because for the general case we cannot know that a removal (and reinsertion) cannot go by undetected. If you have a fix for that, then by all means submit a patch (or make the filesystems suspend aware). But as it stands, the default is not changing.
For the record, the MMC layer suspend behaviour used to work just fine (both with and without the UNSAFE option). So preferably the ability to remove devices during suspend should be restored. The card device is a child of the controller, so it should not yet be suspended at the point we try to remove it.
Created attachment 14979 [details]
Make MMC workqueue freezable
Never mind all my earlier patches. The problem of synchronization between the workqueue and the suspend method should be solved by making the workqueue freezable and converting the flush operations to specific cancellations. I haven't even compile-tested this patch, but the idea is sound.
Zdenek, please try this patch together with Rafael's latest from comment #38. (Dave, you could try the same thing.)
ok - I've tested now the patch from comment #38 - it works as previous patches - still softlock with device_pm_remove
I'll now check it with patch #40 - I assume together ?
Also the kernel after the commit 85b80ebfa4384b8ea30cc1af9617db30319a9ccd is something 'unreal' - why such thing gets in the git tree when there is already a lot of other broken things ??? - I would probably remove any -rc from such kernel....
Actually where do I get function create_freezable_workqueue for the MMC workque patch
It should be create_freezeable_workqueue().
I've good news :) I've been unable to freeze my box with SD card inserting removing (at least not as easily as before) when I combine #38 + #40 + $43
So this was nice - however now I'm currious how to bisect current merge in -rc3 - this seem to be hard task for bisect - resume will not resume at all.
Also in my test commit 85b80ebfa4384 I'm getting some other back traces probably unrelated to suspend changes and possibly fixed in the merged tree - however the merged tree doesn't work with suspend.
(In reply to comment #44)
> I've good news :) I've been unable to freeze my box with SD card inserting
> removing (at least not as easily as before) when I combine #38 + #40 + $43
> So this was nice - however now I'm currious how to bisect current merge
> in -rc3 - this seem to be hard task for bisect - resume will not resume
> at all.
Please try reverting:
git bisect points at this commit:
power_state: get rid of write-only variable in SATA
(In reply to comment #45)
> Please try reverting:
> git bisect points at this commit:
> commit 559bbe6cbd0d8c68d40076a5f7dc98e3bf5864b2.
> power_state: get rid of write-only variable in SATA
seems to be OK now.
I could also confirm that latest kernel suspend with this reverted commit - and few others- but going to make some more reports about bad locking
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a8d37a37380e2b1500592d40b7ec384dbebe7a0