Bug 42728
Description
Oleksij Rempel (fishor)
2012-02-04 18:03:57 UTC
These seem to be related:
1. https://bugs.freedesktop.org/show_bug.cgi?id=40241
2. https://bugzilla.kernel.org/show_bug.cgi?id=42691
-arne

Hmm... I just found this workaround for my model, and with it I get suspend without crashes:

EHCI_BUSES="0000:00:1a.0 0000:00:1d.0"
case "${1}" in
    hibernate|suspend)
        # Switch USB buses off
        for bus in $EHCI_BUSES; do
            echo -n "$bus" > /sys/bus/pci/drivers/ehci_hcd/unbind
        done
        ;;
    resume|thaw)
        # Switch USB buses back on
        for bus in $EHCI_BUSES; do
            echo -n "$bus" > /sys/bus/pci/drivers/ehci_hcd/bind
        done
        ;;
esac

What I still really care about is the memory corruption after a failed suspend. Why is it still present after a power off/on cycle? The only way to fix it is to start Windows and suspend it (just starting it will not help), or to open the laptop case and remove the battery.

Created attachment 72290 [details]
dmesg after failed suspend and broken power-on
Created attachment 72291 [details]
normal dmesg
Created attachment 72292 [details]
normal lspci
Created attachment 72293 [details]
normal intel_reg_dump
Created attachment 72294 [details]
broken lspci
Created attachment 72295 [details]
broken intel_reg_dump
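A slightly more defensive variant of the workaround script from the description could look like the sketch below. It is only an illustration, not a tested replacement: the function name ehci_toggle and its optional second argument (the driver directory, which makes it possible to try the logic against a scratch directory) are made up here. The sketch assumes that a bound controller shows up as an entry named after its PCI address inside the ehci_hcd driver directory.

```shell
# Sketch only. ehci_toggle and its second argument are hypothetical;
# the bus addresses are the ones from the original workaround.
EHCI_BUSES="0000:00:1a.0 0000:00:1d.0"

ehci_toggle() {
    ehci_action=$1
    ehci_drv=${2:-/sys/bus/pci/drivers/ehci_hcd}

    for ehci_bus in $EHCI_BUSES; do
        case "$ehci_action" in
            hibernate|suspend)
                # Unbind only controllers that are currently bound.
                if [ -e "$ehci_drv/$ehci_bus" ]; then
                    printf '%s' "$ehci_bus" > "$ehci_drv/unbind"
                fi
                ;;
            resume|thaw)
                # Bind only controllers that are not bound yet.
                if [ ! -e "$ehci_drv/$ehci_bus" ]; then
                    printf '%s' "$ehci_bus" > "$ehci_drv/bind"
                fi
                ;;
        esac
    done
}
```

Called as "ehci_toggle suspend" before sleeping and "ehci_toggle resume" afterwards, it should behave like the original script, except that it skips writes that would fail because the controller is already in the requested state.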
Did this work with a previous version of Linux? If so, which version?

It's unclear what you mean by "suspend without crashes" when using the EHCI workaround. Exactly what does the suspend/resume failure look like when the EHCI workaround is used?

It is not a regression. Older kernels, for example 3.0, need an EHCI + xHCI workaround. So the current kernel is in better shape! :)

I mean that with the workaround the computer is able to suspend and resume without noticeable problems. I will do some more memory checks, but it looks promising now. No spontaneous crashes or segfaults have been noticed.

You probably mean: what does the failure look like if the EHCI workaround is _not_ used? In the first place it can't suspend; the laptop freezes at some stage with a black screen and a non-blinking cursor at the top left. I also tried to access this laptop over ssh, without luck. But there is no built-in ethernet; I use a USB-LAN dongle. So there is no guarantee whether some part is still alive and not completely frozen.

After a failed suspend, I usually need to power off the laptop by holding the power button. After this the laptop is almost _unusable_: the system crashes... Memtest86+ reports many corrupted memory blocks, starting at ~1GB. Reboots or longer power-offs do not help. There are only two ways to "repair" this laptop afterwards:
- somehow start Windows (avoiding the BSODs) and suspend-resume it,
- or open the laptop case and detach the built-in battery. I mean _not_ the BIOS battery.

Have you tried booting with no_console_suspend on the command line and CONFIG_USB_DEBUG and CONFIG_DEBUG_DRIVER enabled?

It sounds like the suspend never finishes. Instead the computer is left in some invalid intermediate state, with devices partially configured. Then when the power is turned back on, the BIOS does not go through a full reset.

Created attachment 72316 [details]
oops trace
This trace was made with "nomodeset no_console_suspend vga=791 oops=panic". I also tried to suspend with the same settings + workaround, but I got a similar trace. Probably the reason for this oops is nomodeset. Without "nomodeset" I get only a black screen, with no way to get a trace.
Jesse, can you look at this? A suspend problem related to "nomodeset" is preventing us from debugging a different suspend problem!

Unfortunately, we don't really support nomodeset on recent hardware. If you want to stay in VGA mode to debug suspend w/usb, it would be best to blacklist the i915 driver altogether...

Status update: I blacklisted agp* and i915 to use plain VGA. From dmesg I can confirm that agp and i915 are not used. With this configuration I was also not able to get an oops trace; I get a black screen with a cursor. Workaround + blacklist is able to suspend. I also tried removing uvcvideo and btusb instead of using the workaround, but this did not fix the issue.

If nomodeset + suspend is no longer supported on new devices, then modeset + suspend + no_console_suspend should be fixed. Jesse?

Created attachment 72351 [details]
debug port
There is some kind of debug port on this device. Maybe it is just a normal serial port, but I do not know where I can find this kind of connector or what words I should google.
Hmm, I wonder how the EHCI workaround could be related to the i915 oops, anybody?

I doubt they are related at all. But then, I have no idea what the underlying problem is, so what do I know?

The best approach is probably to boot without nomodeset and to use i915. Also, for testing it will help to do "echo 0 >/sys/power/pm_async" first. If you enable CONFIG_DEBUG_DRIVER and switch to a VT and set the console log level to 8 before suspending, you'll get a lot of debug info on the screen during the suspend.

What happens if you do "echo devices >/sys/power/pm_test" before suspending? Or any of the other possible pm_test options (freezer, platform, processors, or core)?

Finally, if necessary you can slow down the suspend procedure by adding something like "msleep(100);" to drivers/base/power/main.c:dpm_run_callback().

I will do the test this weekend. If you have more test suggestions, keep writing.

Hi, I have the memory corruption problem: BSOD in Windows, random kernel panics in Ubuntu, and corrupt memory reported by Memtest86. I tried the "repair" advice but none of it worked: I get a BSOD when booting Windows, and I pulled the battery but the memory is still corrupt. Any advice before I try to send it in? Anything you want me to test?

Just curious. This device has "Intel Anti-Theft", which should make the computer useless if some special configuration is set. Maybe we hit some configuration bit at suspend time?

I tested different pm_test settings with mem > state. All tests were OK: it suspended the devices... and then came back. No freezes. But then I made a discovery: after power off/on, memory was corrupted! After some more testing I found that devices > pm_test is OK, with no memory corruption. After processors > pm_test, there is memory corruption.

Steps to reproduce:
- echo 0 > pm_async
- echo processors > pm_test
- echo mem > state (here the display goes off and the USB mouse is powered off too; after one second it returns to the console... no freezes)
- poweroff
After power-on, start memtest... here I get lots of bad memory blocks.

Created attachment 72555 [details]
dmesg after processors > pm_test
This dmesg was grabbed after processors > pm_test; mem > state.
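The pm_test steps above can be wrapped in a small helper, sketched below. This is only an illustration under some assumptions: the name run_pm_test and the optional second argument (the directory to write to, so the sequence can be tried against a scratch directory) are invented here, and writing "none" back to pm_test at the end is an addition that is not part of the original steps.

```shell
# Sketch only. run_pm_test is a hypothetical helper; the second argument
# defaults to the real /sys/power directory.
run_pm_test() {
    pm_level=$1
    pm_dir=${2:-/sys/power}

    # Accept only the pm_test levels mentioned in this report.
    case "$pm_level" in
        freezer|devices|platform|processors|core) ;;
        *) echo "unknown pm_test level: $pm_level" >&2; return 1 ;;
    esac

    # Disable asynchronous suspend/resume of devices.
    echo 0 > "$pm_dir/pm_async" || return 1
    # Choose how far the test suspend should go before backing out.
    echo "$pm_level" > "$pm_dir/pm_test" || return 1
    # Start the test suspend.
    echo mem > "$pm_dir/state" || return 1
    # Addition to the original steps: switch test mode off again.
    echo none > "$pm_dir/pm_test"
}
```

On a real system it has to run as root, for example "run_pm_test processors", followed by poweroff and a memtest86+ run after the next power-on.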
If memtest shows bad memory blocks then there's probably something wrong with the firmware or the hardware on the computer's motherboard. Have you tried updating the BIOS?

The BIOS is up to date. And:
- memtest shows bad memory blocks _only_ after a bad suspend.
- there are no bad memory blocks after a complete power-off (battery and AC both off).
- after bad suspend + freeze + poweroff, the laptop remains in some kind of suspend mode. For example, after poweroff (AC stays connected), the power indicator may blink (which means suspend mode). This could be the reason for the bad memory blocks.
- suspend works perfectly with Windows.
This processor has a built-in memory controller... maybe this is the reason?

Before you said that memtest shows bad memory blocks after a _good_ pm_test=processors suspend test followed by power-off, power-on. No freeze involved.

Of course the memory controller _could_ be the reason. I have no way to tell. Maybe you should try posting a message on linux-pm@vger.kernel.org and perhaps also linux-kernel@vger.kernel.org, to see if anybody else has a good idea. Include the information in comment #22.

This bug is starting to be too confusing. pm_test=devices is good. pm_test=processors is bad.

For me, I didn't get memory corruption even though I did a suspend or two before adding the workaround from comment #2, so the memory corruption might not be completely related to the resume failure. The resuming just failed. With the workaround everything seems fine, but it would of course be nicer to have it work out-of-the-box. BIOS version 210.

Hi Timo, what CPU is in your laptop? RAM? Mine is an i7-2677M with 4GB RAM. How did you check for memory corruption? In my comment #22 I said that I can reproduce the corruption even after pm_test=processors. It seems like the USB workaround works because the USB mmap on my laptop is in the affected memory range.
Timo, can you please run the following test:
- cd /sys/power/
- echo 0 > pm_async
- echo processors > pm_test
- echo mem > state
- poweroff
- after power-on, start memtest86+

I've an i5-2557M model with 4GB RAM. I let memtest86+ run for one full pass and then some, without errors. I also haven't experienced hangups. However, I'm using the laptop for work so I wouldn't want to experiment with the memory corruption right now. Instructions for removing the battery could be useful anyway; there appear to be 8 screws at the bottom. Confirming that only those need to be unscrewed and that the battery can then be safely removed and reattached could help encourage me or someone else to experiment with this. Obviously testing will be needed at some point whether or not someone comes up with a potential patch.

I got the memory corruption also. I left memtest running until the battery was completely drained. Then I plugged it in and started Windows and did a suspend/resume as fast as I could, and now I think it's back to normal. Just letting the battery drain and then starting Linux wasn't enough, the corruption came back after a few minutes (??).

Everybody suffering from this problem: Please add a comment containing the output from "lspci -nv -s 1d.0". No need to attach it, the output will be small enough to paste it in directly. Are the affected machines all ASUS, like Oleksij's?

Alan:

$ lspci -nv -s 1d.0
00:1d.0 0c03: 8086:1c26 (rev 05) (prog-if 20 [EHCI])
        Subsystem: 1043:1427
        Flags: bus master, medium devsel, latency 0, IRQ 23
        Memory at dfe07000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [50] Power Management version 2
        Capabilities: [58] Debug port: BAR=1 offset=00a0
        Capabilities: [98] PCI Advanced Features
        Kernel driver in use: ehci_hcd

AFAIK this is an Asus Zenbook UX21/UX31 specific problem.
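For anyone who only wants to compare the vendor:device and subsystem IDs with the ones pasted here, they can also be read straight from sysfs. The sketch below is just an illustration: the function name pci_ids and its optional second argument (the devices directory, so the function can be tried against a scratch tree) are made up, and it prints only the two ID pairs, not the full lspci output requested above.

```shell
# Sketch only. pci_ids is a hypothetical helper; it reads the standard
# vendor, device, subsystem_vendor and subsystem_device sysfs attributes.
pci_ids() {
    ids_dev=${1:-0000:00:1d.0}
    ids_root=${2:-/sys/bus/pci/devices}
    ids_path="$ids_root/$ids_dev"

    if [ ! -r "$ids_path/vendor" ]; then
        echo "no such device: $ids_dev" >&2
        return 1
    fi

    ids_v=$(cat "$ids_path/vendor")
    ids_d=$(cat "$ids_path/device")
    ids_sv=$(cat "$ids_path/subsystem_vendor")
    ids_sd=$(cat "$ids_path/subsystem_device")

    # sysfs prefixes each ID with 0x; strip it to match "lspci -n" style.
    printf '%s:%s %s:%s\n' "${ids_v#0x}" "${ids_d#0x}" "${ids_sv#0x}" "${ids_sd#0x}"
}
```

Run without arguments it looks at 0000:00:1d.0; the first pair corresponds to the ID after the class code in the lspci line and the second pair to the "Subsystem:" line.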
00:1d.0 0c03: 8086:1c26 (rev 05) (prog-if 20 [EHCI])
        Subsystem: 1043:1427
        Flags: bus master, medium devsel, latency 0, IRQ 23
        Memory at dfe07000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [50] Power Management version 2
        Capabilities: [58] Debug port: BAR=1 offset=00a0
        Capabilities: [98] PCI Advanced Features
        Kernel driver in use: ehci_hcd
        Kernel modules: ehci-hcd

sudo lspci -nv -s 1d.0
00:1d.0 0c03: 8086:1c26 (rev 05) (prog-if 20 [EHCI])
        Subsystem: 1043:1427
        Flags: bus master, medium devsel, latency 0, IRQ 23
        Memory at dfe07000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [50] Power Management version 2
        Capabilities: [58] Debug port: BAR=1 offset=00a0
        Capabilities: [98] PCI Advanced Features
        Kernel driver in use: ehci_hcd

Created attachment 73049 [details]
Keep EHCI controllers in D0 during suspend on ASUS

The numbers are all the same, so this patch ought to work for all of you. It prevents the EHCI controllers on ASUS computers using Intel's 6 Series/C200 Series chipset from being put in low power during system sleep. Evidently that is causing some people's systems to crash, and quite likely it's causing the memory corruption on yours.

Anyway, please try the patch without using the script mentioned in comment #2. It was written against 3.4-rc4, so it may need slight adjustments to apply to your kernel sources.

Cool! It works, just with echo mem > /sys/power/state. Memtest does not show any errors.

Just to confirm your suggestion, I checked what happens with Windows 7. EHCI uses the MS driver _and_ AiCharger.sys. If the laptop is suspended in Windows, EHCI continues to provide power. With this patch Linux does the same.

Once we have found exactly what the bug is, I will try to reproduce it on Windows. I will probably need to remove some of the ASUS software. If that turns out to be possible, it would be good to put some pressure on ASUS. They support only Windows, so if it is broken there, they will need to fix it. Stay tuned!

I did some Windows testing with AiCharger removed.
Here is some more info about this software: http://event.asus.com/mb/2010/ai_charger/

Unexpectedly, it didn't make any difference. After suspend, the USB ports still provide power, so I assume they are in the D0 state. Maybe the firmware provides the correct values? I dug a bit in the DSDT and found this part:

Device (EHC1)
{
    Name (_ADR, 0x001D0000)
    OperationRegion (PWKE, PCI_Config, 0x62, 0x04)
    ....
    Method (_S3D, 0, NotSerialized)
    {
        Return (0x02)
    }
    Method (_S4D, 0, NotSerialized)
    {
        Return (0x02)
    }

If I understand correctly, EHC1 is "00:1d.0" and _S3D is one of the "Methods that return the lowest D-state values". That means it will return D2, not D3. Is that correct? Does the kernel use this method to get the D-state for EHCI?

For the same device there is one more method:

Method (_PSW, 1, NotSerialized)
{
    If (Arg0)
    {
        Store (Ones, PWUC)
    }
    Else
    {
        Store (0x00, PWUC)
    }
}

This method is deprecated as of ACPI 3.0, but it is responsible for waking the laptop from the S3 state on a USB event. I mention it because the workaround description says: "However as a side effect, the controller will not respond to remote wakeup requests while the system is asleep. Hence USB wakeup is not functional -- but of course, this is already true in the current state of affairs."

The USB ports continue to be powered even in D3, because many USB devices need bus power in order to maintain their state and to report wakeup events.

I don't know exactly how the settings are managed. Maybe Linux's PCI core uses both the native PCI PM methods and the ACPI methods.

Hmm... I just made this patch to find the D-state used for suspend:

+	/*
+	 * Some systems crash if an EHCI controller is in D3 during
+	 * a sleep transition. We have to leave such controllers in D0.
+	 */
+	if (hcd->broken_pci_sleep) {
+		retval = pci_prepare_to_sleep(pci_dev);
+		printk("TTTTT, %i\n", retval);
+		dev_dbg(dev, "Staying in PCI D0\n");
+		return retval;
+	}

but the system froze, the same way as before, with memory corruption. It looks like pci_prepare_to_sleep() is the cause of this issue.
No, the cause is the fact that the controller is in D3 when the system suspends. However you are correct that pci_prepare_to_sleep() calls pci_set_power_state(), which calls pci_raw_set_power_state(), which puts the controller into D3.

All this power management stuff is new to me, so please be patient with me. I did some more digging. pci_prepare_to_sleep() calls pci_target_state() to get the supported state. pci_target_state() checks platform_pci_power_manageable(), which in turn checks whether a struct pci_platform_pm exists. Since it does not exist, it tries to calculate the supported state by itself. The problem is that acpi_pci_init() does call pci_set_platform_pm(). ACPI _is_ used for PM, but not for EHCI. Why?

Hmm... platform_pci_power_manageable() returns pci_platform_pm->is_manageable(dev) = false for all devices. Is it possible that pci_platform_pm is supported and loaded, but none of the devices is_manageable?

I am able to reproduce this on an Acer 4830TG-6808 also. Using the workaround from comment #2 resolves this.

Steven, does the patch attached to this email message:
http://marc.info/?l=linux-pm&m=133563455220196&w=2
fix your problem? If not, can you provide the output from "lspci -nv -s 1d.0"?

I have a non-Asus notebook with similar symptoms: suspend-to-RAM hangs, but I do *not* see the memory corruption issue. Exactly the same workaround script as in Comment #2 makes things work properly. I note that after resuming,
[ 52.899028] usb 1-1: clear tt 1 (9042) error -19
appears in dmesg.

I'm currently using kernel 3.3.5, which as I understand it includes a patch to fix this on Asus machines. Is there a way I could test this machine to see if the patch should be broadened? (I've attached my lspci and dmesg below.)
Created attachment 73224 [details]
lspci -vvvv from Comment #48 notebook

Created attachment 73225 [details]
dmesg from Comment #48 notebook, across suspend/resume cycle with workaround script

You should test this patch instead:
http://marc.info/?l=linux-kernel&m=133582194609000&w=2

(In reply to comment #51)
> You should test this patch instead:
>
> http://marc.info/?l=linux-kernel&m=133582194609000&w=2

Just tested it, built against 3.3.5 from Fedora sources -- seems to work! Thanks! Any chance this will make it into the 3.3 series?

It will appear in a 3.3-stable release in the not-too-distant future. Probably either the next one or the one after that.

(In reply to comment #48)
> I'm currently using kernel 3.3.5, which as I understand it includes a patch to
> fix this on Asus machines. Is there a way I could test this machine to see if
> the patch should be broadened? (I've attached my lspci and dmesg below.)

Oh this has been integrated and released? Why didn't anyone say so? :-) Would be good to have in 3.2.x also.

This commit, right?
http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commit;h=1ce9245f5aff46201fa81fdd3f796a6c9f3ad1ab

(Guess it's not NEEDINFO anymore then!)

Created attachment 73392 [details]
ACPI / PCI / PM: Check _SxD/_SxW for devices whose power states aren't managed by ACPI
Can anyone please check if this patch is sufficient to fix the problem?
Created attachment 73394 [details]
ACPI / PCI / PM: Check _SxD/_SxW for devices whose power states aren't managed by ACPI
Sorry, please try this one instead.
Created attachment 73397 [details]
ACPI / PCI / PM: Check _SxD/_SxW for devices whose power states aren't managed by ACPI
One more adjustment. Please test this one.
To be precise, please revert commit 151b61284776 from the 3.4.0 kernel, apply the patch from comment #57 and see if it still works.

Unfortunately, commit 151b61284776 has introduced a regression reported in bug #43278 and we need to find an alternative approach to address this issue.

(In reply to comment #57)
> Created an attachment (id=73397) [details]
> ACPI / PCI / PM: Check _SxD/_SxW for devices whose power states aren't managed
> by ACPI
>
> One more adjustment. Please test this one.

Tried it built against 3.4 (Fedora 17 sources, kernel-3.4.0-1.fc17), didn't work. (The patch from Comment #52 works for me, when used in a 3.3-series kernel. I could try it in 3.4 later on to see if it still works.)

(In reply to comment #59)
Note that Comment #59 applies to a kernel without commit 151b61284776 reverted. I'm working on rolling a reverted one now...

Created attachment 73399 [details]
ACPI / PCI / PM: Check _SxD/_SxW for devices whose power states aren't managed by ACPI
Hmm. What about this one?
(In reply to comment #61)
> Created an attachment (id=73399) [details]
> ACPI / PCI / PM: Check _SxD/_SxW for devices whose power states aren't managed
> by ACPI
>
> Hmm. What about this one?

With or without 151b61284776 reverted?

With commit 151b61284776 reverted, please.

Reverting 151b61284776 + the patch from Comment #61 works.

OK, this DSDT method looks like the one in the Asus UX31E but triggers some new bug. Hm... According to this DSDT, if we enter the S3 sleep state, we should put the USB controller in the acpi_d2 state. Normally we assume acpi_d2 == pci_d2; at least the current code does this. But according to the controller documentation (for the Asus UX31E, not for the hardware in this case) the USB controller does not support the D2 state. On my hardware I can pass pci_D2 and it will just be ignored; it will continue to stay in pci_D0. What if this controller does not support pci_D2 and that causes this problem?

Oops, posted it to the wrong bug report.

OK, thanks for testing!

(In reply to comment #67)
> OK, thanks for testing!

Just want to point out Comment #48, to avoid any confusion -- my notebook's not an Asus, but I think it does have the same USB chips and fails to suspend in similar circumstances without either (revert + above patch) or (workarounds script). [I could file a separate bug report if appropriate.]

That's OK. Let's keep things in one place.

*** Bug 43064 has been marked as a duplicate of this bug. ***

Commit dbf0e4c7257f8d684ec1a3c919853464293de66e (PCI: EHCI: fix crash during suspend on ASUS computers) should be the final, correct fix for this problem. This bug report can be closed out.

The 3.5 kernel just landed in Fedora 17, including the fix for this problem. The hibernate situation is *much* improved on my UX31E, but doesn't appear to be 100% fixed yet.

Old behaviour:
1. Attempt to hibernate hangs with the screen in console mode and a couple of messages about ALSA shutting down
2. CPU fan spins up to full speed
3. System stays in that state until I press and hold the power button
4. On restart, system cold boots instead of returning from hibernate

New behaviour:
1. Attempt to hibernate hangs with the screen in console mode and a couple of messages about ALSA shutting down
2. System stays in that state until I press and hold the power button
3. On restart, system returns from hibernate as expected

It appears the system state is now getting saved properly, but there's still something going wrong where the actual "power down now" command isn't being issued properly. I'm not sure if that's a completely separate bug or a continuation of this one, though - happy to file a new report if you prefer.

It's almost certainly a separate bug. You're better off starting a new bug report.

Hi Alan, this problem seems to be related. See bug #45811: I was able to reproduce working suspend to disk just by unbinding ehci_hcd, like in this bug.

A patch referencing this bug report has been merged in Linux v3.4-rc5:

commit 151b61284776be2d6f02d48c23c3625678960b97
Author: Alan Stern <stern@rowland.harvard.edu>
Date: Tue Apr 24 14:07:22 2012 -0400

    USB: EHCI: fix crash during suspend on ASUS computers

A patch referencing a commit referencing this bug report has been merged in Linux v3.5-rc3:

commit c2fb8a3fa25513de8fedb38509b1f15a5bbee47b
Author: Alan Stern <stern@rowland.harvard.edu>
Date: Wed Jun 13 11:20:19 2012 -0400

    USB: add NO_D3_DURING_SLEEP flag and revert 151b61284776be2

Please re-open if the problem is still present in Linux 3.5 or later.

I find that Linux 3.7.6 has the same problem on the UX31E, but 3.6.6 is OK for me.

If you have a new problem then maybe you can use git bisect to determine what commit is responsible.

#79, just like the original post... the memory will be corrupted if I hit the bug (sometimes grub will just crash due to memory corruption), and I have to leave the machine alone for more than 30 minutes to bring it back to life. (The UX31E battery is not removable.) I guess it's kind of impossible for me to bisect... I'm willing to test a single patch.
My lspci is the same as the others'.

Why is it impossible for you to bisect? It might be slow but it ought to work. You already know that 3.6.6 is okay and 3.7.7 isn't. What about 3.6 and 3.7? What about 3.7-rc1? What about the patch in comment #36?