Bug 86241
Summary: | Regression: Suspend to RAM randomly fails on Dell XPS 13 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Vincent Petry (PVince81) |
Component: | Other | Assignee: | Tomas Winkler (tomas.winkler) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | alan, conflatulence, gabriele.mzt, jiri.tyr, lenb, mamadontgodaddycomehome, mattia.b89, r087r70, rui.zhang, tomas.winkler, tomasw, vbourachot |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 3.16.4-1.g7a8842b-desktop | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments:
- dmesg from simulated suspend
- All loaded modules
- screen hang at shutdown
- mei dmesg debug message from real s2ram
- mei dmesg debug message from successful s2ram
- Disable mei on XPS13 9333
- suspend issue possible fix
Description
Vincent Petry
2014-10-14 12:35:13 UTC
Copying Roberto's environment info here:
System: dell-xps13-sputnik-2014 (Intel i7-4650U), 8GB ram, NO SWAP SPACE (maybe relevant?)
OS: Kubuntu 14.04 x86_64 with kernel 3.16.0-17

It is likely that the bug is related to kernel 3.16 as we both have this major version.

I'm also affected by this bug (same laptop and OS as Vincent). In my case I also have problems shutting down the system: it freezes with the screen saying something like: "The system will halt NOW..."

For the suspend to RAM issue, whenever we know it's likely to happen soon (after a day of laptop use), here are a few things we could try:
- Log out from KDE and shut down the X server, then run pm-suspend
- Unload vboxdrv and other modules like bluetooth
Based on these results we can at least find out whether it's anything in KDE that is causing the issue, which would give us directions about what to try next. Roberto, do you also have Virtualbox installed ?

And obviously something else to try would be to upgrade the kernel to 3.17.

I do have virtualbox installed, even though I never use it. And I'm on mainline kernel 3.17.

Just had the crash again: unloading all virtualbox modules didn't help.

Tried with the X server shut down, doesn't help.

Yesterday I shut down KDE but forgot to shut down Xorg, and also got the crash. So now I guess the next step is trying to stop services, then also manually unload kernel modules. Another idea would be to downgrade the kernel until it works again to find the version in which the regression was introduced (might be tedious).

I just found this: https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt
Scroll down to "Testing suspend to RAM". It seems it's possible to test different phases of suspend to identify what's going wrong. I'll try this in a few hours when the likelihood of a crash is higher.

Created attachment 153951 [details]
dmesg from simulated suspend
I tried the following:
# echo devices > /sys/power/pm_test
# echo platform > /sys/power/disk
# echo disk > /sys/power/state
# s2ram
and later
# echo core > /sys/power/pm_test
# s2ram
all worked. I've attached the result dmesg from the "core" test in case it can contain clues.
However after that I set it back to "none" for the real suspend to ram, and then the system hung again.
There is a suspicious stack trace in that log, but not sure if related to the issue itself.
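
For anyone repeating these dry runs, here is a minimal sketch of one test cycle as basic-pm-debugging.txt describes it for suspend to RAM: write a level to pm_test, then write "mem" to state. The `SYSFS_POWER` variable and the function names are made up for this illustration; `SYSFS_POWER` only exists so the functions can be pointed at a scratch directory instead of the real /sys/power, which needs root.

```shell
#!/bin/sh
# Sketch only: simulated suspend-to-RAM cycles via /sys/power/pm_test.
# SYSFS_POWER is a hypothetical override (not a kernel or pm-utils setting).
SYSFS_POWER="${SYSFS_POWER:-/sys/power}"

# Test levels listed in basic-pm-debugging.txt, shallowest to deepest.
PM_TEST_LEVELS="freezer devices platform processors core"

pm_test_cycle() {
    level="$1"
    # Choose how far the suspend sequence goes before it backs out.
    echo "$level" > "$SYSFS_POWER/pm_test" || return 1
    # Request suspend to RAM; in test mode the kernel resumes by itself.
    echo mem > "$SYSFS_POWER/state" || return 1
}

pm_test_reset() {
    # Back to normal operation, so the next suspend is a real one.
    echo none > "$SYSFS_POWER/pm_test"
}
```

A full pass could then be `for l in $PM_TEST_LEVELS; do pm_test_cycle "$l" || break; done; pm_test_reset`. As seen above, all levels passing does not guarantee that the real suspend works.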
Today when closing the lid it hung but this time it didn't shut down the screen. When I reopened I could still see the desktop but everything was frozen. Just mentioning this in case it's another subtle clue. I think next up to test would be unloading kernel modules like wifi, usb and others, before running suspend to ram.

I had two more unsuccessful tries:
1) rmmod of iwlwifi,mac80211,iwlmvm
2) rmmod of xhci_hcd, sdhci, sdhci_acpi, mmc_core
Maybe next time I'll try and remove even more modules.

Note. openSUSE Factory upgraded me to kernel 3.16.4-1.g7a8842b-desktop but the problem persists here.

Another failed test run where I removed snd, iwlwifi, bluetooth, *hci. Here are the remaining modules:

Module                      Size  Used by
ctr                        13049  0
ccm                        17773  0
af_packet                  39991  0
fuse                      100461  0
6lowpan_iphc               18702  0
binfmt_misc                17468  1
hid_logitech_dj            18469  0
uvcvideo                   89131  0
videobuf2_vmalloc          13216  1 uvcvideo
videobuf2_memops           13362  1 videobuf2_vmalloc
videobuf2_core             63200  1 uvcvideo
v4l2_common                15265  1 videobuf2_core
videodev                  157329  3 uvcvideo,v4l2_common,videobuf2_core
hid_multitouch             17419  0
dell_wmi                   12681  0
sparse_keymap              13948  1 dell_wmi
iTCO_wdt                   13480  0
iTCO_vendor_support        13718  1 iTCO_wdt
hid_rmi                    17528  0
x86_pkg_temp_thermal       14205  0
intel_powerclamp           18823  0
coretemp                   13441  0
dcdbas                     14978  0
crct10dif_pclmul           14268  0
crc32_pclmul               13133  0
ghash_clmulni_intel        13230  0
aesni_intel               152552  0
aes_x86_64                 17131  1 aesni_intel
lrw                        13286  1 aesni_intel
gf128mul                   14951  1 lrw
glue_helper                13990  1 aesni_intel
ablk_helper                13597  1 aesni_intel
cryptd                     16263  3 ghash_clmulni_intel,aesni_intel,ablk_helper
arc4                       12608  0
serio_raw                  13434  0
lpc_ich                    21093  0
mfd_core                   13435  1 lpc_ich
i2c_i801                   22454  0
shpchp                     32951  0
wmi                        19193  1 dell_wmi
thermal                    22971  0
battery                    23237  0
i2c_hid                    18726  0
mei_me                     23664  0
ac                         13335  0
i2c_designware_platform    12979  0
soundcore                  15047  0
8250_dw                    13551  0
i2c_designware_core        14768  1 i2c_designware_platform
mei                        96067  1 mei_me
processor                  40484  0
dm_mod                    111114  0
btrfs                    1006855  1
xor                        21411  1 btrfs
raid6_pq                  106004  1 btrfs
crc32c_intel               22094  1
i915                      983484  1
i2c_algo_bit               13413  1 i915
video                      24419  1 i915
drm_kms_helper             65670  1 i915
drm                       335594  3 i915,drm_kms_helper
button                     13971  1 i915
sg                         40630  0

Created attachment 154551 [details]
All loaded modules
For reference I've attached the list of ALL loaded modules on the Dell XPS 13 when on the desktop (KDE).
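
When narrowing things down by unloading modules, it can help to list the ones nothing else currently uses, since those are the ones rmmod can normally remove directly. A minimal sketch, assuming the usual three-column lsmod layout shown above (the function name `unused_modules` is made up for this example):

```shell
#!/bin/sh
# Sketch only: print the names of modules whose use count is 0.
# Reads lsmod-style text on stdin: "Module Size Used by" header, then rows.
unused_modules() {
    # NR > 1 skips the header line; column 3 is the use count.
    awk 'NR > 1 && $3 == 0 { print $1 }'
}
```

Typical use would be `lsmod | unused_modules`. A module with a non-zero count (for example mei, which is used by mei_me) has to wait until its users are unloaded first.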
Yesterday I tested the standby/shutdown issue using a usb-key with a live linux distribution (tried both kali-live and kubuntu-live). They worked flawlessly. But again, it may depend on the time the system is active, so I'll test this again thoroughly in the next few days...

Does it happen if you fill up ram with loads of stuff then suspend/resume even early on ?

With 8GB ram I have to think how to fill it up. I'll try.

You could mount ramdisks (tmpfs) and fill them with files. Another thing to try: write down the memory usage after login and before suspend. That could at least confirm whether it's related to the ram usage somehow. Like I said previously, suspend to ram used to work flawlessly in openSUSE 13.1 with kernel 3.11.10-21 so it's not like it never worked before. Roberto, what kernel version did you have on that USB stick ?

find /usr -exec grep -H Bananas {} \;
or similar with a very large file system of big files will fill memory up with tons and tons of cached file data. That may not be the same as tons and tons of dirty app data, but it's a first test.

After boot, pm-suspend worked fine after filling up /dev/shm with some large files. Cached memory should not be the problem here.

This is free -h before the suspend crash after a few hours of use:

                   total  used  free  shared  buffers  cached
Mem:                7.3G  7.1G  254M    340M       8K    3.4G
-/+ buffers/cache:        3.7G  3.7G
Swap:                15G  240K   15G

After the crash and directly after the reboot I created a tmpfs virtual drive:
sudo mount -t tmpfs -o size=6024M tmpfs temp3
Then I copied a VM inside until it said "no more free space". The used memory was also around 7G (forgot to save it) and the suspend did not crash. I'll try Alan's approach later.

Marked as a regression on Intel HW. Can you reproduce using the upstream Linux kernel? Can you identify the latest kernel that worked (and the first that failed), say, by using git-bisect?

Hello, as stated in comment #4 I've tried with 3.17 mainline kernel.
In my case, everything earlier than the factory-installed ubuntu-12.04 produces the error.

git bisect is an idea but will take quite some time as we still don't know how to trigger the issue reliably, apart from using the laptop for a few hours. So I would do that only as a last resort, or as soon as we know how to produce the issue quickly.

In my case I tried running "free -h" before every suspend, and this is the result so far:

Failed suspends:

                   total  used  free  shared  buffers  cached
Mem:                7.3G  4.1G  3.2G    2.2M     452K    4.0G
-/+ buffers/cache:        135M  7.2G
Swap:                15G    0B   15G

                   total  used  free  shared  buffers  cached
Mem:                7.3G  6.5G  910M    329M     880K    4.1G
-/+ buffers/cache:        2.4G  5.0G
Swap:                15G    0B   15G

                   total  used  free  shared  buffers  cached
Mem:                7.3G  6.1G  1.3G    260M     880K    4.0G
-/+ buffers/cache:        2.1G  5.3G
Swap:                15G    0B   15G

Succeeded suspends:

                   total  used  free  shared  buffers  cached
Mem:                7.3G  3.7G  3.7G    268M     880K    2.1G
-/+ buffers/cache:        1.6G  5.8G
Swap:                15G    0B   15G

                   total  used  free  shared  buffers  cached
Mem:                7.3G  4.5G  2.9G    261M     880K    2.0G
-/+ buffers/cache:        2.5G  4.9G
Swap:                15G    0B   15G

                   total  used  free  shared  buffers  cached
Mem:                7.3G  5.0G  2.3G    284M     880K    2.3G
-/+ buffers/cache:        2.8G  4.6G
Swap:                15G    0B   15G

                   total  used  free  shared  buffers  cached
Mem:                7.3G  5.4G  2.0G    273M     880K    2.4G
-/+ buffers/cache:        2.9G  4.4G
Swap:                15G    0B   15G

                   total  used  free  shared  buffers  cached
Mem:                7.3G  5.4G  2.0G    266M     880K    2.5G
-/+ buffers/cache:        2.9G  4.4G
Swap:                15G    0B   15G

I wonder if there's a limit (MAX_INT?) after which the cached memory will cause suspend to fail (if it's really memory-related).

I tried Alan's technique with:
find /usr -exec grep -H Bananas {} \;
Firstly, I didn't find any Bananas ;-)
Secondly, the cached memory did fill up as expected, but running suspend to RAM after that still worked properly...

                   total  used  free  shared  buffers  cached
Mem:                7.3G  7.2G  131M    299M       8K    5.2G
-/+ buffers/cache:        2.0G  5.3G
Swap:                15G  2.2M   15G

Other failures I had were with cached memory > 4 GB. So it looks like the issue is not memory related.
Usually it happens after 3 hours of uptime. What else could build up over time ?

Additional note: yesterday I had another hanging case where the screen stayed on (this is quite rare), I saw KDE's lock screen but everything was frozen. Which means that the part that causes the system to freeze might sometimes happen even before the screen is suspended. Also I still don't understand why running the suspend tests all work fine but only the actual suspend will freeze. Is there something that is not covered by the tests from here https://bugzilla.kernel.org/show_bug.cgi?id=86241#c9 ?

Tried to drop the ram caches with
echo 3 > /proc/sys/vm/drop_caches
before suspending. Still hung. It confirms that it's probably not memory related.

Tried booting 3.18.0-rc2-g7a8842b but had trouble booting. Will try with older versions and see if I can identify which version introduced the issue. Might take several days, stay tuned ;-)

I compiled and installed 3.15.0-1.g7a8842b-desktop
Today I didn't experience any issues with suspend to RAM. So the regression is between 3.15.0 and 3.16.4. Will try and find a narrower interval.

Vincent: please test the 3.15 thoroughly, we need to be totally sure about it functioning! thank you

Kernel 3.16.0 failed. Next up: 3.15.10
Roberto: don't worry, once I find the threshold I'll double check.

Laptop survived one work day with 2 suspends with 3.15.10. The breaking change is likely to be between 3.15.10 and 3.16.0

Got a crash today with 3.16-rc2. I couldn't boot 3.16-rc1. So now I'd say the breaking change is between 3.15.10 and 3.16-rc2. I see many changes related to ACPICA have been introduced. I'll try one more day with 3.15.10 to confirm that it's stable. Not sure how to proceed from there, I could try and bisect individual commits ? But that will likely take a month (one day per step). Also compiling the kernel seems to take 2 hours even on my modern machine.

I started bisecting between 3.15.10 and 3.16-rc2
There are about 13 steps. Here's the result so far:

git bisect start
# good: [f35b5e46feabab668a44df5b33f3558629f94dfc] Linux 3.15.10
git bisect good f35b5e46feabab668a44df5b33f3558629f94dfc
# bad: [a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee] Linux 3.16-rc2
git bisect bad a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [d09cc3659db494aca4b3bb2393c533fb4946b794] Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad d09cc3659db494aca4b3bb2393c533fb4946b794

any chance to get the first bad commit?

Zhang Rui, I'm still bisecting... As stated above, the laptop needs to be used actively for at least 3 hours before the bug can be reproduced. Which means I'm roughly doing one step a day. Here is the current updated log:

git bisect start
# good: [f35b5e46feabab668a44df5b33f3558629f94dfc] Linux 3.15.10
git bisect good f35b5e46feabab668a44df5b33f3558629f94dfc
# bad: [a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee] Linux 3.16-rc2
git bisect bad a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# bad: [d09cc3659db494aca4b3bb2393c533fb4946b794] Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad d09cc3659db494aca4b3bb2393c533fb4946b794
# bad: [5142c33ed86acbcef5c63a63d2b7384b9210d39f] Merge tag 'staging-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging into next
git bisect bad 5142c33ed86acbcef5c63a63d2b7384b9210d39f
# bad: [4046136afbd1038d776bad9c59e1e4cca78186fb] Merge tag 'char-misc-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc into next
git bisect bad 4046136afbd1038d776bad9c59e1e4cca78186fb
# good: [825f4e0271b0de3f7f31d963dcdaa0056fe9b73a] Merge tag 'soc-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc into next
git bisect good 825f4e0271b0de3f7f31d963dcdaa0056fe9b73a

9 steps to go.

To be clearer, here is how I test:
1. compile the kernel (takes about 1-2 hours)
2. boot the kernel
3. use the computer from 10 am until around 1 pm
4. suspend to ram
5. after lunch, continue using the computer until 6-7 pm
6. suspend to ram
For bad commits, the computer will hang either at step 4. or 6. Suspending the computer earlier than 2 hours usually works correctly.
For good commits, it will always suspend properly. But to be sure I prefer to let it run the whole day to avoid false positives, as it is still not clear what exactly is causing the freezing to happen.

Created attachment 157201 [details]
screen hang at shutdown
I've attached a screenshot of the hung system after shutdown.

Roberto, I'd suggest raising the shutdown issue as a separate ticket, as I believe it is a completely different issue. I never had such shutdown issues with this laptop.

Strangely, the shutdown freeze occurs only after several hours of work, just like suspend-to-ram. This makes me think that they might be related. However, we will see once the s2ram problem is solved...

Next step done:
# good: [70bc6bb3f254c1cf605a30a2d5bb18eff90a9584] Merge tag 'zynq-dt-for-3.16' of git://git.xilinx.com/linux-xlnx into next/dt
git bisect good 70bc6bb3f254c1cf605a30a2d5bb18eff90a9584

I see some new people have joined us here. Welcome to the daily bisect ! Please have a seat. Here is today's update:
# good: [755a9ba7bf24a45b6dbf8bb15a5a56c8ed12461a] Merge tag 'dt-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc into next
git bisect good 755a9ba7bf24a45b6dbf8bb15a5a56c8ed12461a
there are about 7 steps to go. Stay tuned :-)

# good: [182434f748885e169554fba410aebfef6bdf21ed] Merge tag 'exynos-cpuidle' of http://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung into next/drivers
git bisect good 182434f748885e169554fba410aebfef6bdf21ed
Bisecting: 36 revisions left to test after this (roughly 6 steps)

# bad: [6a57bad6e78ba0355f0f6df8cca1f7df42b58bfd] Merge tag 'extcon-next-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon into char-misc-next
git bisect bad 6a57bad6e78ba0355f0f6df8cca1f7df42b58bfd

# good: [0604002cde72cd60a11013daf2d9f456d4895ce8] extcon: max14577: Use devm_extcon_dev_allocate for extcon_dev
git bisect good 0604002cde72cd60a11013daf2d9f456d4895ce8
Bisecting: 14 revisions left to test after this (roughly 4 steps)

Got "lucky" once more this evening with a bad one:
# bad: [86113500c060bccb0f08bdcadcecc0bd267fd25a] mei: make return values consistent across the driver
git bisect bad 86113500c060bccb0f08bdcadcecc0bd267fd25a
3 steps to go, yay!

Bisecting: 3 revisions left to test after this (roughly 2 steps)
# good: [a532bbedc85ff3b834ba81e49163a3f543be1775] mei: add function to check write queues
git bisect good a532bbedc85ff3b834ba81e49163a3f543be1775

# good: [e13fa90ce42d8e7ee501426ea414c8ae4a5366ef] mei: me: use runtime PG pm domain for non wakeable devices
git bisect good e13fa90ce42d8e7ee501426ea414c8ae4a5366ef

# bad: [61a1aea7c7cb40de071e202cfaa31fa2c1fca8ba] mei: me: bump hbm version to 1.1 to support power gating
git bisect bad 61a1aea7c7cb40de071e202cfaa31fa2c1fca8ba
Bisecting: 0 revisions left to test after this (roughly 0 steps)

# bad: [d2d56faebaed1dd9bc011fcceed7df6b1bea8fac] mei: txe: use runtime PG pm domain for non wakeable devices
git bisect bad d2d56faebaed1dd9bc011fcceed7df6b1bea8fac
# first bad commit: [d2d56faebaed1dd9bc011fcceed7df6b1bea8fac] mei: txe: use runtime PG pm domain for non wakeable devices

And, after one month of bisecting, ladies and gentlemen, here is the breaking commit:

commit d2d56faebaed1dd9bc011fcceed7df6b1bea8fac
Author: Alexander Usyskin <alexander.usyskin@intel.com>
Date: Tue Mar 18 22:52:06 2014 +0200

    mei: txe: use runtime PG pm domain for non wakeable devices

    For non wakeable devices we can't use pci runtime framework
    as we are not able to wakeup from D3 states.
    Instead we create new pg runtime domain
    that only drives TXE power gating protocol to reduce the
    power consumption.

    Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
    Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

:040000 040000 3a6e2f73692e0747af2148a103588367e9aca3e2 c9c5bfdce6535d7d714caf93c480e19e1709cadc M drivers

I will do another set of tests, one with that commit and one with the one before, to make sure that there was no false positive.

d2d56faebaed1dd9bc011fcceed7df6b1bea8fac confirmed broken.
Note that both times I got it hanging the screen was still on and showed the KDE lock screen (which isn't the case in all failures). I hope it's not yet another similar issue.
Next up:
1) Test the commit before that: e13fa90ce42d8e7ee501426ea414c8ae4a5366ef and make sure s2ram doesn't hang over a span of several days (in progress)
2) Test with 3.17.4 to see whether the bug was fixed
3) Test with 3.17.4 + reverted d2d56faebaed1dd9bc011fcceed7df6b1bea8fac
4) Test with 3.18-rc6
5) Test with 3.18-rc6 + reverted d2d56faebaed1dd9bc011fcceed7df6b1bea8fac
6) Bisect the broken commit itself, removing lines one by one...
In the meantime I hope a kernel dev could chime in with some ideas/theories as to why this would hang s2ram after 3 hours of laptop use (but not before), and not fail the blank tests as shown here https://bugzilla.kernel.org/show_bug.cgi?id=86241#c9. If needed I can also test patches.

Some update:
1) The commit before the breaking one, which is e13fa90ce42d8e7ee501426ea414c8ae4a5366ef, worked fine for me over several days: did not hang. This confirms that d2d56faebaed1dd9bc011fcceed7df6b1bea8fac is the breaking commit.
2) Test with 3.17.4 (in progress)
4) Test with 3.18-rc5: s2ram still hangs, so the problem was not fixed there

3.17.4 broken too. Next up: will try to revert the commit.

Bad news: 3.17.4 + reverted d2d56faebaed1dd9bc011fcceed7df6b1bea8fac is STILL broken. This means that d2d56faebaed1dd9bc011fcceed7df6b1bea8fac probably isn't the only commit that breaks suspend. It is likely that there is a bunch of commits after that which will trigger the issue. I have no idea how I can efficiently find out which they are. I could try redoing the full bisect while reverting d2d56faebaed1dd9bc011fcceed7df6b1bea8fac at every step... but this is going a bit too far. Any ideas what I could do or how I could debug this ? Can devs provide a patch that logs more information ?
The hardest part is that logs do not contain any useful/extra information, especially as the laptop crashes directly and log entries, if any, are probably not flushed to disk. There is no way I could hook the laptop to a serial console as suggested on some websites. Any suggestions ?

Okay, so that one commit at the beginning of the breaking change is related to Intel's MEI. And I noticed that "lsmod | grep mei" shows this:
mei_me 19527 0
mei 88055 1 mei_me
Next time I could try unloading that driver before sleep (not sure if possible or recommended though).

I have unloaded both "mei_me" and "mei" (which seem to be related to the AMT management service from Intel which I don't use and which doesn't seem to work anyway). After unloading these modules, I've been able to suspend properly for at least one day. I'll continue for one more day to make sure it's not a false positive. Roberto, could you try the same on your laptop ? (blacklist "mei" and "mei_me")

+Tomas Winkler for mei/amt

Can you please also append the output of
lspci -vnx 00:16.0
Thanks

The command didn't work with the 00:16.0 argument so I copied the matching part from the output:

00:16.0 0780: 8086:9c3a (rev 04)
Subsystem: 1028:060a
Flags: fast devsel, IRQ 16
Memory at f0519000 (64-bit, non-prefetchable) [size=32]
Capabilities: [50] Power Management version 3
Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+
Kernel modules: mei_me
00: 86 80 3a 9c 02 00 18 00 04 00 80 07 00 00 80 00
10: 04 90 51 f0 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 0a 06
30: 00 00 00 00 50 00 00 00 00 00 00 00 0b 01 00 00

And this is when not verbose:
00:16.0 Communication controller: Intel Corporation 8 Series HECI #0 (rev 04)

Note: I haven't found anywhere in the BIOS how to enable AMT and didn't find any docs about it for the Dell XPS 13. When I bought the laptop it came with Ubuntu so it is likely that AMT was already permanently disabled by default.
I read somewhere online that disabling AMT was a permanent operation. Not sure if this could affect the module somehow.

Holy shit, I confirm it was that fucking mei module! By blacklisting mei+mei_me I can finally suspend AND shutdown properly! One year of trouble on a 1300+ euros laptop for a useless module, WTF intel!! BTW, a huge thanks to Vincent: excellent work! thank you!!

re-assign to mei experts.

(In reply to Vincent Petry from comment #56)
> The command didn't work with the 00:16.0 argument so I copied the matching
> part from the output:
>
> 00:16.0 0780: 8086:9c3a (rev 04)
> Subsystem: 1028:060a
> Flags: fast devsel, IRQ 16
> Memory at f0519000 (64-bit, non-prefetchable) [size=32]
> Capabilities: [50] Power Management version 3
> Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+
> Kernel modules: mei_me
> 00: 86 80 3a 9c 02 00 18 00 04 00 80 07 00 00 80 00
> 10: 04 90 51 f0 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 28 10 0a 06
> 30: 00 00 00 00 50 00 00 00 00 00 00 00 0b 01 00 00
>
>
> And this is when not verbose:
> 00:16.0 Communication controller: Intel Corporation 8 Series HECI #0 (rev 04)

Thanks for the output and for the hard work on bisecting. I will try to find a similar product to reproduce on, though I'll be glad if you can help me nail this issue, as I do not have immediate access to all possible Intel based laptops. There might also be some BIOS incompatibility. The BIOS is usually not finalized by Intel and there might be inconsistencies.

To focus on the issue and your bisection finding, can you please check if runtime pm is enabled for this device.
cat /sys/devices/pci0000\:00/0000\:00\:16.0/power/runtime_status
cat /sys/devices/pci0000\:00/0000\:00\:16.0/power/runtime_enabled

To get verbose output we can enable dynamic debug logs.
cat /etc/modprobe.d/mei.conf
options mei dyndbg="+pltf"
options mei_me dyndbg="+pltf"

Maybe this will give us some more hints, since I do not see anything suspicious in the log you've attached.

>
> Note: I haven't found anywhere in the BIOS how to enable AMT and didn't find
> any docs about it for the Dell XPS 13. When I bought the laptop it came with
> Ubuntu so it is likely that AMT was already permanently disabled by default.
> I read somewhere online that disabling AMT was a permanent operation.
> Not sure if this could affect the module somehow.

AMT is only available on vPro machines, and the MEI driver also has uses beyond AMT, so this is not the issue here. For your personal use you can also blacklist the module, though I'll be glad if you can help me validate a possible fix.

Created attachment 159581 [details]
mei dmesg debug message from real s2ram
I've attached mei-debug.txt which contains the dmesg journal from before the suspend and a few lines after it (which unfortunately contain nothing about mei).
And here the result from sys before the suspend:
% cat /sys/devices/pci0000\:00/0000\:00\:16.0/power/runtime_status
suspended
% cat /sys/devices/pci0000\:00/0000\:00\:16.0/power/runtime_enabled
enabled
After booting the computer the states are the same.
So still no clue why this only hangs after ~3 hours of laptop usage.
I hope you'll find something :-)
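
To capture those two values before every suspend, the two cat commands above could be wrapped roughly as follows. This is only a sketch: the `PCI_SYSFS` variable and the function name are made up for this example, and runtime_enabled may not be exposed on every kernel configuration.

```shell
#!/bin/sh
# Sketch only: print the runtime PM state of one PCI device on one line.
# PCI_SYSFS is a hypothetical override so a scratch tree can be used instead.
PCI_SYSFS="${PCI_SYSFS:-/sys/devices/pci0000:00}"

runtime_pm_state() {
    dev="$1"                       # for example 0000:00:16.0 (the MEI device)
    dir="$PCI_SYSFS/$dev/power"
    # Read both attributes and print them in a grep-friendly form.
    printf '%s status=%s enabled=%s\n' "$dev" \
        "$(cat "$dir/runtime_status")" \
        "$(cat "$dir/runtime_enabled")"
}
```

Calling `runtime_pm_state 0000:00:16.0` right before each suspend and appending the line to a file on disk would leave a record even if the machine hangs afterwards.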
Created attachment 159591 [details]
mei dmesg debug message from successful s2ram
And here for reference the dmesg of a SUCCESSFUL suspend (from when laptop usage time is just a few minutes). So here you can see what is supposed to happen but doesn't happen in the other case.
Just to avoid confusion:
- the fail log is here: https://bugzilla.kernel.org/attachment.cgi?id=159581
  There is actually a successful suspend in the middle of that log, it was from a few hours before when the threshold time wasn't reached yet. The actual failing suspend is obviously from the bottom of that log when it hung.
- the success log is here (suspending a few minutes after boot): https://bugzilla.kernel.org/attachment.cgi?id=159591
Hope it helps.

Created attachment 174711 [details]
Disable mei on XPS13 9333
Hi,
would preventing the module from loading automatically on the XPS13 9333 be an acceptable (and temporary) solution?
I confirm that either mei or mei_me is causing issues (I don't know exactly which one, I haven't done any tests). Ever since I blacklisted them, I haven't had a single failed suspend.
Since a few months have passed since this bug was reported, wouldn't it be wise to simply disable it?
Something like the patch here attached should work.
Gabriele
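
For reference, the user-side workaround mentioned in this thread (blacklisting both modules) amounts to a two-line modprobe.d file. A minimal sketch follows; the `MODPROBE_DIR` variable, the function name and the file name blacklist-mei.conf are all made up for this example, and writing to the real /etc/modprobe.d needs root.

```shell
#!/bin/sh
# Sketch only: write a modprobe.d snippet that blacklists mei and mei_me.
# MODPROBE_DIR is a hypothetical override so the function can be tried on a
# scratch directory before touching /etc/modprobe.d.
MODPROBE_DIR="${MODPROBE_DIR:-/etc/modprobe.d}"

write_mei_blacklist() {
    conf="$MODPROBE_DIR/blacklist-mei.conf"   # hypothetical file name
    # printf reuses the format once per argument, giving one line per module.
    printf 'blacklist %s\n' mei mei_me > "$conf"
}
```

Note that a blacklist entry only stops automatic loading by alias; the modules can still be loaded explicitly, so `lsmod | grep mei` after a reboot is a worthwhile check.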
Sorry, this totally fell off my radar, I'm looking into the logs.
Tomas

Looks like we are hitting an unexpected interrupt during the suspend flow, which we interpret as a request for reset due to the register setting. Need to find out why we are hitting an interrupt here.

Dec 03 14:30:02 vvortex.ttv kernel: [6603] mei_stop:289: mei_me 0000:00:16.0: stopping the device.
Dec 03 14:30:02 vvortex.ttv kernel: [6603] mei_reset:119: mei_me 0000:00:16.0: remove iamthif and wd from the file list.
Dec 03 14:30:02 vvortex.ttv kernel: [6603] mei_reset:136: mei_me 0000:00:16.0: powering down: end of reset
Dec 03 14:30:02 vvortex.ttv kernel: [476] mei_me_irq_thread_handler:643: mei_me 0000:00:16.0: function called after ISR to handle the interrupt processing.
Dec 03 14:30:02 vvortex.ttv kernel: mei_me 0000:00:16.0: FW not ready: resetting.
Dec 03 14:30:02 vvortex.ttv kernel: [476] mei_me_irq_thread_handler:708: mei_me 0000:00:16.0: interrupt thread end ret = 0
Dec 03 14:30:02 vvortex.ttv kernel: [175] mei_reset:119: mei_me 0000:00:16.0: remove iamthif and wd from the file list.
Dec 03 14:30:02 vvortex.ttv kernel: PM: suspend of devices complete after 594.944 msecs
Dec 03 14:30:02 vvortex.ttv kernel: PM: late suspend of devices complete after 18.910 msecs
Dec 03 14:30:02 vvortex.ttv kernel: ehci-pci 0000:00:1d.0: System wakeup enabled by ACPI

Currently I suspect this is a FW/BIOS issue and I'm checking the bug database to see if this was perhaps fixed. I believe that the FW version is pretty old on that machine.

Thinkpad yoga has a similar problem: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770397

Created attachment 179251 [details]
suspend issue possible fix
I'll be glad if someone can check if the attached patch fixes the issue. Thanks

I've been able to successfully suspend my laptop with no issues twice already. The first time after it had been up for 10 hours, the second time for 4 hours (no cold boot in between). So it seems to work, but it's too soon to be sure. I'll report any issue that may arise in the next few days.

May I add you to the Tested-by: ?

Sure. Just for the record, everything is still working fine.

Not experiencing the problem on 3.16.0-41 since I blacklisted the "mei" family. Thanks to those who have spent a lot of time on testing, especially Vincent Petry.

commit 3dc196eae1db548f05e53e5875ff87b8ff79f249
Author: Alexander Usyskin <alexander.usyskin@intel.com>
Date: Sat Jun 13 08:51:17 2015 +0300

    mei: me: wait for power gating exit confirmation

    Fix the hbm power gating state machine so it will wait till it receives
    confirmation interrupt for the PG_ISOLATION_EXIT message.

    In process of the suspend flow the devices first have to exit from the
    power gating state (runtime pm resume).
    If we do not handle the confirmation interrupt after sending
    PG_ISOLATION_EXIT message, we may receive it already after the suspend
    flow has changed the device state and interrupt will be interpreted as a
    spurious event, consequently link reset will be invoked which will
    prevent the device from completing the suspend flow

    kernel: [6603] mei_reset:136: mei_me 0000:00:16.0: powering down: end of reset
    kernel: [476] mei_me_irq_thread_handler:643: mei_me 0000:00:16.0: function called after ISR to handle the interrupt processing.
    kernel: mei_me 0000:00:16.0: FW not ready: resetting

    Cc: <stable@vger.kernel.org> #3.18+
    Cc: Gabriele Mazzotta <gabriele.mzt@gmail.com>
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=86241
    Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770397
    Tested-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
    Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
    Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Can we close this issue?

I don't have permission to change the status of the bug

I have been getting similar symptoms again with Kernel 4.6.3-1-default. Will need more time to gather information. My previous kernel before the update that made this happen again was 4.6.0-1, all on openSUSE Tumbleweed. Would be good to know if anyone else with a Dell XPS 13 is seeing similar symptoms with 4.6.3.

Dear All, the solution has been posted many times in many places. You have to blacklist the modules mei and mei_me, as simple as:
sudo tee /etc/modprobe.d/blacklist.suspend-bug.conf << EOF
blacklist mei
blacklist mei_me
EOF

Vincent, I haven't been using 4.6.3 that much, but I've tried to suspend my laptop running 4.6.3 after more than 5 hours of uptime and it didn't hang.

Here (XPS 9343 i5-5200U w/ kernel 4.20.12) it still happens sometimes (about one in every ten suspends): I suspend the laptop, then I try to resume it but:
- black screen
- keyboard unresponsive
- Ethernet LED lights up
Now, I have blacklisted the two `mei` modules in order to see if they are really the problem!

UPDATE: after one week of daily laptop use, that is one suspend and resume per day, the issue appeared again, EVEN THOUGH I blacklisted the `mei` and `mei_me` modules. Please reopen the bug.