Bug 156441 - Bisected: Laptop cannot suspend after update to kernel 4.5.3 (up to 4.7.2) with powersaving udev rule - Go direct_complete if driver has no callbacks - AMD Samsung 305
Status: RESOLVED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Zhang Rui
Reported: 2016-09-09 16:46 UTC by n0000b.n000b
Modified: 2018-03-11 17:01 UTC
Kernel Version: 4.5.3 up to 4.7.2
Regression: Yes

Attachments
dmesg of multiple suspend/cycles with 4.4.9 (160.60 KB, text/plain)
2016-09-09 16:46 UTC, n0000b.n000b
journalctl -k of suspend/resume cycle until freeze in 4.5.3 (127.16 KB, text/plain)
2016-09-09 16:48 UTC, n0000b.n000b
journalctl of suspend/resume cycles until freeze in 4.7.2 (401.95 KB, text/plain)
2016-09-09 16:49 UTC, n0000b.n000b
lsmod (4.48 KB, text/plain)
2016-10-25 06:32 UTC, n0000b.n000b
journalctl -b after echo freezer > /sys/power/pm_test (397.69 KB, text/plain)
2016-12-19 23:48 UTC, n0000b.n000b
journalctl -b after echo devices > /sys/power/pm_test (465.51 KB, text/plain)
2016-12-19 23:49 UTC, n0000b.n000b
journalctl -b after echo platform > /sys/power/pm_test (441.85 KB, text/plain)
2016-12-19 23:50 UTC, n0000b.n000b
journalctl -b after echo processor > /sys/power/pm_test (395.77 KB, text/plain)
2016-12-19 23:51 UTC, n0000b.n000b
journalctl -b after echo core > /sys/power/pm_test (400.64 KB, text/plain)
2016-12-19 23:51 UTC, n0000b.n000b
debug patch (945 bytes, patch)
2017-01-09 09:17 UTC, Zhang Rui
dmesg after multiple suspend with patched kernel (241.89 KB, text/plain)
2017-01-10 05:36 UTC, n0000b.n000b
output of lspci (2.93 KB, text/plain)
2017-01-11 22:41 UTC, n0000b.n000b
lspci -xv (15.95 KB, text/plain)
2017-01-12 14:27 UTC, n0000b.n000b
tree /sys/bus/pci/devices/0000\:00\:14.4 (1.75 KB, text/plain)
2017-01-19 05:06 UTC, n0000b.n000b
dmesg output after various suspend to idle cycles (115.14 KB, text/plain)
2017-01-20 00:11 UTC, n0000b.n000b

Description n0000b.n000b 2016-09-09 16:46:37 UTC
Created attachment 232871 [details]
dmesg of multiple suspend/cycles with 4.4.9

I cannot suspend multiple times on my laptop (Samsung 305V4A) after upgrading from kernel 4.4.9 to 4.5.3 (also tested with 4.7.2) when I have some custom power-saving udev rules.

Udev rule:

ACTION=="add", SUBSYSTEM=="pci", ATTR{power/control}="auto"
Comment 1 n0000b.n000b 2016-09-09 16:48:26 UTC
Created attachment 232881 [details]
journalctl -k of suspend/resume cycle until freeze in 4.5.3
Comment 2 n0000b.n000b 2016-09-09 16:49:15 UTC
Created attachment 232891 [details]
journalctl of suspend/resume cycles until freeze in 4.7.2
Comment 3 n0000b.n000b 2016-09-09 16:49:48 UTC
bisection of kernel:

# bad: [fbc310e9c553412ebe72c14e5a7bb9807a3d1109] Linux 4.5.3
# good: [1a1a512b983108015ced1e7a7c7775cfeec42d8c] Linux 4.4.9
git bisect start 'v4.5.3' 'v4.4.9'
# good: [afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc] Linux 4.4
git bisect good afd2ff9b7e1b367172f18ba7f693dfb62bdcb2dc
# good: [e535d74bc50df2357d3253f8f3ca48c66d0d892a] Merge tag 'docs-4.5' of git://git.lwn.net/linux
git bisect good e535d74bc50df2357d3253f8f3ca48c66d0d892a
# bad: [d43421565bf0510d35e6a39ebf96586ad486f3aa] Merge tag 'pci-v4.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect bad d43421565bf0510d35e6a39ebf96586ad486f3aa
# skip: [984065055e6e39f8dd812529e11922374bd39352] Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
git bisect skip 984065055e6e39f8dd812529e11922374bd39352
# good: [6c03a6bd0dd836db388feb28fda1868037491ee7] drm/i915: Don't register CRT connector when it's fused off
git bisect good 6c03a6bd0dd836db388feb28fda1868037491ee7
# good: [d0710abbcd88b1ff17760e97d74a673e67b49ea1] drm/i915: Set the map-and-fenceable flag for preallocated objects
git bisect good d0710abbcd88b1ff17760e97d74a673e67b49ea1
# good: [2d663b55816e5c1d211a77fff90687053fe78aac] Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
git bisect good 2d663b55816e5c1d211a77fff90687053fe78aac
# good: [1305eda751d7df3069b1fcb6f62036185acd24a0] Merge tag 'armsoc-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good 1305eda751d7df3069b1fcb6f62036185acd24a0
# good: [f0dba77620368d154bff9542675c6844e4678761] Merge tag 'davinci-for-v4.5/dts' of git://git.kernel.org/pub/scm/linux/kernel/git/nsekhar/linux-davinci into next/dt
git bisect good f0dba77620368d154bff9542675c6844e4678761
# good: [f9cd69fe5eb6347b4de56458d0378bc0fa44bce9] Merge tag 'armsoc-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good f9cd69fe5eb6347b4de56458d0378bc0fa44bce9
# bad: [30f05309bde49295e02e45c7e615f73aa4e0ccc2] Merge tag 'pm+acpi-4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad 30f05309bde49295e02e45c7e615f73aa4e0ccc2
# good: [ce96cb7386a57b270648f9ba6003065329a26bd3] Merge tag 'samsung-clk-exynos4-4.5' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into next/drivers
git bisect good ce96cb7386a57b270648f9ba6003065329a26bd3
# bad: [f11aef69b235bc30c323776d75ac23b43aac45bb] Merge branch 'pm-cpuidle'
git bisect bad f11aef69b235bc30c323776d75ac23b43aac45bb
# bad: [6efd3f8cde1d6acc20a715ac6ea17e01421742df] Merge branch 'pm-core'
git bisect bad 6efd3f8cde1d6acc20a715ac6ea17e01421742df
# good: [a72aea722f1b43442c9e219de824d5975dcdaa61] Merge branches 'acpica', 'acpi-video' and 'acpi-fan'
git bisect good a72aea722f1b43442c9e219de824d5975dcdaa61
# bad: [aa8e54b559479d0cb7eb632ba443b8cacd20cd4b] PM / sleep: Go direct_complete if driver has no callbacks
git bisect bad aa8e54b559479d0cb7eb632ba443b8cacd20cd4b
# good: [6b9cb42752dafba3761dde0002ca58ca518b6311] device core: add device_is_bound()
git bisect good 6b9cb42752dafba3761dde0002ca58ca518b6311
# good: [989561de9b5112999475b406557d9c7e9e59c041] PM / Domains: add setter for dev.pm_domain
git bisect good 989561de9b5112999475b406557d9c7e9e59c041
# first bad commit: [aa8e54b559479d0cb7eb632ba443b8cacd20cd4b] PM / sleep: Go direct_complete if driver has no callbacks
Comment 4 Zhang Rui 2016-09-26 05:47:19 UTC
does the problem go away if you revert this commit aa8e54b559479d0cb7eb632ba443b8cacd20cd4b?
Comment 5 n0000b.n000b 2016-09-27 00:38:24 UTC
After reverting the commit on kernel 4.5.3, the problem seems to go away.
Comment 6 Zhang Rui 2016-09-27 01:13:12 UTC
This does indeed look like a regression to me.
I will ping the patch author to look at this issue.

BTW, can you please confirm the problem still exists in the latest upstream kernel?
Comment 7 n0000b.n000b 2016-09-27 07:01:29 UTC
The problem still persists with Fedora kernels 4.8.0-0.rc8.git0.1 and 4.7.4.
Comment 8 Tomeu Vizoso 2016-09-28 07:08:31 UTC
Don't have access to the HW, so we should get some debug logging. Maybe work with the Fedora kernel team to provide those?
Comment 9 Zhang Rui 2016-10-17 23:04:10 UTC
Tomeu, what should we do for this bug?
If you want something to be tested, please ask n0000b.n000b@gmail.com directly :)
Comment 10 Len Brown 2016-10-25 00:00:36 UTC
please supply the output from lsmod

it appears that a driver needs to be updated to suspend properly.
Comment 11 n0000b.n000b 2016-10-25 06:32:04 UTC
Created attachment 242631 [details]
lsmod

lsmod before suspend and freeze
Comment 12 Zhang Rui 2016-11-14 04:58:35 UTC
can you please check if the problem still exists with the latest kernel?
Comment 13 n0000b.n000b 2016-11-15 02:45:59 UTC
(In reply to Zhang Rui from comment #12)
> can you please check if the problem still exists with the latest kernel?

The problem still persists in Fedora 23 with kernel 4.9-rc5.
Comment 14 Rafael J. Wysocki 2016-11-28 22:33:20 UTC
Please try the test modes of system suspend (as described in Documentation/power/basic-pm-debugging.txt) and see if you can reproduce the problem in any of them.

It also would be good to check if unloading any driver modules before suspending the system makes the problem go away.
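
As a rough sketch of the test-mode procedure from Documentation/power/basic-pm-debugging.txt, the helper below selects one test level, triggers a suspend to RAM, and then switches test mode off again. The function name run_pm_test and the PM_ROOT variable are hypothetical and introduced only for illustration; the real interface is limited to the pm_test and state files under /sys/power. It needs to be run as root.

```shell
#!/bin/sh
# PM_ROOT is a hypothetical override so the helper can be pointed at a
# scratch directory; on a real system it is /sys/power.
PM_ROOT="${PM_ROOT:-/sys/power}"

# run_pm_test MODE: select one suspend test level, start a suspend to RAM,
# then switch test mode off again. MODE must be a documented level.
run_pm_test() {
    case "$1" in
        freezer|devices|platform|processors|core) ;;
        *) echo "unknown pm_test mode: $1" >&2; return 1 ;;
    esac
    echo "$1" > "$PM_ROOT/pm_test" || return 1
    echo mem > "$PM_ROOT/state"
    rc=$?
    # Always leave test mode, even if the suspend attempt failed.
    echo none > "$PM_ROOT/pm_test"
    return "$rc"
}
```

Calling it in the order freezer, devices, platform, processors, core walks from the least to the most invasive level; checking dmesg after each call shows how far the suspend sequence got.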
Comment 15 Zhang Rui 2016-12-19 05:49:29 UTC
ping...
Comment 16 n0000b.n000b 2016-12-19 23:43:32 UTC
Testing in the different modes, I can suspend and resume multiple times (tested 5 times each) without hanging the laptop. Also, when testing with the options 'devices', 'platform', 'processors' and 'core', the screen goes blank and doesn't come back.

When unloading driver modules, I could not find one that makes the problem go away.
Comment 17 n0000b.n000b 2016-12-19 23:48:51 UTC
Created attachment 248041 [details]
journalctl -b after echo freezer > /sys/power/pm_test
Comment 18 n0000b.n000b 2016-12-19 23:49:54 UTC
Created attachment 248051 [details]
journalctl -b after echo devices > /sys/power/pm_test

screen goes blank after suspend
Comment 19 n0000b.n000b 2016-12-19 23:50:23 UTC
Created attachment 248061 [details]
journalctl -b after echo platform > /sys/power/pm_test
Comment 20 n0000b.n000b 2016-12-19 23:51:07 UTC
Created attachment 248071 [details]
journalctl -b after echo processor > /sys/power/pm_test
Comment 21 n0000b.n000b 2016-12-19 23:51:49 UTC
Created attachment 248081 [details]
journalctl -b after echo core > /sys/power/pm_test
Comment 22 Zhang Rui 2017-01-09 09:17:12 UTC
Created attachment 250841 [details]
debug patch

please apply this patch with latest kernel, and attach the dmesg output after multiple suspends. (Note that the suspend failure should not be reproducible with this debug patch applied)
Comment 23 n0000b.n000b 2017-01-10 05:36:17 UTC
Created attachment 251071 [details]
dmesg after multiple suspend with patched kernel
Comment 24 Zhang Rui 2017-01-11 03:05:50 UTC
it seems that there are really a lot of devices impacted by this.

(In reply to n0000b.n000b from comment #0)
> Created attachment 232871 [details]
> dmesg of multiple suspend/cycles with 4.4.9
> 
> I cannot suspend multiple times on my laptop (Samsung 305V4A) after upgrading
> from kernel 4.4.9 to 4.5.3 (also tested with 4.7.2) when I have some custom
> power-saving udev rules.
>
When you say "Cannot suspend", what do you mean? Does the system freeze during suspend or during resume?
 
> Udev rule:
> 
> ACTION=="add", SUBSYSTEM=="pci", ATTR{power/control}="auto"

what if you disable this udev rule?
Comment 25 n0000b.n000b 2017-01-11 04:28:22 UTC
(In reply to Zhang Rui from comment #24)

> When you say "Cannot suspend", what do you mean? Does the system freeze
> during suspend or during resume?

I can suspend successfully 2 times; on the third attempt the system hangs during the suspend cycle and the LEDs don't turn off.

> what if you disable this udev rule?

If I disable the udev rule, I can suspend multiple times without problems (in all the attempts I've made).
Comment 26 Zhang Rui 2017-01-11 04:51:10 UTC
then, I think the problem can be reproduced if you remove the udev rule, and enable the runtime PM for PCI devices explicitly, using a script like
for device in $(ls /sys/bus/pci/devices)
do
   echo auto > /sys/bus/pci/devices/$device/power/control
done

If yes, then please
1. disable the device runtime PM for all the PCI devices, using
for device in $(ls /sys/bus/pci/devices)
do
   echo on > /sys/bus/pci/devices/$device/power/control
done
2. Then enable runtime PM for the devices one by one, and check which device's runtime PM causes the suspend hang.
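
The one-by-one procedure above could be scripted roughly as follows. This is only a sketch: the function names set_all_on and enable_only and the PCI_ROOT variable are hypothetical and not part of any existing tool. The only real interface used is the power/control attribute of each PCI device in sysfs. It needs to be run as root.

```shell
#!/bin/sh
# PCI_ROOT is a hypothetical override for experimenting against a scratch
# directory; on a real system it is /sys/bus/pci/devices.
PCI_ROOT="${PCI_ROOT:-/sys/bus/pci/devices}"

# Disable runtime PM ("on") for every PCI device that exposes power/control.
set_all_on() {
    for dev in "$PCI_ROOT"/*; do
        if [ -e "$dev/power/control" ]; then
            echo on > "$dev/power/control"
        fi
    done
}

# enable_only DEVICE: reset every device to "on", then allow runtime PM
# ("auto") for the single named device (a name as listed under PCI_ROOT).
enable_only() {
    set_all_on
    echo auto > "$PCI_ROOT/$1/power/control"
}
```

After each enable_only call, suspending at least three times in a row tests that one candidate; repeating this for every entry under /sys/bus/pci/devices narrows down which device triggers the hang.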
Comment 27 n0000b.n000b 2017-01-11 22:41:28 UTC
Created attachment 251251 [details]
output of lspci

The system hangs on the third suspend after running the following command:

echo auto > /sys/bus/pci/devices/0000\:00\:14.4/power/control

attached is the output of lspci on this machine
Comment 28 Zhang Rui 2017-01-12 09:29:40 UTC
and before aa8e54b559479d0cb7eb632ba443b8cacd20cd4b ("PM / sleep: Go direct_complete if driver has no callbacks"), suspend always works even if you set auto to 0000:00:14.4, right?

please attach the output of "lspci -xv" instead.
Comment 29 n0000b.n000b 2017-01-12 14:27:56 UTC
Created attachment 251351 [details]
lspci -xv

> and before aa8e54b559479d0cb7eb632ba443b8cacd20cd4b ("PM / sleep: Go
> direct_complete if driver has no callbacks"), suspend always works even if
> you set auto to 0000:00:14.4, right?

Yes, suspend always works.

attached is lspci -xv
Comment 30 Zhang Rui 2017-01-13 02:01:18 UTC
We need a PCI expert on this issue.
Comment 31 Zhang Rui 2017-01-19 03:28:49 UTC
Now we've confirmed that the problem can only be reproduced
1. with commit aa8e54b55947 ("PM / sleep: Go direct_complete if driver has no callbacks")
AND
2. with runtime PM for pci:0000:00:14.4 enabled.

The difference brought by commit aa8e54b55947 is that device->direct_complete flag is set. And the difference brought by runtime PM is that the device can be in runtime suspended state when the system suspends.
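
Whether the device is actually runtime suspended just before a system suspend can be checked from sysfs. A minimal sketch, assuming the standard power/control and power/runtime_status attributes are exposed for the device; the function name pm_state and the PCI_ROOT variable are made up for this example.

```shell
#!/bin/sh
# PCI_ROOT is a hypothetical override; on a real system it is
# /sys/bus/pci/devices.
PCI_ROOT="${PCI_ROOT:-/sys/bus/pci/devices}"

# pm_state DEVICE: print "<control> <runtime_status>" for one PCI device,
# for example "auto suspended" when runtime PM is allowed and the device
# is currently runtime suspended. Returns 1 if the attributes are missing.
pm_state() {
    dir="$PCI_ROOT/$1/power"
    [ -r "$dir/control" ] && [ -r "$dir/runtime_status" ] || return 1
    printf '%s %s\n' "$(cat "$dir/control")" "$(cat "$dir/runtime_status")"
}
```

If `pm_state 0000:00:14.4` reports the device as suspended right before a system suspend, the device is in the runtime suspended state that the second condition above makes possible.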
Comment 32 Zhang Rui 2017-01-19 03:33:41 UTC
please attach the output of "tree /sys/bus/pci/devices/0000\:00\:14.4".

please check if the problem can be reproduced with async PM disabled (echo 0 > /sys/power/pm_async)

please check if this is also a regression for suspend-to-idle (echo freeze > /sys/power/state)
Comment 33 n0000b.n000b 2017-01-19 05:06:42 UTC
Created attachment 252411 [details]
tree /sys/bus/pci/devices/0000\:00\:14.4

> please check if the problem can be reproduced with async PM disabled
> (echo 0 > /sys/power/pm_async)

The problem can be reproduced (the system freezes on the third suspend) with pm_async set to 0.

> please check if this is also a regression to suspend-to-idle (echo
> freeze > /sys/power/state)

I cannot test this at the moment because the screen doesn't come back after the first suspend-to-idle attempt.
Comment 34 n0000b.n000b 2017-01-20 00:11:21 UTC
Created attachment 252481 [details]
dmesg output after various suspend to idle cycles

suspend to idle seems to work, but the screen doesn't come back. Attached is the dmesg after various cycles (extracted via ssh)
Comment 35 n0000b.n000b 2017-02-17 02:23:07 UTC
ping?
Comment 36 Chen Yu 2017-03-20 15:51:56 UTC
I have a question about
commit aa8e54b55947 ("PM / sleep: Go direct_complete if driver has no callbacks").

The above commit checks whether the driver for device A has any PM callbacks; if not, device A is marked as go_direct_complete, so A's parent P will ignore A.
But what about A's children? If A's children have PM callbacks, will they be ignored as a result of this patch?

(The original patch that introduced go_direct_complete does mention A's children: if A and A's children are OK to remain runtime suspended, then prepare()
will return a non-zero value.)
Comment 37 n0000b.n000b 2017-04-06 04:23:00 UTC
problem is still present in fedora 25 with kernel 4.11.0-0.rc5.git0.1 from rawhide
Comment 38 n0000b.n000b 2017-06-20 08:34:32 UTC
system still hangs in the third suspend cycle with the udev rule in fedora 26 beta and kernel 4.12.0-0.rc5 from rawhide
Comment 39 Len Brown 2017-09-25 22:47:01 UTC
Can you test Linux-4.14-rc2 or newer?  It may contain a patch that fixes this.
Comment 40 n0000b.n000b 2017-09-30 04:27:22 UTC
tested kernel 4.14.0-0.rc2.git1.2.fc28.x86_64 in fedora 26, the bug is still present, the laptop hangs in the third suspend.
Comment 41 Rafael J. Wysocki 2017-12-18 23:49:19 UTC
Well, OK

Can you please test kernels that don't come from Fedora?  Like something you compiled yourself?

I'm asking, because I'm wondering if you can test patches.
Comment 42 n0000b.n000b 2017-12-19 21:52:30 UTC
Some updates: Now I'm running fedora 27 with rawhide kernels.

Since kernel version 4.15-rc1 I'm able to suspend more than 3 times, but after some days of use (and several suspend and resume cycles) the system hangs.

I'm testing the package 4.15.0-0.rc4.git0.1.fc28.x86_64 from Fedora. I've been able to suspend a third time, but I'm seeing some errors in the log:

[37312.538944] swiotlb_tbl_map_single: 6 callbacks suppressed
[37312.538954] radeon 0000:00:01.0: swiotlb buffer is full (sz: 2097152 bytes)
[37312.538957] swiotlb: coherent allocation failed for device 0000:00:01.0 size=2097152
[37312.538964] CPU: 1 PID: 10832 Comm: kworker/u8:25 Not tainted 4.15.0-0.rc4.git0.1.fc28.x86_64 #1
[37312.538967] Hardware name: SAMSUNG ELECTRONICS CO., LTD. 305V4A/305V5A/3415VA/305V4A/305V4A, BIOS 09PW.ME13.20121101.SKK 11/01/2012
[37312.538980] Workqueue: events_unbound async_run_entry_fn
[37312.538983] Call Trace:
[37312.538999]  dump_stack+0x5c/0x85
[37312.539006]  swiotlb_alloc_coherent+0xe0/0x150
[37312.539027]  ttm_dma_pool_get_pages+0x20b/0x5e0 [ttm]
[37312.539043]  ttm_dma_populate+0x24d/0x340 [ttm]
[37312.539055]  ttm_tt_bind+0x23/0x50 [ttm]
[37312.539070]  ttm_bo_handle_move_mem+0x5cd/0x600 [ttm]
[37312.539083]  ttm_bo_evict+0x147/0x310 [ttm]
[37312.539097]  ttm_mem_evict_first+0x15b/0x1d0 [ttm]
[37312.539109]  ttm_bo_force_list_clean+0x67/0x110 [ttm]
[37312.539180]  radeon_suspend_kms+0xb5/0x3b0 [radeon]
[37312.539189]  pci_pm_suspend+0x76/0x120
[37312.539194]  ? pci_pm_freeze+0xb0/0xb0
[37312.539198]  dpm_run_callback+0x4b/0x130
[37312.539203]  __device_suspend+0x116/0x420
[37312.539207]  async_suspend+0x1a/0x90
[37312.539213]  async_run_entry_fn+0x33/0x160
[37312.539218]  process_one_work+0x182/0x3a0
[37312.539223]  worker_thread+0x2e/0x380
[37312.539228]  ? process_one_work+0x3a0/0x3a0
[37312.539231]  kthread+0x111/0x130
[37312.539236]  ? kthread_create_worker_on_cpu+0x70/0x70
[37312.539242]  ret_from_fork+0x1f/0x30


swiotlb: coherent allocation failed for device 0000:02:00.0 size=2097152
[37313.853268] CPU: 1 PID: 10825 Comm: kworker/u8:18 Not tainted 4.15.0-0.rc4.git0.1.fc28.x86_64 #1
[37313.853271] Hardware name: SAMSUNG ELECTRONICS CO., LTD. 305V4A/305V5A/3415VA/305V4A/305V4A, BIOS 09PW.ME13.20121101.SKK 11/01/2012
[37313.853285] Workqueue: events_unbound async_run_entry_fn
[37313.853288] Call Trace:
[37313.853303]  dump_stack+0x5c/0x85
[37313.853310]  swiotlb_alloc_coherent+0xe0/0x150
[37313.853333]  ttm_dma_pool_get_pages+0x20b/0x5e0 [ttm]
[37313.853349]  ttm_dma_populate+0x24d/0x340 [ttm]
[37313.853362]  ttm_bo_move_memcpy+0x17f/0x600 [ttm]
[37313.853369]  ? acpi_os_release_object+0xa/0x10
[37313.853445]  radeon_bo_move+0x1a7/0x220 [radeon]
[37313.853460]  ttm_bo_handle_move_mem+0x2ae/0x600 [ttm]
[37313.853473]  ttm_bo_evict+0x147/0x310 [ttm]
[37313.853530]  ? radeon_pm_compute_clocks_dpm+0xf3/0x500 [radeon]
[37313.853551]  ? drm_kms_helper_poll_enable.part.4+0x50/0xb0 [drm_kms_helper]
[37313.853557]  ? find_next_iomem_res+0x33/0x100
[37313.853570]  ttm_mem_evict_first+0x15b/0x1d0 [ttm]
[37313.853582]  ttm_bo_force_list_clean+0x67/0x110 [ttm]
[37313.853622]  radeon_suspend_kms+0x112/0x3b0 [radeon]
[37313.853630]  pci_pm_suspend+0x76/0x120
[37313.853634]  ? pci_pm_freeze+0xb0/0xb0
[37313.853638]  dpm_run_callback+0x4b/0x130
[37313.853643]  __device_suspend+0x116/0x420
[37313.853648]  async_suspend+0x1a/0x90
[37313.853652]  async_run_entry_fn+0x33/0x160
[37313.853658]  process_one_work+0x182/0x3a0
[37313.853663]  worker_thread+0x2e/0x380
[37313.853668]  ? process_one_work+0x3a0/0x3a0
[37313.853671]  kthread+0x111/0x130
[37313.853675]  ? kthread_create_worker_on_cpu+0x70/0x70
[37313.853681]  ret_from_fork+0x1f/0x30


And yes I can test kernels out of Fedora and patches but at a slow pace
Comment 43 Zhang Rui 2017-12-20 01:44:06 UTC
What if you blacklist the radeon driver? You may get a VGA console only, but does suspend/resume work many times in that case?
Comment 44 n0000b.n000b 2017-12-20 01:58:04 UTC
I'm on kernel 4.15-rc4, and since rc1 I can suspend multiple times (I'm on the sixth time right now). I haven't tested whether this still works if I keep the laptop running for several days.
Comment 45 Zhang Rui 2018-01-15 03:42:08 UTC
good news.

Please confirm if the problem is gone in latest upstream kernel.
Comment 46 Zhang Rui 2018-01-29 07:07:42 UTC
Can you please confirm whether the problem is gone in the latest upstream kernel?
Comment 47 n0000b.n000b 2018-01-31 22:10:51 UTC
I've been using kernel 4.15-rc9 and I can suspend and resume multiple times. I will now test the final release and report back.
Comment 48 n0000b.n000b 2018-03-11 17:01:27 UTC
With recent kernels I can suspend and resume multiple times. Sometimes the session crashes, but the reported bug is solved. Thanks.
