Bug 194771

Summary: coretemp initialization gets stuck
Product: Drivers
Reporter: Imre Deak (imre.deak)
Component: Hardware Monitoring
Assignee: Jean Delvare (jdelvare)
Status: RESOLVED CODE_FIX
Severity: normal
CC: fenghua.yu, linux, regressions, rouven+kbugzilla, tglx
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.10.0
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg log with failed suspend
4.11-rc1 dmesg
tlp configuration
dmesg-disabled-tlp

Description Imre Deak 2017-03-03 10:07:52 UTC
Created attachment 255063 [details]
dmesg log with failed suspend

On an Intel KBL machine, suspend fails with the following stack trace. I attached the dmesg log.

[  244.186724] PM: Preparing system for sleep (mem)
[  244.190427] Freezing user space processes ...
[  264.199520] Freezing of tasks failed after 20.009 seconds (1 tasks refusing to freeze, wq_busy=0):
[  264.199722] systemd-udevd   D    0   319    281 0x00000004
[  264.199785] Call Trace:
[  264.199819]  __schedule+0x30e/0xbd0
[  264.199848]  schedule+0x3b/0x90
[  264.199867]  schedule_timeout+0x23b/0x490
[  264.199886]  ? _raw_spin_unlock_irq+0x27/0x50
[  264.199904]  ? __this_cpu_preempt_check+0x13/0x20
[  264.199923]  ? trace_hardirqs_on_caller+0xe7/0x200
[  264.199947]  wait_for_common+0x11a/0x1d0
[  264.200075]  ? wake_up_q+0x70/0x70
[  264.200100]  wait_for_completion+0x18/0x20
[  264.200118]  cpuhp_issue_call+0x9b/0xd0
[  264.200137]  __cpuhp_setup_state+0xe9/0x170
[  264.200159]  ? coretemp_remove+0x60/0x60 [coretemp]
[  264.200174]  ? 0xffffffffa0275000
[  264.200197]  coretemp_init+0x8b/0x1000 [coretemp]
[  264.200217]  do_one_initcall+0x3f/0x170
[  264.200236]  ? rcu_read_lock_sched_held+0x75/0x80
[  264.200255]  ? kmem_cache_alloc_trace+0x274/0x2e0
[  264.200269]  ? do_init_module+0x22/0x1fb
[  264.200311]  load_module+0x2091/0x2410
[  264.200326]  ? symbol_put_addr+0x60/0x60
[  264.200356]  ? kernel_read_file+0x105/0x190
[  264.200389]  SyS_finit_module+0xbc/0xf0
[  264.200424]  entry_SYSCALL_64_fastpath+0x1c/0xb1
Comment 1 Imre Deak 2017-03-03 10:09:36 UTC
CC'ing more maintainers.
Comment 2 Guenter Roeck 2017-03-03 17:53:53 UTC
Maybe related to hotplug state machine. Copying tglx@ for input.
Comment 3 Rouven Czerwinski 2017-03-06 14:12:24 UTC
Created attachment 255099 [details]
4.11-rc1 dmesg

4.11-rc1 hung systemd-udev dmesg
Comment 4 Rouven Czerwinski 2017-03-06 14:14:01 UTC
The dmesg attached above is from this morning's drm-tip kernel; I am building 4.11-rc1 at the moment to verify.
Comment 5 Rouven Czerwinski 2017-03-06 15:04:41 UTC
Seeing the same error under 4.11-rc1, but not on every boot.
Comment 6 Rouven Czerwinski 2017-03-06 21:01:38 UTC
Created attachment 255107 [details]
tlp configuration

Disabling tlp via systemctl disable tlp seems to fix the issue for me (3 successful boots in a row); attaching my /etc/tlp/default as a reference.
Comment 7 Rouven Czerwinski 2017-03-16 09:28:59 UTC
Created attachment 255285 [details]
dmesg-disabled-tlp

The bug still happens with tlp disabled, although much less often.
Comment 8 Rouven Czerwinski 2017-03-16 09:31:31 UTC
Adding Thorsten Leemhuis for potential regression tracking.
Comment 9 The Linux kernel's regression tracker (Thorsten Leemhuis) 2017-03-18 15:00:40 UTC
(In reply to Rouven Czerwinski from comment #8)
> Adding Thorsten Leemhuis for potential regression tracking.

Thx for this, but I'm sorry, it won't make it to the regression reports. I only do them in my spare time and the time I have to compile those is quite limited. So to get at least something done I'm focussing on mainline regressions in the current development version (4.11); it seems this regression was introduced in 4.10, so it won't make the list I'm currently compiling, as that is for 4.11 only.

Sorry to disappoint you. I hope to have things a bit more optimized in the future to keep track of regressions in older releases as well. But one step at a time.
Comment 10 Rouven Czerwinski 2017-03-23 10:43:12 UTC
Commit dc434e056fe1dada20df7ba07f32739d3a701adf from 4.11-rc3 seems to have fixed the problem for me.
Comment 11 Jean Delvare 2017-03-28 12:05:20 UTC
Imre, can you please check if applying commit dc434e056fe1dada20df7ba07f32739d3a701adf on top of your 4.10 kernel also solves the problem for you?
Comment 12 Imre Deak 2017-03-28 14:56:11 UTC
(In reply to Jean Delvare from comment #11)
> Imre, can you please check if applying commit
> dc434e056fe1dada20df7ba07f32739d3a701adf on top of your 4.10 kernel also
> solves the problem for you?

The problem doesn't seem to happen any more on the KBL machine I reported with a 4.11-rc4 based drm-tip kernel (which then has dc434e056fe1da). If it's also ok with you we could close this bug based on this result. Getting access to that machine is cumbersome (it's in our CI farm).
Comment 13 Jean Delvare 2017-03-29 07:24:53 UTC
If 4.11-rc4 works for you then we can indeed close this as fixed.

If it could be confirmed that commit dc434e056fe1dada20df7ba07f32739d3a701adf fixes it, then that commit should be considered for stable and longterm kernel branches.
Comment 14 Jean Delvare 2017-05-04 07:04:52 UTC
For the record, commit dc434e056fe1 ("cpu/hotplug: Serialize callback invocations proper") was included in stable kernel 4.10.14.
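
As a rough illustration only, the version information reported in this thread (fix present from 4.11-rc3 in mainline and from 4.10.14 in stable) could be turned into a quick check of a kernel release string. The function name `has_hotplug_fix` is hypothetical, and the sketch assumes release strings shaped like the output of `uname -r`. It returns None for releases older than 4.10, since this thread does not say whether they are affected or patched.

```python
import re

def has_hotplug_fix(release):
    """Guess whether a kernel release contains commit dc434e056fe1.

    Based only on the versions named in this bug report:
    4.11-rc3 (mainline) and 4.10.14 (stable).
    Returns True/False, or None when the thread gives no information.
    """
    m = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?(?:-rc(\d+))?", release)
    if not m:
        return None  # unrecognized release string
    major, minor = int(m.group(1)), int(m.group(2))
    patch = int(m.group(3) or 0)
    rc = m.group(4)

    if (major, minor) < (4, 10):
        return None  # not covered by this report
    if (major, minor) == (4, 10):
        return patch >= 14  # stable backport landed in 4.10.14
    if (major, minor) == (4, 11) and rc is not None:
        return int(rc) >= 3  # reported as part of 4.11-rc3
    return True  # 4.11 final and later


if __name__ == "__main__":
    for rel in ("4.10.0", "4.10.14", "4.11.0-rc1", "4.11-rc4", "4.9.20"):
        print(rel, has_hotplug_fix(rel))
```

Distribution kernels may carry the fix as a backport under an older version number, so a result of False from this sketch is only a hint, not proof that the fix is missing.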