Bug 194581 - [Bisected] Intel i7-M640: only one cpu core active after resume from suspend(cpufreq_online return -5)
Summary: [Bisected] Intel i7-M640: only one cpu core active after resume from suspend(...
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: Chen Yu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-13 17:49 UTC by Tomasz Maciej Nowak
Modified: 2017-03-25 14:13 UTC (History)
3 users (show)

See Also:
Kernel Version: 4.9
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
kernel log of affected system (63.60 KB, text/plain)
2017-02-13 17:49 UTC, Tomasz Maciej Nowak
Details
dmidecode output of affected system VPCZ13C5E (5.25 KB, text/plain)
2017-02-13 17:51 UTC, Tomasz Maciej Nowak
Details
sporadical backtrace of affected system VPCZ13C5E (74.79 KB, text/plain)
2017-02-25 12:39 UTC, Tomasz Maciej Nowak
Details
trace output of affected system VPCZ13C5E (580.45 KB, text/plain)
2017-03-06 11:32 UTC, Tomasz Maciej Nowak
Details
patch used to revert cpufreq.c changes beetwen 4.8 and 4.9 (7.02 KB, patch)
2017-03-13 20:26 UTC, Tomasz Maciej Nowak
Details | Diff
trace output of affected kernel version 4.9 (53.04 KB, text/plain)
2017-03-22 14:40 UTC, Tomasz Maciej Nowak
Details
trace output of unaffected kernel version 4.8 (35.61 KB, text/plain)
2017-03-22 14:41 UTC, Tomasz Maciej Nowak
Details
trace output of kernel version 4.9 with reverted cpufreq.c (41.17 KB, text/plain)
2017-03-22 14:44 UTC, Tomasz Maciej Nowak
Details
trial patch to fix (2.35 KB, text/plain)
2017-03-23 08:08 UTC, Chen Yu
Details
acpi dump of affected system VPCZ13C5E (239.03 KB, text/plain)
2017-03-23 13:10 UTC, Tomasz Maciej Nowak
Details
dmesg before bringing cpu1 down (58.28 KB, text/plain)
2017-03-23 13:11 UTC, Tomasz Maciej Nowak
Details
dmesg after bringing cpu1 up (58.86 KB, text/plain)
2017-03-23 13:14 UTC, Tomasz Maciej Nowak
Details

Description Tomasz Maciej Nowak 2017-02-13 17:49:27 UTC
Created attachment 254733 [details]
kernel log of affected system

After waking up from suspend only cpu core is active. In dmesg attached smpboot fails to boot sleeping cores with: "Error taking CPU1 up: -5"
Bisecting kernel wasn't very informative as during this process there was a lot of stages where suspend didn't work at all or kernel failed to boot. The result was commit
"cpufreq: Convert to hotplug state machine" 27622b061eb4bb4d16b5d61219ac10a792010321.

Last known good stable kernel version: 4.8.x
Bug still present in 4.10-rc8

The affected system is Sony VAIO VPCZ13C5E (for details see attached dmi.txt)
Comment 1 Tomasz Maciej Nowak 2017-02-13 17:51:03 UTC
Created attachment 254735 [details]
dmidecode output of affected system VPCZ13C5E
Comment 2 Len Brown 2017-02-14 00:25:34 UTC
what you you see if you manually take a processor offline/online?

"# echo 0 > /sys/devices/system/cpu/cpu1/online"

"# echo1 > /sys/devices/system/cpu/cpu1/online"
Comment 3 Tomasz Maciej Nowak 2017-02-14 09:14:09 UTC
(In reply to Len Brown from comment #2)
> what you you see if you manually take a processor offline/online?
> 
> "# echo 0 > /sys/devices/system/cpu/cpu1/online"
> 
> "# echo1 > /sys/devices/system/cpu/cpu1/online"

For echo 1:
[  140.619767] smpboot: CPU 1 is now offline

For echo 0 output is:
bash: echo: write error: Input/output error
In kmsg:
[  146.935909] x86: Booting SMP configuration:
[  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  146.941157] smpboot: CPU 1 is now offline

The same goes for cpu2 and cpu3.
Comment 4 Tomasz Maciej Nowak 2017-02-14 09:18:29 UTC
(In reply to Tomasz Maciej Nowak from comment #3)
> (In reply to Len Brown from comment #2)
> > what you you see if you manually take a processor offline/online?
> > 
> > "# echo 0 > /sys/devices/system/cpu/cpu1/online"
> > 
> > "# echo1 > /sys/devices/system/cpu/cpu1/online"
> 
> For echo 1:
> [  140.619767] smpboot: CPU 1 is now offline
> 
> For echo 0 output is:
> bash: echo: write error: Input/output error
> In kmsg:
> [  146.935909] x86: Booting SMP configuration:
> [  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [  146.941157] smpboot: CPU 1 is now offline
> 
> The same goes for cpu2 and cpu3.

Sorry for the noise, wrongly assigned outputs, should be:
For echo 0:
[  140.619767] smpboot: CPU 1 is now offline

For echo 1 output is:
bash: echo: write error: Input/output error
In kmsg:
[  146.935909] x86: Booting SMP configuration:
[  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  146.941157] smpboot: CPU 1 is now offline
Comment 5 Tomasz Maciej Nowak 2017-02-25 12:39:02 UTC
Created attachment 254913 [details]
sporadical backtrace of affected system VPCZ13C5E

Finally caught this as it appeared very rarely after the system wakeup.
Comment 6 Chen Yu 2017-03-06 03:07:21 UTC
(In reply to Tomasz Maciej Nowak from comment #4)
> (In reply to Tomasz Maciej Nowak from comment #3)
> > (In reply to Len Brown from comment #2)
> > > what you you see if you manually take a processor offline/online?
> > > 
> > > "# echo 0 > /sys/devices/system/cpu/cpu1/online"
> > > 
> > > "# echo1 > /sys/devices/system/cpu/cpu1/online"
> > 
> > For echo 1:
> > [  140.619767] smpboot: CPU 1 is now offline
> > 
> > For echo 0 output is:
> > bash: echo: write error: Input/output error
> > In kmsg:
> > [  146.935909] x86: Booting SMP configuration:
> > [  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
> > [  146.941157] smpboot: CPU 1 is now offline
> > 
> > The same goes for cpu2 and cpu3.
> 
> Sorry for the noise, wrongly assigned outputs, should be:
> For echo 0:
> [  140.619767] smpboot: CPU 1 is now offline
> 
> For echo 1 output is:
> bash: echo: write error: Input/output error
> In kmsg:
> [  146.935909] x86: Booting SMP configuration:
> [  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [  146.941157] smpboot: CPU 1 is now offline

So this issue can also be reproduced during normal cpu offline/online, right?
Please help check with:
1. cd /sys/kernel/debug/tracing/
2. echo 0 > tracing_on
3. echo function_graph > current_tracer
4. echo _cpu_up > set_graph_function
5. echo > trace
6. echo 1 > tracing_on
7.  
"# echo 0 > /sys/devices/system/cpu/cpu1/online"
"# echo1 > /sys/devices/system/cpu/cpu1/online"

8. cat trace > ~/trace_cpu_online_error.log
9. attach your trace_cpu_online_error.log
Comment 7 Tomasz Maciej Nowak 2017-03-06 11:32:26 UTC
Created attachment 255095 [details]
trace output of affected system VPCZ13C5E

Yes, either if I invoke suspend or take down cores manually, only reboot brings them back online.
Comment 8 Chen Yu 2017-03-12 17:45:58 UTC
BTW, do you still get the issue if you revert 
"cpufreq: Convert to hotplug state machine" 27622b061eb4bb4d16b5d61219ac10a792010321
?
Comment 9 Tomasz Maciej Nowak 2017-03-13 20:26:19 UTC
Created attachment 255221 [details]
patch used to revert cpufreq.c changes beetwen 4.8 and 4.9

Thanks for pointing that out, I should have done that at the beginning. Seems it fixes the non-booting cores after suspend. Single commit revert did not work so instead I reverted all changes in drivers/cpufreq/cpufreq.c between version 4.8 and 4.9.
Comment 10 Chen Yu 2017-03-20 14:45:44 UTC
Thanks Tomasz,
According to the logs in Comment 7, it looks like one of CPU1's cpuhp_callback returns a negative result ,which cause the system to cancel bringing up the CPU1.
And as you mentioned in Comment 9, it is possible that the cpufreq_online has failed and returned a negative value. But according to the commit, I can not figure out why it failed(the commit just put the cpufreq_online under cpuhp's scope without changing its semantics).
Please help test with the following steps again, and let's see if there is any clue there:
1. cd /sys/kernel/debug/tracing/
2. echo 1 > events/cpuhp/enable  
3. echo cpufreq_online > set_graph_function
4. echo function_graph > current_tracer
5. echo > trace
6. echo 0 > /sys/devices/system/cpu/cpu1/online
7. echo 1 > tracing_on
8. echo 1 > /sys/devices/system/cpu/cpu1/online
9. cat trace > ~/trace_cpufreq_online.log
10. attach your trace_cpufreq_online.log
Comment 11 Chen Yu 2017-03-20 15:00:19 UTC
(In reply to Chen Yu from comment #10)
> Thanks Tomasz,
> According to the logs in Comment 7, it looks like one of CPU1's
> cpuhp_callback returns a negative result ,which cause the system to cancel
> bringing up the CPU1.
> And as you mentioned in Comment 9, it is possible that the cpufreq_online
> has failed and returned a negative value. But according to the commit, I can
> not figure out why it failed(the commit just put the cpufreq_online under
> cpuhp's scope without changing its semantics).
> Please help test with the following steps again, and let's see if there is
> any clue there:
> 1. cd /sys/kernel/debug/tracing/
> 2. echo 1 > events/cpuhp/enable  
> 3. echo cpufreq_online > set_graph_function
> 4. echo function_graph > current_tracer
> 5. echo > trace
> 6. echo 0 > /sys/devices/system/cpu/cpu1/online
> 7. echo 1 > tracing_on
> 8. echo 1 > /sys/devices/system/cpu/cpu1/online
> 9. cat trace > ~/trace_cpufreq_online.log
> 10. attach your trace_cpufreq_online.log

Amend:

1. cd /sys/kernel/debug/tracing/
2. echo 1 > events/cpuhp/enable  
3. echo cpufreq_online > set_graph_function
4. echo function_graph > current_tracer
5. echo > set_ftrace_filter
6. echo 0 > tracing_on
7. echo > trace
8. echo 0 > /sys/devices/system/cpu/cpu1/online
9. echo 1 > tracing_on
10. echo 1 > /sys/devices/system/cpu/cpu1/online
11. cat trace > ~/trace_cpufreq_online.log
12. attach your trace_cpufreq_online.log
Comment 12 Tomasz Maciej Nowak 2017-03-22 14:40:13 UTC
Created attachment 255423 [details]
trace output of affected kernel version 4.9

Thanks for the reply. Also in additional attachments I added tracing output of unaffected kernel version and affected kernel version (4.9) with changes in cpufreq.c reverted to 4.8 state. I hope the added trace output will give a clue on what is happening. Btw, bug is still present in 4.11-rc3.
Comment 13 Tomasz Maciej Nowak 2017-03-22 14:41:14 UTC
Created attachment 255425 [details]
trace output of unaffected kernel version 4.8
Comment 14 Tomasz Maciej Nowak 2017-03-22 14:44:05 UTC
Created attachment 255429 [details]
trace output of kernel version 4.9 with reverted cpufreq.c
Comment 15 Chen Yu 2017-03-23 07:00:26 UTC
According to the log, the problem was caused by CPU1 does not have _PCT during online, should be here:
acpi_cpufreq_cpu_init->acpi_processor_register_performance
->acpi_processor_get_performance_info:
if (!acpi_has_method(pr->handle, "_PCT"))
Could you upload your acpidump raw data?
# acpidump > acpi.out
Comment 16 Chen Yu 2017-03-23 08:08:56 UTC
Created attachment 255441 [details]
trial patch to fix

Please also check if this patch works for you?

And please boot up the patched kernel with the following bootcommad int the grub menu:
dyndbg="file drivers/cpufreq/cpufreq.c +p"
And after boot up , please provide the dmesg.

Then tries normal cpuoffline/online and provide the dmesg again.
thx
Comment 17 Tomasz Maciej Nowak 2017-03-23 13:10:02 UTC
Created attachment 255447 [details]
acpi dump of affected system VPCZ13C5E

Yes, with the patch kernel brings up the offline cores. Additional informations in requested dmesg outputs.
Comment 18 Tomasz Maciej Nowak 2017-03-23 13:11:58 UTC
Created attachment 255449 [details]
dmesg before bringing cpu1 down
Comment 19 Tomasz Maciej Nowak 2017-03-23 13:14:23 UTC
Created attachment 255451 [details]
dmesg after bringing cpu1 up
Comment 20 Chen Yu 2017-03-25 04:27:14 UTC
OK, so there is no _PCT under your processer, patch has been sent to:
https://patchwork.kernel.org/patch/9644385/
thx.
Comment 21 Tomasz Maciej Nowak 2017-03-25 14:13:45 UTC
Thank You for resolving this issue.

Note You need to log in before you can comment on or make changes to this bug.