Bug 194581

Summary: [Bisected] Intel i7-M640: only one cpu core active after resume from suspend(cpufreq_online return -5)
Product: Power Management Reporter: Tomasz Maciej Nowak (tmn505)
Component: Hibernation/SuspendAssignee: Chen Yu (yu.c.chen)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: lenb, rjw, yu.c.chen
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.9 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: kernel log of affected system
dmidecode output of affected system VPCZ13C5E
sporadical backtrace of affected system VPCZ13C5E
trace output of affected system VPCZ13C5E
patch used to revert cpufreq.c changes beetwen 4.8 and 4.9
trace output of affected kernel version 4.9
trace output of unaffected kernel version 4.8
trace output of kernel version 4.9 with reverted cpufreq.c
trial patch to fix
acpi dump of affected system VPCZ13C5E
dmesg before bringing cpu1 down
dmesg after bringing cpu1 up

Description Tomasz Maciej Nowak 2017-02-13 17:49:27 UTC
Created attachment 254733 [details]
kernel log of affected system

After waking up from suspend only cpu core is active. In dmesg attached smpboot fails to boot sleeping cores with: "Error taking CPU1 up: -5"
Bisecting kernel wasn't very informative as during this process there was a lot of stages where suspend didn't work at all or kernel failed to boot. The result was commit
"cpufreq: Convert to hotplug state machine" 27622b061eb4bb4d16b5d61219ac10a792010321.

Last known good stable kernel version: 4.8.x
Bug still present in 4.10-rc8

The affected system is Sony VAIO VPCZ13C5E (for details see attached dmi.txt)
Comment 1 Tomasz Maciej Nowak 2017-02-13 17:51:03 UTC
Created attachment 254735 [details]
dmidecode output of affected system VPCZ13C5E
Comment 2 Len Brown 2017-02-14 00:25:34 UTC
what you you see if you manually take a processor offline/online?

"# echo 0 > /sys/devices/system/cpu/cpu1/online"

"# echo1 > /sys/devices/system/cpu/cpu1/online"
Comment 3 Tomasz Maciej Nowak 2017-02-14 09:14:09 UTC
(In reply to Len Brown from comment #2)
> what you you see if you manually take a processor offline/online?
> 
> "# echo 0 > /sys/devices/system/cpu/cpu1/online"
> 
> "# echo1 > /sys/devices/system/cpu/cpu1/online"

For echo 1:
[  140.619767] smpboot: CPU 1 is now offline

For echo 0 output is:
bash: echo: write error: Input/output error
In kmsg:
[  146.935909] x86: Booting SMP configuration:
[  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  146.941157] smpboot: CPU 1 is now offline

The same goes for cpu2 and cpu3.
Comment 4 Tomasz Maciej Nowak 2017-02-14 09:18:29 UTC
(In reply to Tomasz Maciej Nowak from comment #3)
> (In reply to Len Brown from comment #2)
> > what you you see if you manually take a processor offline/online?
> > 
> > "# echo 0 > /sys/devices/system/cpu/cpu1/online"
> > 
> > "# echo1 > /sys/devices/system/cpu/cpu1/online"
> 
> For echo 1:
> [  140.619767] smpboot: CPU 1 is now offline
> 
> For echo 0 output is:
> bash: echo: write error: Input/output error
> In kmsg:
> [  146.935909] x86: Booting SMP configuration:
> [  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [  146.941157] smpboot: CPU 1 is now offline
> 
> The same goes for cpu2 and cpu3.

Sorry for the noise, wrongly assigned outputs, should be:
For echo 0:
[  140.619767] smpboot: CPU 1 is now offline

For echo 1 output is:
bash: echo: write error: Input/output error
In kmsg:
[  146.935909] x86: Booting SMP configuration:
[  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
[  146.941157] smpboot: CPU 1 is now offline
Comment 5 Tomasz Maciej Nowak 2017-02-25 12:39:02 UTC
Created attachment 254913 [details]
sporadical backtrace of affected system VPCZ13C5E

Finally caught this as it appeared very rarely after the system wakeup.
Comment 6 Chen Yu 2017-03-06 03:07:21 UTC
(In reply to Tomasz Maciej Nowak from comment #4)
> (In reply to Tomasz Maciej Nowak from comment #3)
> > (In reply to Len Brown from comment #2)
> > > what you you see if you manually take a processor offline/online?
> > > 
> > > "# echo 0 > /sys/devices/system/cpu/cpu1/online"
> > > 
> > > "# echo1 > /sys/devices/system/cpu/cpu1/online"
> > 
> > For echo 1:
> > [  140.619767] smpboot: CPU 1 is now offline
> > 
> > For echo 0 output is:
> > bash: echo: write error: Input/output error
> > In kmsg:
> > [  146.935909] x86: Booting SMP configuration:
> > [  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
> > [  146.941157] smpboot: CPU 1 is now offline
> > 
> > The same goes for cpu2 and cpu3.
> 
> Sorry for the noise, wrongly assigned outputs, should be:
> For echo 0:
> [  140.619767] smpboot: CPU 1 is now offline
> 
> For echo 1 output is:
> bash: echo: write error: Input/output error
> In kmsg:
> [  146.935909] x86: Booting SMP configuration:
> [  146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [  146.941157] smpboot: CPU 1 is now offline

So this issue can also be reproduced during normal cpu offline/online, right?
Please help check with:
1. cd /sys/kernel/debug/tracing/
2. echo 0 > tracing_on
3. echo function_graph > current_tracer
4. echo _cpu_up > set_graph_function
5. echo > trace
6. echo 1 > tracing_on
7.  
"# echo 0 > /sys/devices/system/cpu/cpu1/online"
"# echo1 > /sys/devices/system/cpu/cpu1/online"

8. cat trace > ~/trace_cpu_online_error.log
9. attach your trace_cpu_online_error.log
Comment 7 Tomasz Maciej Nowak 2017-03-06 11:32:26 UTC
Created attachment 255095 [details]
trace output of affected system VPCZ13C5E

Yes, either if I invoke suspend or take down cores manually, only reboot brings them back online.
Comment 8 Chen Yu 2017-03-12 17:45:58 UTC
BTW, do you still get the issue if you revert 
"cpufreq: Convert to hotplug state machine" 27622b061eb4bb4d16b5d61219ac10a792010321
?
Comment 9 Tomasz Maciej Nowak 2017-03-13 20:26:19 UTC
Created attachment 255221 [details]
patch used to revert cpufreq.c changes beetwen 4.8 and 4.9

Thanks for pointing that out, I should have done that at the beginning. Seems it fixes the non-booting cores after suspend. Single commit revert did not work so instead I reverted all changes in drivers/cpufreq/cpufreq.c between version 4.8 and 4.9.
Comment 10 Chen Yu 2017-03-20 14:45:44 UTC
Thanks Tomasz,
According to the logs in Comment 7, it looks like one of CPU1's cpuhp_callback returns a negative result ,which cause the system to cancel bringing up the CPU1.
And as you mentioned in Comment 9, it is possible that the cpufreq_online has failed and returned a negative value. But according to the commit, I can not figure out why it failed(the commit just put the cpufreq_online under cpuhp's scope without changing its semantics).
Please help test with the following steps again, and let's see if there is any clue there:
1. cd /sys/kernel/debug/tracing/
2. echo 1 > events/cpuhp/enable  
3. echo cpufreq_online > set_graph_function
4. echo function_graph > current_tracer
5. echo > trace
6. echo 0 > /sys/devices/system/cpu/cpu1/online
7. echo 1 > tracing_on
8. echo 1 > /sys/devices/system/cpu/cpu1/online
9. cat trace > ~/trace_cpufreq_online.log
10. attach your trace_cpufreq_online.log
Comment 11 Chen Yu 2017-03-20 15:00:19 UTC
(In reply to Chen Yu from comment #10)
> Thanks Tomasz,
> According to the logs in Comment 7, it looks like one of CPU1's
> cpuhp_callback returns a negative result ,which cause the system to cancel
> bringing up the CPU1.
> And as you mentioned in Comment 9, it is possible that the cpufreq_online
> has failed and returned a negative value. But according to the commit, I can
> not figure out why it failed(the commit just put the cpufreq_online under
> cpuhp's scope without changing its semantics).
> Please help test with the following steps again, and let's see if there is
> any clue there:
> 1. cd /sys/kernel/debug/tracing/
> 2. echo 1 > events/cpuhp/enable  
> 3. echo cpufreq_online > set_graph_function
> 4. echo function_graph > current_tracer
> 5. echo > trace
> 6. echo 0 > /sys/devices/system/cpu/cpu1/online
> 7. echo 1 > tracing_on
> 8. echo 1 > /sys/devices/system/cpu/cpu1/online
> 9. cat trace > ~/trace_cpufreq_online.log
> 10. attach your trace_cpufreq_online.log

Amend:

1. cd /sys/kernel/debug/tracing/
2. echo 1 > events/cpuhp/enable  
3. echo cpufreq_online > set_graph_function
4. echo function_graph > current_tracer
5. echo > set_ftrace_filter
6. echo 0 > tracing_on
7. echo > trace
8. echo 0 > /sys/devices/system/cpu/cpu1/online
9. echo 1 > tracing_on
10. echo 1 > /sys/devices/system/cpu/cpu1/online
11. cat trace > ~/trace_cpufreq_online.log
12. attach your trace_cpufreq_online.log
Comment 12 Tomasz Maciej Nowak 2017-03-22 14:40:13 UTC
Created attachment 255423 [details]
trace output of affected kernel version 4.9

Thanks for the reply. Also in additional attachments I added tracing output of unaffected kernel version and affected kernel version (4.9) with changes in cpufreq.c reverted to 4.8 state. I hope the added trace output will give a clue on what is happening. Btw, bug is still present in 4.11-rc3.
Comment 13 Tomasz Maciej Nowak 2017-03-22 14:41:14 UTC
Created attachment 255425 [details]
trace output of unaffected kernel version 4.8
Comment 14 Tomasz Maciej Nowak 2017-03-22 14:44:05 UTC
Created attachment 255429 [details]
trace output of kernel version 4.9 with reverted cpufreq.c
Comment 15 Chen Yu 2017-03-23 07:00:26 UTC
According to the log, the problem was caused by CPU1 does not have _PCT during online, should be here:
acpi_cpufreq_cpu_init->acpi_processor_register_performance
->acpi_processor_get_performance_info:
if (!acpi_has_method(pr->handle, "_PCT"))
Could you upload your acpidump raw data?
# acpidump > acpi.out
Comment 16 Chen Yu 2017-03-23 08:08:56 UTC
Created attachment 255441 [details]
trial patch to fix

Please also check if this patch works for you?

And please boot up the patched kernel with the following bootcommad int the grub menu:
dyndbg="file drivers/cpufreq/cpufreq.c +p"
And after boot up , please provide the dmesg.

Then tries normal cpuoffline/online and provide the dmesg again.
thx
Comment 17 Tomasz Maciej Nowak 2017-03-23 13:10:02 UTC
Created attachment 255447 [details]
acpi dump of affected system VPCZ13C5E

Yes, with the patch kernel brings up the offline cores. Additional informations in requested dmesg outputs.
Comment 18 Tomasz Maciej Nowak 2017-03-23 13:11:58 UTC
Created attachment 255449 [details]
dmesg before bringing cpu1 down
Comment 19 Tomasz Maciej Nowak 2017-03-23 13:14:23 UTC
Created attachment 255451 [details]
dmesg after bringing cpu1 up
Comment 20 Chen Yu 2017-03-25 04:27:14 UTC
OK, so there is no _PCT under your processer, patch has been sent to:
https://patchwork.kernel.org/patch/9644385/
thx.
Comment 21 Tomasz Maciej Nowak 2017-03-25 14:13:45 UTC
Thank You for resolving this issue.