Created attachment 254733 [details] kernel log of affected system After waking up from suspend only cpu core is active. In dmesg attached smpboot fails to boot sleeping cores with: "Error taking CPU1 up: -5" Bisecting kernel wasn't very informative as during this process there was a lot of stages where suspend didn't work at all or kernel failed to boot. The result was commit "cpufreq: Convert to hotplug state machine" 27622b061eb4bb4d16b5d61219ac10a792010321. Last known good stable kernel version: 4.8.x Bug still present in 4.10-rc8 The affected system is Sony VAIO VPCZ13C5E (for details see attached dmi.txt)
Created attachment 254735 [details] dmidecode output of affected system VPCZ13C5E
what you you see if you manually take a processor offline/online? "# echo 0 > /sys/devices/system/cpu/cpu1/online" "# echo1 > /sys/devices/system/cpu/cpu1/online"
(In reply to Len Brown from comment #2) > what you you see if you manually take a processor offline/online? > > "# echo 0 > /sys/devices/system/cpu/cpu1/online" > > "# echo1 > /sys/devices/system/cpu/cpu1/online" For echo 1: [ 140.619767] smpboot: CPU 1 is now offline For echo 0 output is: bash: echo: write error: Input/output error In kmsg: [ 146.935909] x86: Booting SMP configuration: [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 [ 146.941157] smpboot: CPU 1 is now offline The same goes for cpu2 and cpu3.
(In reply to Tomasz Maciej Nowak from comment #3) > (In reply to Len Brown from comment #2) > > what you you see if you manually take a processor offline/online? > > > > "# echo 0 > /sys/devices/system/cpu/cpu1/online" > > > > "# echo1 > /sys/devices/system/cpu/cpu1/online" > > For echo 1: > [ 140.619767] smpboot: CPU 1 is now offline > > For echo 0 output is: > bash: echo: write error: Input/output error > In kmsg: > [ 146.935909] x86: Booting SMP configuration: > [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 > [ 146.941157] smpboot: CPU 1 is now offline > > The same goes for cpu2 and cpu3. Sorry for the noise, wrongly assigned outputs, should be: For echo 0: [ 140.619767] smpboot: CPU 1 is now offline For echo 1 output is: bash: echo: write error: Input/output error In kmsg: [ 146.935909] x86: Booting SMP configuration: [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 [ 146.941157] smpboot: CPU 1 is now offline
Created attachment 254913 [details] sporadical backtrace of affected system VPCZ13C5E Finally caught this as it appeared very rarely after the system wakeup.
(In reply to Tomasz Maciej Nowak from comment #4) > (In reply to Tomasz Maciej Nowak from comment #3) > > (In reply to Len Brown from comment #2) > > > what you you see if you manually take a processor offline/online? > > > > > > "# echo 0 > /sys/devices/system/cpu/cpu1/online" > > > > > > "# echo1 > /sys/devices/system/cpu/cpu1/online" > > > > For echo 1: > > [ 140.619767] smpboot: CPU 1 is now offline > > > > For echo 0 output is: > > bash: echo: write error: Input/output error > > In kmsg: > > [ 146.935909] x86: Booting SMP configuration: > > [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 > > [ 146.941157] smpboot: CPU 1 is now offline > > > > The same goes for cpu2 and cpu3. > > Sorry for the noise, wrongly assigned outputs, should be: > For echo 0: > [ 140.619767] smpboot: CPU 1 is now offline > > For echo 1 output is: > bash: echo: write error: Input/output error > In kmsg: > [ 146.935909] x86: Booting SMP configuration: > [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 > [ 146.941157] smpboot: CPU 1 is now offline So this issue can also be reproduced during normal cpu offline/online, right? Please help check with: 1. cd /sys/kernel/debug/tracing/ 2. echo 0 > tracing_on 3. echo function_graph > current_tracer 4. echo _cpu_up > set_graph_function 5. echo > trace 6. echo 1 > tracing_on 7. "# echo 0 > /sys/devices/system/cpu/cpu1/online" "# echo1 > /sys/devices/system/cpu/cpu1/online" 8. cat trace > ~/trace_cpu_online_error.log 9. attach your trace_cpu_online_error.log
Created attachment 255095 [details] trace output of affected system VPCZ13C5E Yes, either if I invoke suspend or take down cores manually, only reboot brings them back online.
BTW, do you still get the issue if you revert "cpufreq: Convert to hotplug state machine" 27622b061eb4bb4d16b5d61219ac10a792010321 ?
Created attachment 255221 [details] patch used to revert cpufreq.c changes beetwen 4.8 and 4.9 Thanks for pointing that out, I should have done that at the beginning. Seems it fixes the non-booting cores after suspend. Single commit revert did not work so instead I reverted all changes in drivers/cpufreq/cpufreq.c between version 4.8 and 4.9.
Thanks Tomasz, According to the logs in Comment 7, it looks like one of CPU1's cpuhp_callback returns a negative result ,which cause the system to cancel bringing up the CPU1. And as you mentioned in Comment 9, it is possible that the cpufreq_online has failed and returned a negative value. But according to the commit, I can not figure out why it failed(the commit just put the cpufreq_online under cpuhp's scope without changing its semantics). Please help test with the following steps again, and let's see if there is any clue there: 1. cd /sys/kernel/debug/tracing/ 2. echo 1 > events/cpuhp/enable 3. echo cpufreq_online > set_graph_function 4. echo function_graph > current_tracer 5. echo > trace 6. echo 0 > /sys/devices/system/cpu/cpu1/online 7. echo 1 > tracing_on 8. echo 1 > /sys/devices/system/cpu/cpu1/online 9. cat trace > ~/trace_cpufreq_online.log 10. attach your trace_cpufreq_online.log
(In reply to Chen Yu from comment #10) > Thanks Tomasz, > According to the logs in Comment 7, it looks like one of CPU1's > cpuhp_callback returns a negative result ,which cause the system to cancel > bringing up the CPU1. > And as you mentioned in Comment 9, it is possible that the cpufreq_online > has failed and returned a negative value. But according to the commit, I can > not figure out why it failed(the commit just put the cpufreq_online under > cpuhp's scope without changing its semantics). > Please help test with the following steps again, and let's see if there is > any clue there: > 1. cd /sys/kernel/debug/tracing/ > 2. echo 1 > events/cpuhp/enable > 3. echo cpufreq_online > set_graph_function > 4. echo function_graph > current_tracer > 5. echo > trace > 6. echo 0 > /sys/devices/system/cpu/cpu1/online > 7. echo 1 > tracing_on > 8. echo 1 > /sys/devices/system/cpu/cpu1/online > 9. cat trace > ~/trace_cpufreq_online.log > 10. attach your trace_cpufreq_online.log Amend: 1. cd /sys/kernel/debug/tracing/ 2. echo 1 > events/cpuhp/enable 3. echo cpufreq_online > set_graph_function 4. echo function_graph > current_tracer 5. echo > set_ftrace_filter 6. echo 0 > tracing_on 7. echo > trace 8. echo 0 > /sys/devices/system/cpu/cpu1/online 9. echo 1 > tracing_on 10. echo 1 > /sys/devices/system/cpu/cpu1/online 11. cat trace > ~/trace_cpufreq_online.log 12. attach your trace_cpufreq_online.log
Created attachment 255423 [details] trace output of affected kernel version 4.9 Thanks for the reply. Also in additional attachments I added tracing output of unaffected kernel version and affected kernel version (4.9) with changes in cpufreq.c reverted to 4.8 state. I hope the added trace output will give a clue on what is happening. Btw, bug is still present in 4.11-rc3.
Created attachment 255425 [details] trace output of unaffected kernel version 4.8
Created attachment 255429 [details] trace output of kernel version 4.9 with reverted cpufreq.c
According to the log, the problem was caused by CPU1 does not have _PCT during online, should be here: acpi_cpufreq_cpu_init->acpi_processor_register_performance ->acpi_processor_get_performance_info: if (!acpi_has_method(pr->handle, "_PCT")) Could you upload your acpidump raw data? # acpidump > acpi.out
Created attachment 255441 [details] trial patch to fix Please also check if this patch works for you? And please boot up the patched kernel with the following bootcommad int the grub menu: dyndbg="file drivers/cpufreq/cpufreq.c +p" And after boot up , please provide the dmesg. Then tries normal cpuoffline/online and provide the dmesg again. thx
Created attachment 255447 [details] acpi dump of affected system VPCZ13C5E Yes, with the patch kernel brings up the offline cores. Additional informations in requested dmesg outputs.
Created attachment 255449 [details] dmesg before bringing cpu1 down
Created attachment 255451 [details] dmesg after bringing cpu1 up
OK, so there is no _PCT under your processer, patch has been sent to: https://patchwork.kernel.org/patch/9644385/ thx.
Thank You for resolving this issue.