Bug 194581
Summary: | [Bisected] Intel i7-M640: only one cpu core active after resume from suspend(cpufreq_online return -5) | ||
---|---|---|---|
Product: | Power Management | Reporter: | Tomasz Maciej Nowak (tmn505) |
Component: | Hibernation/Suspend | Assignee: | Chen Yu (yu.c.chen) |
Status: | RESOLVED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | lenb, rjw, yu.c.chen |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 4.9 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
kernel log of affected system
dmidecode output of affected system VPCZ13C5E sporadical backtrace of affected system VPCZ13C5E trace output of affected system VPCZ13C5E patch used to revert cpufreq.c changes beetwen 4.8 and 4.9 trace output of affected kernel version 4.9 trace output of unaffected kernel version 4.8 trace output of kernel version 4.9 with reverted cpufreq.c trial patch to fix acpi dump of affected system VPCZ13C5E dmesg before bringing cpu1 down dmesg after bringing cpu1 up |
Created attachment 254735 [details]
dmidecode output of affected system VPCZ13C5E
what you you see if you manually take a processor offline/online? "# echo 0 > /sys/devices/system/cpu/cpu1/online" "# echo1 > /sys/devices/system/cpu/cpu1/online" (In reply to Len Brown from comment #2) > what you you see if you manually take a processor offline/online? > > "# echo 0 > /sys/devices/system/cpu/cpu1/online" > > "# echo1 > /sys/devices/system/cpu/cpu1/online" For echo 1: [ 140.619767] smpboot: CPU 1 is now offline For echo 0 output is: bash: echo: write error: Input/output error In kmsg: [ 146.935909] x86: Booting SMP configuration: [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 [ 146.941157] smpboot: CPU 1 is now offline The same goes for cpu2 and cpu3. (In reply to Tomasz Maciej Nowak from comment #3) > (In reply to Len Brown from comment #2) > > what you you see if you manually take a processor offline/online? > > > > "# echo 0 > /sys/devices/system/cpu/cpu1/online" > > > > "# echo1 > /sys/devices/system/cpu/cpu1/online" > > For echo 1: > [ 140.619767] smpboot: CPU 1 is now offline > > For echo 0 output is: > bash: echo: write error: Input/output error > In kmsg: > [ 146.935909] x86: Booting SMP configuration: > [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 > [ 146.941157] smpboot: CPU 1 is now offline > > The same goes for cpu2 and cpu3. Sorry for the noise, wrongly assigned outputs, should be: For echo 0: [ 140.619767] smpboot: CPU 1 is now offline For echo 1 output is: bash: echo: write error: Input/output error In kmsg: [ 146.935909] x86: Booting SMP configuration: [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 [ 146.941157] smpboot: CPU 1 is now offline Created attachment 254913 [details]
sporadical backtrace of affected system VPCZ13C5E
Finally caught this as it appeared very rarely after the system wakeup.
(In reply to Tomasz Maciej Nowak from comment #4) > (In reply to Tomasz Maciej Nowak from comment #3) > > (In reply to Len Brown from comment #2) > > > what you you see if you manually take a processor offline/online? > > > > > > "# echo 0 > /sys/devices/system/cpu/cpu1/online" > > > > > > "# echo1 > /sys/devices/system/cpu/cpu1/online" > > > > For echo 1: > > [ 140.619767] smpboot: CPU 1 is now offline > > > > For echo 0 output is: > > bash: echo: write error: Input/output error > > In kmsg: > > [ 146.935909] x86: Booting SMP configuration: > > [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 > > [ 146.941157] smpboot: CPU 1 is now offline > > > > The same goes for cpu2 and cpu3. > > Sorry for the noise, wrongly assigned outputs, should be: > For echo 0: > [ 140.619767] smpboot: CPU 1 is now offline > > For echo 1 output is: > bash: echo: write error: Input/output error > In kmsg: > [ 146.935909] x86: Booting SMP configuration: > [ 146.935952] smpboot: Booting Node 0 Processor 1 APIC 0x1 > [ 146.941157] smpboot: CPU 1 is now offline So this issue can also be reproduced during normal cpu offline/online, right? Please help check with: 1. cd /sys/kernel/debug/tracing/ 2. echo 0 > tracing_on 3. echo function_graph > current_tracer 4. echo _cpu_up > set_graph_function 5. echo > trace 6. echo 1 > tracing_on 7. "# echo 0 > /sys/devices/system/cpu/cpu1/online" "# echo1 > /sys/devices/system/cpu/cpu1/online" 8. cat trace > ~/trace_cpu_online_error.log 9. attach your trace_cpu_online_error.log Created attachment 255095 [details]
trace output of affected system VPCZ13C5E
Yes, either if I invoke suspend or take down cores manually, only reboot brings them back online.
BTW, do you still get the issue if you revert "cpufreq: Convert to hotplug state machine" 27622b061eb4bb4d16b5d61219ac10a792010321 ? Created attachment 255221 [details]
patch used to revert cpufreq.c changes beetwen 4.8 and 4.9
Thanks for pointing that out, I should have done that at the beginning. Seems it fixes the non-booting cores after suspend. Single commit revert did not work so instead I reverted all changes in drivers/cpufreq/cpufreq.c between version 4.8 and 4.9.
Thanks Tomasz, According to the logs in Comment 7, it looks like one of CPU1's cpuhp_callback returns a negative result ,which cause the system to cancel bringing up the CPU1. And as you mentioned in Comment 9, it is possible that the cpufreq_online has failed and returned a negative value. But according to the commit, I can not figure out why it failed(the commit just put the cpufreq_online under cpuhp's scope without changing its semantics). Please help test with the following steps again, and let's see if there is any clue there: 1. cd /sys/kernel/debug/tracing/ 2. echo 1 > events/cpuhp/enable 3. echo cpufreq_online > set_graph_function 4. echo function_graph > current_tracer 5. echo > trace 6. echo 0 > /sys/devices/system/cpu/cpu1/online 7. echo 1 > tracing_on 8. echo 1 > /sys/devices/system/cpu/cpu1/online 9. cat trace > ~/trace_cpufreq_online.log 10. attach your trace_cpufreq_online.log (In reply to Chen Yu from comment #10) > Thanks Tomasz, > According to the logs in Comment 7, it looks like one of CPU1's > cpuhp_callback returns a negative result ,which cause the system to cancel > bringing up the CPU1. > And as you mentioned in Comment 9, it is possible that the cpufreq_online > has failed and returned a negative value. But according to the commit, I can > not figure out why it failed(the commit just put the cpufreq_online under > cpuhp's scope without changing its semantics). > Please help test with the following steps again, and let's see if there is > any clue there: > 1. cd /sys/kernel/debug/tracing/ > 2. echo 1 > events/cpuhp/enable > 3. echo cpufreq_online > set_graph_function > 4. echo function_graph > current_tracer > 5. echo > trace > 6. echo 0 > /sys/devices/system/cpu/cpu1/online > 7. echo 1 > tracing_on > 8. echo 1 > /sys/devices/system/cpu/cpu1/online > 9. cat trace > ~/trace_cpufreq_online.log > 10. attach your trace_cpufreq_online.log Amend: 1. cd /sys/kernel/debug/tracing/ 2. echo 1 > events/cpuhp/enable 3. echo cpufreq_online > set_graph_function 4. echo function_graph > current_tracer 5. echo > set_ftrace_filter 6. echo 0 > tracing_on 7. echo > trace 8. echo 0 > /sys/devices/system/cpu/cpu1/online 9. echo 1 > tracing_on 10. echo 1 > /sys/devices/system/cpu/cpu1/online 11. cat trace > ~/trace_cpufreq_online.log 12. attach your trace_cpufreq_online.log Created attachment 255423 [details]
trace output of affected kernel version 4.9
Thanks for the reply. Also in additional attachments I added tracing output of unaffected kernel version and affected kernel version (4.9) with changes in cpufreq.c reverted to 4.8 state. I hope the added trace output will give a clue on what is happening. Btw, bug is still present in 4.11-rc3.
Created attachment 255425 [details]
trace output of unaffected kernel version 4.8
Created attachment 255429 [details]
trace output of kernel version 4.9 with reverted cpufreq.c
According to the log, the problem was caused by CPU1 does not have _PCT during online, should be here: acpi_cpufreq_cpu_init->acpi_processor_register_performance ->acpi_processor_get_performance_info: if (!acpi_has_method(pr->handle, "_PCT")) Could you upload your acpidump raw data? # acpidump > acpi.out Created attachment 255441 [details]
trial patch to fix
Please also check if this patch works for you?
And please boot up the patched kernel with the following bootcommad int the grub menu:
dyndbg="file drivers/cpufreq/cpufreq.c +p"
And after boot up , please provide the dmesg.
Then tries normal cpuoffline/online and provide the dmesg again.
thx
Created attachment 255447 [details]
acpi dump of affected system VPCZ13C5E
Yes, with the patch kernel brings up the offline cores. Additional informations in requested dmesg outputs.
Created attachment 255449 [details]
dmesg before bringing cpu1 down
Created attachment 255451 [details]
dmesg after bringing cpu1 up
OK, so there is no _PCT under your processer, patch has been sent to: https://patchwork.kernel.org/patch/9644385/ thx. Thank You for resolving this issue. |
Created attachment 254733 [details] kernel log of affected system After waking up from suspend only cpu core is active. In dmesg attached smpboot fails to boot sleeping cores with: "Error taking CPU1 up: -5" Bisecting kernel wasn't very informative as during this process there was a lot of stages where suspend didn't work at all or kernel failed to boot. The result was commit "cpufreq: Convert to hotplug state machine" 27622b061eb4bb4d16b5d61219ac10a792010321. Last known good stable kernel version: 4.8.x Bug still present in 4.10-rc8 The affected system is Sony VAIO VPCZ13C5E (for details see attached dmi.txt)