I have 4 (fake) NUMA nodes on my system, each with exactly one CPU. After a fresh boot, 'make -j4' runs across all CPUs and all NUMA nodes. However, after a mem suspend & resume, 'make -j4' no longer runs across all NUMA nodes, but only on one (and consequently on only one CPU in my case).

Steps to reproduce:
1) Make sure that processes are scheduled on each NUMA node, e.g. make -j4 runs on 4 CPUs concurrently
2) Suspend & resume
3) Observe that the condition from step 1) no longer holds

I've noticed this with the 3.18 kernel, and 3.19 is no better yet.
To show that a program can run on all 4 CPUs, and also to show the CPU usage on all 4 CPUs, please show the output from turbostat -v during the make for both the pre-suspend and post-suspend experiments.
ping Michal.
Created attachment 164741 [details] turbostat -v prior to suspend
Created attachment 164751 [details] turbostat -v after the suspend
Created attachment 164761 [details] turbostat -v prior to suspend
What is the last known good kernel?
Three things look broken here.

1. cpu0: MSR_IA32_POWER_CTL: 0x0004005f (C1E auto-promotion: ENabled)

This went from DISabled before suspend, to ENabled after suspend. It should not have changed. The effect is that you will no longer be able to access C1, as it will become synonymous with C1E. This isn't the issue that the bug report is filed against, but it is clearly a bug.

2. turbostat lost its mind:

Can you tell me what happened when this was spit out?

Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
   -   -    1739 99.97    1740    1447   0   0.03   0.00   0.00   0.00      97     97    0.00    0.00    0.00    0.00   33.99   25.83    0.60
   0   1    3479 99.97    3480    2893   0   0.03   0.00   0.00   0.00      97     97    0.00    0.00    0.00    0.00   33.99   25.83    0.60
   1   3    3479 99.97    3480    2893   0   0.03   0.00   0.00   0.00      91
/home/zippy/tmp/kernel.git/tools/power/x86/turbostat/turbostat: APERF or MPERF went backwards *
* Frequency results do not cover entire interval *
* fix this by running Linux-2.6.30 or later *
Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
- - 478278********262867114678241955591117 0******** 100.01 100.01 100.00 62 62 400.01 0.00 0.00 0.0011710.25 3348.8912839.39
0 1 1157470********1042217535596967847664753 -3023******** 100.00 100.00 100.00 62 62 100.00 0.00 0.00 0.0011710.25 3348.8912839.39
1 3 755641********1066033123229967847664754 -3023******** 100.00 100.00 100.00 61

3. After resume, you see only minimal usage on cpu3 while cpu1 is fully busy, yet you expect both CPUs to be fully busy.

Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
   -   -     908 50.84    1787    1447   0   7.69   0.58   0.20  40.69      78     78    0.00    0.00    0.00    0.00   21.75   13.54    0.15
   0   1    3556 99.48    3575    2893   0   0.52   0.00   0.00   0.00      78     78    0.00    0.00    0.00    0.00   21.75   13.54    0.15
   1   3      77  2.20    3492    2893   0  14.85   1.16   0.40  81.39      65

Note, however, that since turbostat gave you output for cpu3, it is possible for a thread to run on cpu3...

What if you add an additional cycle soaker, does it soak up cpu3?

# cat /dev/null > /dev/zero

What if you run that in background and bind it to cpu3 using taskset(1)? Can you make it busy that way?
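The pinned cycle soaker suggested above could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the function name pin_soaker and the SOAKER_PID variable are made up here, and a plain shell busy loop is used as the soaker because `cat /dev/null > /dev/zero` as written hits end-of-file immediately and exits instead of burning cycles.

```shell
#!/bin/sh
# Sketch only: start a busy loop bound to a single CPU with taskset(1).
# pin_soaker and SOAKER_PID are hypothetical names used for illustration.
pin_soaker() {
    cpu="$1"
    # taskset sets the affinity and then execs the shell, so $! is the soaker's PID.
    taskset -c "$cpu" sh -c 'while :; do :; done' >/dev/null 2>&1 &
    SOAKER_PID=$!
}
```

Typical use would be `pin_soaker 3`, then watching turbostat (or `ps -o psr= -p "$SOAKER_PID"`) to see whether cpu3 actually becomes busy, and finally `kill "$SOAKER_PID"`.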
(In reply to Rafael J. Wysocki from comment #6)
> What is the last known good kernel?

I think 3.17, although back then I had other issues (e.g. with a multihead Intel video card), so I can't tell for sure.
(In reply to Len Brown from comment #7)
> Three things look broken here.
>
>
> 2. turbostat lost its mind:
>
> Can you tell me what happened when this was spit out?
>
> Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>    -   -    1739 99.97    1740    1447   0   0.03   0.00   0.00   0.00      97     97    0.00    0.00    0.00    0.00   33.99   25.83    0.60
>    0   1    3479 99.97    3480    2893   0   0.03   0.00   0.00   0.00      97     97    0.00    0.00    0.00    0.00   33.99   25.83    0.60
>    1   3    3479 99.97    3480    2893   0   0.03   0.00   0.00   0.00      91
> /home/zippy/tmp/kernel.git/tools/power/x86/turbostat/turbostat: APERF or MPERF went backwards *
> * Frequency results do not cover entire interval *
> * fix this by running Linux-2.6.30 or later *
> Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
> - - 478278********262867114678241955591117 0******** 100.01 100.01 100.00 62 62 400.01 0.00 0.00 0.0011710.25 3348.8912839.39
> 0 1 1157470********1042217535596967847664753 -3023******** 100.00 100.00 100.00 62 62 100.00 0.00 0.00 0.0011710.25 3348.8912839.39
> 1 3 755641********1066033123229967847664754 -3023******** 100.00 100.00 100.00 61

I simply suspended the machine while turbostat was still running, and then resumed it (again with turbostat still running, obviously).

>
>
> 3. After resume, you see only minimal usage on cpu3 while cpu1 is fully busy,
> yet you expect both CPUs to be fully busy.
>
> Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>    -   -     908 50.84    1787    1447   0   7.69   0.58   0.20  40.69      78     78    0.00    0.00    0.00    0.00   21.75   13.54    0.15
>    0   1    3556 99.48    3575    2893   0   0.52   0.00   0.00   0.00      78     78    0.00    0.00    0.00    0.00   21.75   13.54    0.15
>    1   3      77  2.20    3492    2893   0  14.85   1.16   0.40  81.39      65
>
> Note, however, that since turbostat gave you output for cpu3, it is possible
> for a thread to run on cpu3...
>
> What if you add an additional cycle soaker, does it soak up cpu3?
>
> # cat /dev/null > /dev/zero
>
> What if you run that in background and bind it to cpu3 using taskset(1)?
> can you make it busy that way?

Unfortunately no.
One more thing I've noticed: I don't even need to go through a suspend & resume cycle. It's sufficient to offline all the CPUs except CPU0 and bring them online again.
I've found a gentoo bug describing the same symptoms: https://bugs.gentoo.org/show_bug.cgi?id=537834
Having the same problem, I poked around a bit and found out that this does not happen in a virtual machine in which a DMA32 zone exists on all NUMA nodes. I'm talking about the scheduling here; I don't know about the C1E auto-promotion for now. So the problem might be that the scheduler dispatches only to nodes with a DMA32 zone. Hope that helps a bit; let me know if I can help in any way.
Like csets, fake numa nodes set up in user-space simply don't survive the processor offline/online process, because there is no state associated with offline processors.

suspending the system uses processor offline/online.

so... if you want this user config to survive, you need to save/restore it before/after suspend/resume.

can you verify that manually invoking processor offline/online also causes this problem?

eg.
# echo 0 > /sys/devices/system/cpu/cpuN/online
# echo 1 > /sys/devices/system/cpu/cpuN/online

why is this marked as a regression -- did this work before?
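For anyone repeating the manual offline/online check, the two echo commands above can be applied to every CPU except cpu0 with a small loop. This is a rough sketch, not a script from the thread: the function name cycle_cpus and its optional root argument are invented here (the argument only exists so the loop can be dry-run against a scratch directory), and on a real system it needs to run as root.

```shell
#!/bin/sh
# Sketch only: offline every CPU except cpu0, then bring them all back online.
# cycle_cpus and its optional <root> argument are hypothetical; the default
# path is the sysfs location quoted in this thread.
cycle_cpus() {
    root="${1:-/sys/devices/system/cpu}"
    for state in 0 1; do
        # cpu[1-9]* skips cpu0 as well as non-CPU entries such as cpuidle.
        for f in "$root"/cpu[1-9]*/online; do
            [ -e "$f" ] || continue   # glob matched nothing
            echo "$state" > "$f"
        done
    done
}
```

Calling `cycle_cpus` with no argument performs the real offline/online pass; re-running 'make -j4' afterwards shows whether the load is still spread over all nodes.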
Created attachment 171911 [details] turbostat output with manual online->offline->online for cpus
(In reply to Len Brown from comment #13)
> Like csets, fake numa nodes set up in user-space simply
> don't survive the processor offline/online process,
> because there is no state associated with offline processors.
>

By user-space you mean non-hardware, but in software (kernel), right?

> suspending the system uses processor offline/online.
>
> so...
> if you want this user config to survive, you need to
> save/restore it before/after suspend/resume.
>

I'm not sure what user config you are talking about; the only thing I'm setting up is the kernel command line.

> can you verify that manually invoking processor offline/online
> also causes this problem?
>
> eg.
> # echo 0 > /sys/devices/system/cpu/cpuN/online
> # echo 1 > /sys/devices/system/cpu/cpuN/online
>

Yes, as Michal wrote in comment #10, this causes the same issue. I reproduced this by manually switching the CPUs off and on while running turbostat; see the attachment added in comment #14.

>
> why is this marked as a regression -- did this work before?

Yes, this worked with 3.17.* for me.
The data in the attachment from comment #14 were collected with 3.19.2, but I am now testing with 4.0.0-rc4, specifically tag next-20150323, and the bug is still there. The only difference is that the load is no longer kept on NUMA node 0 (cpus 0 and 4), but on node 3 (cpus 3 and 7), which is even weirder.
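To check which CPUs belong to which (fake) NUMA node, as in the node 0 / node 3 observation above, the per-node cpulist files in sysfs can be printed. A minimal sketch, assuming the standard /sys/devices/system/node layout; the function name list_node_cpus and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# Sketch only: print "nodeN: <cpulist>" for every NUMA node the kernel exposes.
# list_node_cpus and its optional <root> argument are hypothetical names.
list_node_cpus() {
    root="${1:-/sys/devices/system/node}"
    for d in "$root"/node[0-9]*; do
        [ -r "$d/cpulist" ] || continue   # no such node, or glob matched nothing
        printf '%s: %s\n' "${d##*/}" "$(cat "$d/cpulist")"
    done
}
```

Running `list_node_cpus` before and after the offline/online cycle makes it easy to compare the node layout with the CPUs that turbostat reports as busy.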
OK, so this is not a power management bug, it is a CPU online/offline problem. You should be able to use bisection to identify the exact kernel commit that broke things for you. Failing that, I'm afraid it will be difficult to isolate it.
Created attachment 173281 [details]
Partial git bisect log

Without knowing which changes to look at, isolating this starts at "roughly 13 steps", and each step means a rebuild plus a reboot (plus another reboot if that commit is bad). This will probably take a long time, so I have only just started. In case someone wants to continue, here are my last good/bad commits:

bad: dfe2c6dcc8ca2cdc662d7c0473e9811b72ef3370
good: 5e40d331bd72447197f26525f21711c4a265b6a6

The output of 'git bisect log' is in the attachment.
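Anyone continuing from those two commits could restart the bisection along these lines. This is only a sketch: the helper names start_bisect and mark_result are invented here, and the rebuild, reboot and offline/online test between steps still have to be done by hand.

```shell
#!/bin/sh
# Sketch only: thin wrappers around git bisect for a kernel tree.
# start_bisect and mark_result are hypothetical helper names.

# Usage: start_bisect <repo-dir> <bad-commit> <good-commit>
start_bisect() {
    git -C "$1" bisect start "$2" "$3"
}

# Usage: mark_result <repo-dir> good|bad
# Call this after building, booting and testing the commit git checked out.
mark_result() {
    git -C "$1" bisect "$2"
}
```

For example, `start_bisect ~/linux dfe2c6dcc8ca2cdc662d7c0473e9811b72ef3370 5e40d331bd72447197f26525f21711c4a265b6a6` (the ~/linux path is just a placeholder) checks out the first commit to test, and `mark_result ~/linux good` or `mark_result ~/linux bad` records each verdict until git names the first bad commit.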
After 8 hours spent bisecting this, the first bad commit is ... wait for it ... yes, the merge commit 9d9420f1209a1facea7110d549ac695f5aeeb503.
@Len, can you please provide an update?
Disregard my comment #19. I re-tested (rebased the branches on top of each other, sorted out the configs and started bisecting again) and found out that I had made a mistake in the process. I'm on the right track now and I'll have the results soon. Based on the result I'll try to elaborate on what the fix could be.
So the scheduling was broken by the following commit:

commit cebf15eb09a2fd2fa73ee4faa9c4d2f813cf0f09
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date: Thu Sep 18 12:33:34 2014 -0700

    x86, sched: Add new topology for multi-NUMA-node CPUs

However, I couldn't find anything wrong there apart from a typo in one line, which is fixed right in the following commit:

commit 728e5653e6fdb2a0892e94a600aef8c9a036c7eb
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date: Tue Sep 30 14:45:46 2014 -0700

    sched/x86: Fix up typo in topology detection

which, unfortunately, doesn't fix the problem.
Created attachment 182161 [details]
Workaround patch fixing the problem

OK, so I removed the lines that decide how to schedule the current task. It fixes the problem, although I understand that this is not the right thing to do and that the real problem is somewhere else. Let me know if I can help in any way, or point me to where I should keep looking, as I'm not a kernel developer. Using fake NUMA nodes is vital for me for testing and developing other software.
Looks like this might be fixed with https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=8f37961cf22304fb286c7604d3a7f6104dcc1283 I have not tried it yet.