Bug 90891 - After resume, scheduler's not dispatching over distant NUMA nodes
Status: NEEDINFO
Alias: None
Product: Process Management
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: Len Brown
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-07 10:43 UTC by Michal Privoznik
Modified: 2016-11-16 14:13 UTC
CC List: 5 users

See Also:
Kernel Version: 3.19.0-rc3+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
turbostat -v prior to suspend (14 bytes, text/plain)
2015-01-26 06:04 UTC, Michal Privoznik
turbostat -v after the suspend (7.38 KB, text/plain)
2015-01-26 06:04 UTC, Michal Privoznik
turbostat -v prior to suspend (16.09 KB, text/plain)
2015-01-26 06:07 UTC, Michal Privoznik
turbostat output with manual online->offline->online for cpus (10.06 KB, text/plain)
2015-03-24 06:30 UTC, Martin Kletzander
Partial git bisect log (1.08 KB, text/plain)
2015-04-07 08:53 UTC, Martin Kletzander
Workaround patch fixing the problem (536 bytes, patch)
2015-07-08 08:38 UTC, Martin Kletzander

Description Michal Privoznik 2015-01-07 10:43:46 UTC
So I have 4 (fake) NUMA nodes on my system, each with exactly one CPU. After a fresh boot, 'make -j4' runs on all CPUs, across all NUMA nodes. However, after a suspend to RAM and resume, 'make -j4' no longer runs across all NUMA nodes, but only on one (and consequently on only one CPU in my case).

Steps to reproduce:
1) Make sure that processes are scheduled on each NUMA node, e.g. that make -j4 runs on 4 CPUs concurrently
2) Suspend & resume
3) Observe that the condition from step 1) no longer holds

I've noticed this with the 3.18 kernel, and 3.19 is no better.
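
To check steps 1) and 3), it helps to know which CPUs belong to which NUMA node. A minimal sketch, assuming the usual sysfs layout under /sys/devices/system/node; the function name and its optional root argument are illustrative and not part of the original report:

```sh
# Print "nodeN: <cpulist>" for every NUMA node found under the given sysfs root.
# The optional first argument (default /sys) is a hypothetical knob for testing.
list_node_cpus() {
    root="${1:-/sys}"
    for dir in "$root"/devices/system/node/node[0-9]*; do
        # Skip the unexpanded glob or nodes without a readable cpulist.
        [ -r "$dir/cpulist" ] || continue
        printf '%s: %s\n' "${dir##*/}" "$(cat "$dir/cpulist")"
    done
}
```

Running list_node_cpus before and after the suspend shows whether the node-to-CPU mapping itself changed, independently of where the make jobs end up.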
Comment 1 Len Brown 2015-01-13 01:25:43 UTC
to show that a program can run on all 4 cpus,
and also to show the cpu usage on all 4 cpus...

please show the output from turbostat -v during the make
for both the pre-suspend and post-suspend experiments.
Comment 2 Zhang Rui 2015-01-26 03:05:08 UTC
ping Michal.
Comment 3 Michal Privoznik 2015-01-26 06:04:09 UTC
Created attachment 164741
turbostat -v prior to suspend
Comment 4 Michal Privoznik 2015-01-26 06:04:44 UTC
Created attachment 164751
turbostat -v after the suspend
Comment 5 Michal Privoznik 2015-01-26 06:07:57 UTC
Created attachment 164761
turbostat -v prior to suspend
Comment 6 Rafael J. Wysocki 2015-01-26 22:22:36 UTC
What is the last known good kernel?
Comment 7 Len Brown 2015-01-27 01:25:53 UTC
Three things look broken here.

1. cpu0: MSR_IA32_POWER_CTL: 0x0004005f (C1E auto-promotion: ENabled)
This went from DISabled before suspend, to ENabled after suspend.
It should not have changed.  The effect is that you will no longer
be able to access C1, as it will become synonymous with C1E.
This isn't the issue that the bug report is filed against,
but it is clearly a bug.

2. turbostat lost its mind:

Can you tell me what happened when this was spit out?

    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -    1739   99.97    1740    1447       0    0.03    0.00    0.00    0.00      97      97    0.00    0.00    0.00    0.00   33.99   25.83    0.60
       0       1    3479   99.97    3480    2893       0    0.03    0.00    0.00    0.00      97      97    0.00    0.00    0.00    0.00   33.99   25.83    0.60
       1       3    3479   99.97    3480    2893       0    0.03    0.00    0.00    0.00      91
/home/zippy/tmp/kernel.git/tools/power/x86/turbostat/turbostat: APERF or MPERF went backwards *
* Frequency results do not cover entire interval *
* fix this by running Linux-2.6.30 or later *
    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -  478278********262867114678241955591117       0********  100.01  100.01  100.00      62      62  400.01    0.00    0.00    0.0011710.25 3348.8912839.39
       0       1 1157470********1042217535596967847664753   -3023********  100.00  100.00  100.00      62      62  100.00    0.00    0.00    0.0011710.25 3348.8912839.39
       1       3  755641********1066033123229967847664754   -3023********  100.00  100.00  100.00      61


3. After resume, you see only minimal usage on cpu3 while cpu1 is fully busy,
yet you expect both CPUs to be fully busy.

    Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
       -       -     908   50.84    1787    1447       0    7.69    0.58    0.20   40.69      78      78    0.00    0.00    0.00    0.00   21.75   13.54    0.15
       0       1    3556   99.48    3575    2893       0    0.52    0.00    0.00    0.00      78      78    0.00    0.00    0.00    0.00   21.75   13.54    0.15
       1       3      77    2.20    3492    2893       0   14.85    1.16    0.40   81.39      65

Note, however, that since turbostat gave you output for cpu3, it is possible for a thread to run on cpu3...

What if you add an additional cycle soaker, does it soak up cpu3?

# cat /dev/zero > /dev/null

What if you run that in background and bind it to cpu3 using taskset(1)?
can you make it busy that way?
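
The pinning experiment asked for above could be sketched as follows. This assumes taskset from util-linux and a Linux /proc; soak_cpu and last_cpu_of are hypothetical helper names, and the busy loop is just one possible cycle soaker:

```sh
# Start a busy loop pinned to one CPU and print the loop's PID.
# soak_cpu and last_cpu_of are hypothetical helper names.
soak_cpu() {
    # Redirect output so the background job does not hold the caller's pipe open.
    taskset -c "$1" sh -c 'while :; do :; done' >/dev/null 2>&1 &
    echo "$!"
}

# Print the CPU a process last ran on (field 39 of /proc/PID/stat).
last_cpu_of() {
    # Drop "pid (comm) " first, since comm may contain spaces; field 39 is then $37.
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $37 }'
}
```

For example, pid=$(soak_cpu 3) followed a few seconds later by last_cpu_of "$pid" reports where the loop is actually running; kill "$pid" stops it. turbostat should show the same CPU as busy.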
Comment 8 Michal Privoznik 2015-01-27 09:50:42 UTC
(In reply to Rafael J. Wysocki from comment #6)
> What is the last known good kernel?

I think 3.17, although back then I had other issues (e.g. with a multihead Intel video card), so I can't tell for sure.
Comment 9 Michal Privoznik 2015-01-27 09:58:29 UTC
(In reply to Len Brown from comment #7)
> Three things look broken here.
> 

> 
> 2. turbostat lost its mind:
> 
> Can you tell me what happened when this was spit out?
> 
>     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>        -       -    1739   99.97    1740    1447       0    0.03    0.00    0.00    0.00      97      97    0.00    0.00    0.00    0.00   33.99   25.83    0.60
>        0       1    3479   99.97    3480    2893       0    0.03    0.00    0.00    0.00      97      97    0.00    0.00    0.00    0.00   33.99   25.83    0.60
>        1       3    3479   99.97    3480    2893       0    0.03    0.00    0.00    0.00      91
> /home/zippy/tmp/kernel.git/tools/power/x86/turbostat/turbostat: APERF or MPERF went backwards *
> * Frequency results do not cover entire interval *
> * fix this by running Linux-2.6.30 or later *
>     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>        -       -  478278********262867114678241955591117       0********  100.01  100.01  100.00      62      62  400.01    0.00    0.00    0.0011710.25 3348.8912839.39
>        0       1 1157470********1042217535596967847664753   -3023********  100.00  100.00  100.00      62      62  100.00    0.00    0.00    0.0011710.25 3348.8912839.39
>        1       3  755641********1066033123229967847664754   -3023********  100.00  100.00  100.00      61


I simply suspended the machine while turbostat was still running, and then resumed it (with turbostat still running, obviously).

> 
> 
> 3. After resume, you see only minimal usage on cpu3 while cpu1 is fully busy,
> yet you expect both CPUs to be fully busy.
> 
>     Core     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
>        -       -     908   50.84    1787    1447       0    7.69    0.58    0.20   40.69      78      78    0.00    0.00    0.00    0.00   21.75   13.54    0.15
>        0       1    3556   99.48    3575    2893       0    0.52    0.00    0.00    0.00      78      78    0.00    0.00    0.00    0.00   21.75   13.54    0.15
>        1       3      77    2.20    3492    2893       0   14.85    1.16    0.40   81.39      65
> 
> Note, however, that since turbostat gave you output for cpu3, it is possible
> for a thread to run on cpu3...
> 
> What if you add an additional cycle soaker, does it soak up cpu3?
> 
> # cat /dev/zero > /dev/null
> 
> What if you run that in background and bind it to cpu3 using taskset(1)?
> can you make it busy that way?

Unfortunately no.
Comment 10 Michal Privoznik 2015-01-28 14:37:10 UTC
One more thing I've noticed: I don't even need to go through a suspend & resume cycle. It's sufficient to offline all the CPUs but CPU0 and bring them online again.
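
That offline/online cycle can be sketched as a small loop over the per-CPU sysfs control files. This is an illustration only: cycle_cpus is a hypothetical name, the optional root argument exists just so the loop can be tried against a scratch directory, and on a real system it has to run as root:

```sh
# Offline every CPU except cpu0, then bring them all back online.
# cycle_cpus is a hypothetical name; the optional argument (default /sys)
# only exists so the loop can be exercised against a scratch directory.
cycle_cpus() {
    root="${1:-/sys}"
    for state in 0 1; do
        # cpu[1-9]* matches cpu1, cpu2, ..., cpu10, ... but never cpu0.
        for ctl in "$root"/devices/system/cpu/cpu[1-9]*/online; do
            [ -w "$ctl" ] || continue
            echo "$state" > "$ctl"
        done
    done
}
```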
Comment 11 Michal Privoznik 2015-01-29 09:37:25 UTC
I've found a Gentoo bug describing the same symptoms:

https://bugs.gentoo.org/show_bug.cgi?id=537834
Comment 12 Martin Kletzander 2015-02-16 13:57:48 UTC
Having the same problems, I poked around a bit and found out that this does not happen in a virtual machine in which a DMA32 zone exists on all NUMA nodes. I'm talking about the scheduling here; I don't know about the C1E auto-promotion for now. So the problem might be that the scheduler dispatches only to nodes with a DMA32 zone. I hope that helps a bit; let me know if I can help in any way.
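
One way to see which nodes have a DMA32 zone is to read the "Node N, zone NAME" header lines of /proc/zoneinfo. A minimal sketch; nodes_with_dma32 is a hypothetical helper name and the optional file argument is only there for testing:

```sh
# Print the NUMA node numbers that have a DMA32 zone, one per line.
# The optional argument (default /proc/zoneinfo) is there for testing.
nodes_with_dma32() {
    awk '$1 == "Node" && $3 == "zone" && $4 == "DMA32" { sub(",", "", $2); print $2 }' \
        "${1:-/proc/zoneinfo}"
}
```

Some kernels also list zones that contain no pages, so the "present" count in the same file is worth checking before concluding that a node really has DMA32 memory.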
Comment 13 Len Brown 2015-03-24 00:56:29 UTC
Like csets, fake numa nodes set up in user-space simply
don't survive the processor offline/online process,
because there is no state associated with offline processors.

suspending the system uses processor offline/online.

so...
if you want this user config to survive, you need to
save/restore it before/after suspend/resume.

can you verify that manually invoking processor offline/online
also causes this problem?

eg.
# echo 0 > /sys/devices/system/cpu/cpuN/online
# echo 1 > /sys/devices/system/cpu/cpuN/online


why is this marked as a regression -- did this work before?
Comment 14 Martin Kletzander 2015-03-24 06:30:19 UTC
Created attachment 171911
turbostat output with manual online->offline->online for cpus
Comment 15 Martin Kletzander 2015-03-24 06:35:13 UTC
(In reply to Len Brown from comment #13)
> Like csets, fake numa nodes set up in user-space simply
> don't survive the processor offline/online process,
> because there is no state associated with offline processors.
> 

By user-space, you mean set up in software (the kernel) rather than in hardware, right?

> suspending the system uses processor offline/online.
> 
> so...
> if you want this user config to survive, you need to
> save/restore it before/after suspend/resume.
> 

I'm not sure what user config you are talking about; the only thing I'm setting up is the kernel command line.

> can you verify that manually invoking processor offline/online
> also causes this problem?
> 
> eg.
> # echo 0 > /sys/devices/system/cpu/cpuN/online
> # echo 1 > /sys/devices/system/cpu/cpuN/online
> 

Yes, as Michal wrote in comment #10, this causes the same issue. I reproduced it by manually switching the CPUs off and on while running turbostat; see the attachment added in comment #14.

> 
> why is this marked as a regression -- did this work before?

Yes, this worked with 3.17.* for me.
Comment 16 Martin Kletzander 2015-03-24 07:36:52 UTC
The data in the attachment from comment #14 was collected with 3.19.2. I am now testing 4.0.0-rc4, specifically tag next-20150323, and the bug is still there. The only difference is that the load now stays on NUMA node 3 (cpus 3 and 7) rather than node 0 (cpus 0 and 4), which is even weirder.
Comment 17 Rafael J. Wysocki 2015-04-02 00:32:20 UTC
OK, so this is not a power management bug, it is a CPU online/offline problem.

You should be able to use bisection to identify the exact kernel commit that broke things for you.  Failing that, I'm afraid it will be difficult to isolate it.
Comment 18 Martin Kletzander 2015-04-07 08:53:05 UTC
Created attachment 173281
Partial git bisect log

Well, without knowing which changes to look at, isolating this starts with "roughly 13 steps", each of which is a rebuild plus a reboot (plus another reboot if that commit is bad). This will probably take a long time, so I have only just started. In case someone wants to continue, here are my last good/bad commits:

bad:  dfe2c6dcc8ca2cdc662d7c0473e9811b72ef3370
good: 5e40d331bd72447197f26525f21711c4a265b6a6

output of 'git bisect log' in attachment.
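
For anyone picking this up, a sketch of how the search could be resumed and how the step count relates to the size of the range. Both function names are hypothetical; the step estimate is simply the smallest k with 2^k >= n, so git's own "roughly N steps" figure may differ by one:

```sh
# Estimate how many bisection steps remain for a range of n commits:
# the smallest k with 2^k >= n. bisect_steps is a hypothetical helper.
bisect_steps() {
    n="$1"; k=0; span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2)); k=$((k + 1))
    done
    echo "$k"
}

# Resume the search from the commits quoted above (run inside a kernel tree).
# "git bisect start <bad> <good>" marks both ends in one command.
resume_bisect() {
    git bisect start dfe2c6dcc8ca2cdc662d7c0473e9811b72ef3370 \
                     5e40d331bd72447197f26525f21711c4a265b6a6
}
```

After each rebuild and reboot, the tested commit is marked with git bisect good or git bisect bad until git names the first bad commit.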
Comment 19 Martin Kletzander 2015-04-07 14:07:53 UTC
After 8 hours spent bisecting this, the first bad commit is ... wait for it ... yes, the merge commit 9d9420f1209a1facea7110d549ac695f5aeeb503.
Comment 20 Michal Privoznik 2015-04-21 14:00:49 UTC
@Len, can you please provide an update?
Comment 21 Martin Kletzander 2015-06-26 05:58:32 UTC
Disregard my comment #19. I re-tested (rebased branches on top of each other, sorted out the configs and started bisecting again) and found out that I had made a mistake in the process. I'm on the right track now and will have the results soon. Based on the result, I'll try to work out what the fix could be.
Comment 22 Martin Kletzander 2015-06-26 08:51:38 UTC
So the scheduling was broken by the following commit:

commit cebf15eb09a2fd2fa73ee4faa9c4d2f813cf0f09
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Thu Sep 18 12:33:34 2014 -0700

    x86, sched: Add new topology for multi-NUMA-node CPUs


However, I couldn't find anything obviously wrong there apart from a typo in one line, which is fixed in the very next commit:

commit 728e5653e6fdb2a0892e94a600aef8c9a036c7eb
Author: Dave Hansen <dave.hansen@linux.intel.com>
Date:   Tue Sep 30 14:45:46 2014 -0700

    sched/x86: Fix up typo in topology detection


which, unfortunately, doesn't fix the problem.
Comment 23 Martin Kletzander 2015-07-08 08:38:43 UTC
Created attachment 182161
Workaround patch fixing the problem

OK, so I removed the lines that decide how to schedule the current task. This fixes the problem, although I understand that it is not the right thing to do and that the real problem is somewhere else. Let me know if I can help in any way, or point me to where I should continue looking, as I'm not a kernel developer. Using fake NUMA nodes is vital for me for testing and developing other software.
Comment 24 Martin Kletzander 2016-11-16 14:13:57 UTC
Looks like this might be fixed by the following commit; I have not tried it yet:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=8f37961cf22304fb286c7604d3a7f6104dcc1283
