Bug 196569

Summary: Bad rss-counter state in pgtable-generic on large-memory, multi-socket machines
Product: Memory Management Reporter: Charles Allen (charles.allen)
Component: Page Allocator Assignee: Andrew Morton (akpm)
Status: NEW ---    
Severity: normal CC: charles.allen
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.9.24 Subsystem:
Regression: No Bisected commit-id:

Description Charles Allen 2017-08-02 15:00:27 UTC
Aug 02 13:37:55 ip-172-19-15-116 kernel: ../source/mm/pgtable-generic.c:33: bad pmd ffff8bb32d6d6200(0000000fcda001e0)
Aug 02 13:43:10 ip-172-19-15-116 kernel: BUG: Bad rss-counter state mm:ffff8bbe9ba3dc00 idx:1 val:512
Aug 02 13:43:10 ip-172-19-15-116 kernel: BUG: non-zero nr_ptes on freeing mm: 1


The above bug shows up regularly on some virtualized machines with 4 NUMA nodes (x1.32xlarge in AWS) running CoreOS.

These machines have 4 sockets with approximately 2TiB RAM.

The workload is a combination of Spark and Druid, both of which run under memory and CPU cgroup isolation.

We haven't been able to find a specific test that reproduces this in an artificial environment, but we see it with alarming regularity under production workloads.

We've tried numerous sysctl tunings around NUMA settings but cannot seem to avoid this bug.

When using half-sized machines (x1.16xlarge) with the same workloads, we have not seen this show up.
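
To track how often this occurs across hosts, the three kernel messages quoted above can be picked out of a log stream with a small script. This is only a sketch: the function names `scan_kernel_log` and `rss_val_to_kb` are hypothetical, the regular expressions are written against the exact lines in this report, and the 4 KiB base page size is an assumption about these x86-64 hosts.

```python
import re

# Regular expressions for the three kernel messages quoted in this report.
PATTERNS = {
    "bad_pmd": re.compile(
        r"bad pmd (?P<addr>[0-9a-f]+)\((?P<value>[0-9a-f]+)\)"
    ),
    "bad_rss": re.compile(
        r"BUG: Bad rss-counter state mm:(?P<mm>[0-9a-f]+) "
        r"idx:(?P<idx>\d+) val:(?P<val>-?\d+)"
    ),
    "nr_ptes": re.compile(
        r"BUG: non-zero nr_ptes on freeing mm: (?P<count>-?\d+)"
    ),
}

PAGE_SIZE_KB = 4  # assumption: 4 KiB base pages on these hosts


def scan_kernel_log(lines):
    """Return a list of (kind, fields) tuples, one per matching log line."""
    events = []
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((kind, match.groupdict()))
                break  # a line matches at most one of the patterns
    return events


def rss_val_to_kb(val, page_size_kb=PAGE_SIZE_KB):
    """Convert an rss-counter value (a page count) to KiB."""
    return int(val) * page_size_kb
```

Under the 4 KiB assumption, val:512 works out to 2048 kB, which equals the Hugepagesize value in /proc/meminfo on these hosts. That may hint that a single transparent huge page was miscounted, but nothing in the logs confirms it.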
Comment 1 Charles Allen 2017-08-02 15:15:42 UTC
$ cat /proc/buddyinfo
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone    DMA32     71     56     53     55     54     24     10      6      6      3    472
Node 0, zone   Normal 5721269 4547407 783814  47549   3829    818    200     95     10      0      0
Node 1, zone   Normal 2793102 4831801 2898624 486040  78417  25929  14492   7374   2941      0      0
Node 2, zone   Normal 3004007 4822788 1463551 143127  27984  12282   4765   1915     48      0      0
Node 3, zone   Normal 6013297 3858190 2328605 228874  22890   7524   3784   2369   1164      1      0
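
For comparing nodes, each buddyinfo column is the number of free blocks of order 0, 1, 2, ... (a block of order n holds 2^n pages), so the free pages per zone can be summed as in the sketch below. The helper names `parse_buddyinfo` and `free_pages` are hypothetical, and the 4 KiB page size is an assumption.

```python
import re

PAGE_SIZE_KB = 4  # assumption: 4 KiB base pages

# Matches lines such as "Node 0, zone   Normal 5721269 4547407 ..."
LINE_RE = re.compile(r"Node\s+(\d+),\s+zone\s+(\S+)\s+(.+)")


def parse_buddyinfo(text):
    """Map (node, zone) to the list of free-block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        counts = [int(field) for field in match.group(3).split()]
        zones[(int(match.group(1)), match.group(2))] = counts
    return zones


def free_pages(counts):
    """Total free pages: each block of order n contributes 2**n pages."""
    return sum(count << order for order, count in enumerate(counts))


def free_kb(counts, page_size_kb=PAGE_SIZE_KB):
    """Total free memory in KiB for one zone."""
    return free_pages(counts) * page_size_kb
```

In the output above, the two highest orders are empty in the Normal zone of nodes 0 to 2, and node 3 has a single free block at order 9, so free memory there is made up almost entirely of smaller blocks.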


$ cat /proc/meminfo
MemTotal:       2014741740 kB
MemFree:        392410636 kB
MemAvailable:   1334517536 kB
Buffers:           33108 kB
Cached:         941241888 kB
SwapCached:            0 kB
Active:         1436430984 kB
Inactive:       164888628 kB
Active(anon):   660049028 kB
Inactive(anon):     1700 kB
Active(file):   776381956 kB
Inactive(file): 164886928 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:           2773500 kB
Writeback:             0 kB
AnonPages:      660029896 kB
Mapped:         478195376 kB
Shmem:              1880 kB
Slab:           13393720 kB
SReclaimable:   11277508 kB
SUnreclaim:      2116212 kB
KernelStack:      298288 kB
PageTables:      3131392 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    1007370868 kB
Committed_AS:   754887872 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:  182294528 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      280576 kB
DirectMap2M:    12433408 kB
DirectMap1G:    2036334592 kB
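
Since AnonHugePages is non-zero here, it may be useful to see how much of the anonymous memory is backed by transparent huge pages. The sketch below parses the meminfo text and computes that share; `parse_meminfo` and `thp_share` are hypothetical helper names, not part of any existing tool.

```python
import re

# Matches "Key:   12345 kB" as well as unit-less lines like "HugePages_Total: 0".
MEMINFO_RE = re.compile(r"^(\S+):\s+(\d+)(?:\s+kB)?$")


def parse_meminfo(text):
    """Return a dict of meminfo fields mapped to their integer values."""
    values = {}
    for line in text.splitlines():
        match = MEMINFO_RE.match(line.strip())
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def thp_share(values):
    """Fraction of AnonPages that is accounted as AnonHugePages."""
    anon = values.get("AnonPages", 0)
    if anon == 0:
        return 0.0
    return values.get("AnonHugePages", 0) / anon
```

On a live host this could be fed with the contents of /proc/meminfo, for example `thp_share(parse_meminfo(open("/proc/meminfo").read()))`.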
Comment 2 Andrew Morton 2017-08-02 22:54:59 UTC
That's a pretty old kernel.  Have you googled for `linux "Bad rss-counter state"' and checked that your kernel has the various fixes which are mentioned there?
Comment 3 Charles Allen 2017-08-04 22:18:33 UTC
We tested 4.11.6-coreos-r1. I thought we had encountered this error on that version as well, but I don't seem to have an explicit log for it.

We have not tested any other patches to prevent this error.