Bug 119581
| Summary: | IPv6 Netlink problem, memory leak and "Cannot allocate memory" errors | | |
|---|---|---|---|
| Product: | Networking | Reporter: | Petr Novopashenniy (pety) |
| Component: | IPV6 | Assignee: | Hideaki YOSHIFUJI (yoshfuji) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | daniel, hannes, kafai, pety |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.4.11 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | Fix mem leak in rt6i_pcpu | | |
Description
Petr Novopashenniy
2016-06-03 11:23:49 UTC
> Are you saying that you could deplete your whole memory in the box? Did it
> recover if you killed zebra or flushed the routing tables? Is it a permanent
> error or a temporary error? I was not completely sure from your description.

Yes, I can deplete the whole memory in the box (the zebra process ends up in `D` state and is not accessible from the telnet interface, 300-400 MB of swap used). Memory is not released if I kill zebra/bgpd or if I flush the IPv6 routing table; only the zebra/bgpd memory is released after the kill. This is a permanent error.

Thanks, Petr. Trying to reproduce. Would it be possible for you to test a 4.6.1 kernel, or even a 4.7-rc1 one, if you don't mind? In the meantime I will look at how I can reproduce this myself. Could you maybe also paste a cat /proc/slabinfo after booting with the kernel command line slab_nomerge? That would help a lot, thanks!

Petr, can you also do 'cat /proc/net/rt6_stats >> rt6_stats.log' in your script? The file has the number of routing entries in it. I am interested to see if the numbers in the rt6_stats file increase gradually. Thanks!

A quick update. I cannot reproduce with a simple script that keeps adding and deleting thousands of routes in a loop, even after running it for a few hours. I will try harder.

Having a zebra daemon running with 31'000 static routes while switching the nexthop for those routes between two interfaces doesn't show any kind of memory depletion for me. I will leave the script running over the night. Anything special in quagga that could trigger this?

Does this router at the same time handle forwarding traffic?

Okay, I could reproduce it. The trace looks somewhat random; it happened during pcpu_balance:

[13371.558844] [<ffffffff8144ba13>] dump_stack+0x85/0xc2
[13371.558848] [<ffffffff81289ca4>] dump_header+0x5b/0x227
[13371.558852] [<ffffffff8110d74d>] ? trace_hardirqs_on+0xd/0x10
[13371.558854] [<ffffffff811f16a0>] oom_kill_process+0x200/0x3e0
[13371.558855] [<ffffffff811f1aea>] out_of_memory+0x21a/0x5c0
[13371.558856] [<ffffffff811f1ba8>] ? out_of_memory+0x2d8/0x5c0
[13371.558858] [<ffffffff811f8e9b>] __alloc_pages_nodemask+0xd5b/0xec0
[13371.558860] [<ffffffff81219498>] pcpu_populate_chunk+0xc8/0x3b0
[13371.558861] [<ffffffff8121a389>] pcpu_balance_workfn+0x559/0x5a0
[13371.558864] [<ffffffff810d0192>] ? process_one_work+0x1a2/0x720
[13371.558865] [<ffffffff810d021c>] process_one_work+0x22c/0x720
[13371.558866] [<ffffffff810d0192>] ? process_one_work+0x1a2/0x720
[13371.558868] [<ffffffff810d075e>] worker_thread+0x4e/0x490
[13371.558869] [<ffffffff810d0710>] ? process_one_work+0x720/0x720
[13371.558870] [<ffffffff810d0710>] ? process_one_work+0x720/0x720
[13371.558872] [<ffffffff810d7a61>] kthread+0x101/0x120
[13371.558873] [<ffffffff8110d6b9>] ? trace_hardirqs_on_caller+0x129/0x1b0
[13371.558874] [<ffffffff810d7960>] ? kthread_create_on_node+0x250/0x250
[13371.558878] [<ffffffff818b61df>] ret_from_fork+0x3f/0x70
[13371.558879] [<ffffffff810d7960>] ? kthread_create_on_node+0x250/0x250

(In reply to hannes from comment #8)
> Okay, I could reproduce it. The trace looks somewhat random; it happened
> during pcpu_balance.

Maybe this is related to the recent use-after-free in the pcpu subsystem? The first two here are not yet upstream:

https://git.kernel.org/cgit/linux/kernel/git/tj/percpu.git/log/?h=for-4.7-fixes

(In reply to Daniel Borkmann from comment #9)
> Maybe this is related to the recent use-after-free in the pcpu subsystem?
> The first two here are not yet upstream:
>
> https://git.kernel.org/cgit/linux/kernel/git/tj/percpu.git/log/?h=for-4.7-fixes

This was, by the way, triggered through KASAN from the eBPF side when large pcpu array maps were allocated and freed again. In those maps each slot has a pcpu entry, so one pcpu alloc per entry, thus constantly stressing the pcpu allocator. Maybe the situation here is similar with regard to pcpu allocations?

I don't think so. It looks more like a memory leak and not like a use-after-free.

(In reply to Martin KaFai Lau from comment #4)
> Petr, can you also do 'cat /proc/net/rt6_stats >> rt6_stats.log' in your
> script? The file has the number of routing entries in it.
> I am interested to see if the numbers in the rt6_stats file increase
> gradually.

Yes, after ~400 seconds of running (one cat per loop iteration): http://ftp.neva.ru/rt6_stats.log

Is that enough?

(In reply to hannes from comment #7)
> Does this router at the same time handle forwarding traffic?

RouterA (in my example) really does forward traffic; my test box does not.

Hi Petr,

thanks for the log. It seems we cannot pin down the problem with that alone. Can you also provide a full splat (the dmesg output at OOM time)?

Thanks,
Hannes

(In reply to hannes from comment #14)
> Can you also provide a full splat (the dmesg output at OOM time)?

Ok, I am now testing 4.7.0-rc2 without swap space and with RAM limited to 1024M (via mem=1024M on the kernel command line), and waiting for the OOM.

(In reply to hannes from comment #14)

Before the OOM the system is very slow:

# time uptime
 15:47:43 up 16 min, 2 users, load average: 4.57, 2.13, 0.83

real    0m14.206s
user    0m0.006s
sys     0m0.241s
#

And this is dmesg after the OOM: http://ftp.neva.ru/Tests/dmesg_aftrer_OOM

/proc/slabinfo after boot with slab_nomerge and after the BGP session is established: http://ftp.neva.ru/Tests/slab_after_boot1.txt

/proc/slabinfo after the OOM: http://ftp.neva.ru/Tests/slab1.txt

PS: I also have an IPv4 full view on this box, but I am testing only IPv6.

At first glance, this bug was introduced in 4.2.0-rc8; 4.2.0-rc7 runs without the memory leak and the "Cannot allocate memory" errors.

Hi Petr,

thank you, this is very valuable information. Did you have a chance to check a very current kernel, too? I couldn't reproduce it with the recent -rc kernel yesterday; the test is still running. Do we see the same results?

Thanks,
Hannes

Hi Hannes!

My last test was on the latest rc kernel (4.7.0-rc2, with kernel config http://ftp.neva.ru/Tests/config-4.7.0-rc2). This is the dmesg output: http://ftp.neva.ru/Tests/dmesg_aftrer_OOM

Petr

In 4.7-rc4 this bug still exists.

(In reply to Martin KaFai Lau from comment #5)
> A quick update. I cannot reproduce with a simple script that keeps adding
> and deleting thousands of routes in a loop, even after running it for a few
> hours. I will try harder.

You can simply add and delete one static route in a loop, like this (..4:1 is my default gw):

while true ; do
    ip -6 route add 2001:b09::/32 via 2001:b08:2:280::4:1 dev eth0
    ip -6 route del 2001:b09::/32 via 2001:b08:2:280::4:1 dev eth0
done

and wait 1-2 or more days (if you use a C program with the same functionality, you can save time). With an X86_64 kernel you lose all memory (and sometimes see "RTNETLINK answers: Cannot allocate memory" before you run out of memory). With an X86 32-bit kernel you lose a little memory, but you can see ENOMEM and ENOBUFS errors:

Before test:
# free
              total        used        free      shared  buff/cache   available
Mem:        2048696       51432     1751952        5496      245312     1835492
Swap:       4120636           0     4120636

After test:
# free
              total        used        free      shared  buff/cache   available
Mem:        2048696      125056     1674736        5496      248904     1761208
Swap:       4120636           0     4120636

# ip -6 route add 2000:b08:3::/48 via 2001:B08:2:100::1 dev eth0
RTNETLINK answers: Cannot allocate memory
# ip route add 10.0.0.0/24 via 195.208.113.126 dev eth0
RTNETLINK answers: No buffer space available
#

--pety
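A minimal C sketch of the kind of faster reproducer Petr alludes to above: a raw rtnetlink add/del loop instead of forking `ip` twice per iteration. This is not code from the report; the interface name (eth0), prefix (2001:b09::/32) and gateway are taken from the shell loop above, and the ack handling simply prints whatever netlink error (e.g. ENOMEM) comes back:

```c
/*
 * Hypothetical reproducer sketch (not taken from this report): add and
 * delete one IPv6 route in a tight loop over raw rtnetlink. Interface,
 * prefix and gateway match the shell example and need adjusting elsewhere.
 */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

struct req {
    struct nlmsghdr nh;
    struct rtmsg    rt;
    char            attrs[128];
};

/* Append one rtattr to the request and bump nlmsg_len accordingly. */
static void add_attr(struct nlmsghdr *nh, unsigned short type,
                     const void *data, int len)
{
    struct rtattr *rta = (struct rtattr *)((char *)nh + NLMSG_ALIGN(nh->nlmsg_len));

    rta->rta_type = type;
    rta->rta_len  = RTA_LENGTH(len);
    memcpy(RTA_DATA(rta), data, len);
    nh->nlmsg_len = NLMSG_ALIGN(nh->nlmsg_len) + RTA_ALIGN(rta->rta_len);
}

int main(void)
{
    struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    unsigned int oif = if_nametoindex("eth0");          /* assumed interface */
    struct in6_addr dst, gw;

    if (fd < 0 || oif == 0) {
        perror("socket/if_nametoindex");
        return 1;
    }
    inet_pton(AF_INET6, "2001:b09::", &dst);            /* 2001:b09::/32 */
    inet_pton(AF_INET6, "2001:b08:2:280::4:1", &gw);    /* next hop */

    for (;;) {
        for (int del = 0; del < 2; del++) {
            struct req r = { 0 };
            char buf[4096];

            r.nh.nlmsg_len   = NLMSG_LENGTH(sizeof(r.rt));
            r.nh.nlmsg_type  = del ? RTM_DELROUTE : RTM_NEWROUTE;
            r.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK |
                               (del ? 0 : NLM_F_CREATE | NLM_F_EXCL);
            r.rt.rtm_family   = AF_INET6;
            r.rt.rtm_dst_len  = 32;
            r.rt.rtm_table    = RT_TABLE_MAIN;
            r.rt.rtm_protocol = RTPROT_STATIC;
            r.rt.rtm_scope    = RT_SCOPE_UNIVERSE;
            r.rt.rtm_type     = RTN_UNICAST;
            add_attr(&r.nh, RTA_DST, &dst, sizeof(dst));
            add_attr(&r.nh, RTA_GATEWAY, &gw, sizeof(gw));
            add_attr(&r.nh, RTA_OIF, &oif, sizeof(oif));

            if (sendto(fd, &r, r.nh.nlmsg_len, 0,
                       (struct sockaddr *)&kernel, sizeof(kernel)) < 0) {
                perror("sendto");
                return 1;
            }
            /* Read the ack so errors such as ENOMEM become visible. */
            if (recv(fd, buf, sizeof(buf), 0) > 0) {
                struct nlmsghdr *ack = (struct nlmsghdr *)buf;

                if (ack->nlmsg_type == NLMSG_ERROR) {
                    struct nlmsgerr *e = NLMSG_DATA(ack);

                    if (e->error)
                        fprintf(stderr, "%s: %s\n", del ? "del" : "add",
                                strerror(-e->error));
                }
            }
        }
    }
}
```

Build with something like `gcc -O2 -o rt6loop rt6loop.c` (the file name is arbitrary) and run it as root next to `free` or the /proc/vmallocinfo diffing shown later in the thread.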
Hi Petr,

ip -6 r add and del was the first test I did before going into zebra/quagga. I did add a few thousand and then delete a few thousand in a loop with more than one thread, but could not reproduce after a few days. I can retry with one add and one del exactly as you do.

By the way, did you have a chance to capture the slabinfo in your latest test?

Thanks,
-- Martin

(In reply to Martin KaFai Lau from comment #22)
> By the way, did you have a chance to capture the slabinfo in your latest
> test?

Hi, Martin!

Yes, this is a 4.4.13 32-bit kernel; slabinfo after the test: http://ftp.neva.ru/Tests/slab_bereza1.txt

Can you give me your kernel version and kernel config? (I reproduce this bug on 4.4.* and 4.7-rc* (I have not tested 4.5.* and 4.6.*) simply by adding and removing a static route, without using quagga/zebra.) If you're interested, I can give you ssh access to my x86_64 test box.

--pety

Hi Petr,

your slabinfo indicates you are using ~100MB of slab; that shouldn't cause an OOM in any case.

How many CPUs does your system have? Is this a non-SMP kernel build? How much memory does the system have?

Thanks,
Hannes

And also please a dump of /proc/vmallocinfo after the OOM situation has happened. Thanks!

(In reply to hannes from comment #24)
> your slabinfo indicates you are using ~100MB of slab; that shouldn't cause
> an OOM in any case.

Hi! Yes, on the 32-bit system I have no OOM, but ENOMEM and ENOBUFS errors with netlink:

# ip -6 route add 2000:b08:3::/48 via 2001:B08:2:100::1 dev eth0
RTNETLINK answers: Cannot allocate memory
# ip route add 10.0.0.0/24 via 195.208.113.126 dev eth0
RTNETLINK answers: No buffer space available
#

I have OOM only on X86_64 (64-bit) systems.

> How many CPUs does your system have? Is this a non-SMP kernel build? How
> much memory does the system have?

1. x86 (32-bit) system: SMP kernel, "Intel(R) Core(TM)2 CPU 4300 @ 1.80GHz" (2 cores), 2GB RAM
2. x86_64 (64-bit) system: SMP kernel, "Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz" (4 cores), 4GB RAM
3. x86_64 (64-bit) system: SMP kernel, "Intel(R) Xeon(R) CPU E5607 @ 2.27GHz" (8 cores), 48GB RAM

(In reply to hannes from comment #25)
> And also please a dump of /proc/vmallocinfo after the OOM situation has
> happened. Thanks!

http://ftp.neva.ru/Tests/vmallocinfo1.txt

--pety

(In reply to Petr Novopashenniy from comment #27)
> http://ftp.neva.ru/Tests/vmallocinfo1.txt

Clarification: in this case, for quicker testing, I set the memory size to only 1G.

--pety

(In reply to hannes from comment #25)
> And also please a dump of /proc/vmallocinfo after the OOM situation has
> happened. Thanks!
This is an example vmallocinfo diff while the test is running (20-30 seconds between the two vmallocinfo snapshots):

53d52
< 0xffffc90000b21000-0xffffc90000b62000  266240 pcpu_mem_zalloc+0x2b/0x80 pages=64 vmalloc
58a58,60
> 0xffffc90001278000-0xffffc900012f9000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
> 0xffffc9000135b000-0xffffc900013dc000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
> 0xffffc90001458000-0xffffc900014d9000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
62a65
> 0xffffc9000176e000-0xffffc900017ef000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
65a69
> 0xffffc90001b44000-0xffffc90001bc5000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
68a73
> 0xffffc90001f44000-0xffffc90001fc5000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
71a77
> 0xffffc90002644000-0xffffc90002685000  266240 pcpu_mem_zalloc+0x2b/0x80 pages=64 vmalloc
72a79
> 0xffffc90002741000-0xffffc900027c2000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
109a117,123
> 0xffffe8fff9600000-0xffffe8fff9800000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fff9800000-0xffffe8fff9a00000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fff9a00000-0xffffe8fff9c00000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fff9c00000-0xffffe8fff9e00000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fff9e00000-0xffffe8fffa000000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fffa000000-0xffffe8fffa200000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fffa200000-0xffffe8fffa400000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc

Hmm, could be a fragmentation issue. Thanks, I will also try to reproduce with limited memory.
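To make that growth easier to track over time, a small helper along the following lines can be run (as root) while the add/del loop is active. It is only an illustration, not part of the report: it sums the size column of every /proc/vmallocinfo line whose caller contains "pcpu", i.e. the pcpu_mem_zalloc and pcpu_get_vm_areas entries that keep growing in the diff above.

```c
/*
 * Illustrative helper (not part of the original report): sum the vmalloc
 * space that /proc/vmallocinfo attributes to pcpu_* callers.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/vmallocinfo", "r");
    char line[512];
    unsigned long long total = 0, size;

    if (!f) {
        perror("/proc/vmallocinfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        /* Lines look like: 0x...-0x... <size> <caller> pages=N vmalloc */
        if (!strstr(line, "pcpu"))
            continue;
        if (sscanf(line, "%*s %llu", &size) == 1)
            total += size;
    }
    fclose(f);
    printf("pcpu-related vmalloc: %llu bytes\n", total);
    return 0;
}
```

Sampling it every 20-30 seconds (for example with `watch`) gives the same picture as the manual diff, condensed into a single growing number.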
Created attachment 222071 [details]
Fix mem leak in rt6i_pcpu
Hi Petr,
Thanks for the info. It is very helpful.
I can reproduce what you have seen in /proc/vmallocinfo regarding the pcpu entries.
I have attached a patch with which I see good results. After this patch, I don't see increasing consumption from pcpu in /proc/vmallocinfo. It is against the current net-next branch (4.7.0-rc5) and is also applicable to the 4.4.y branch.
Can you give this patch a try?
Thanks,
-- Martin
(In reply to Martin KaFai Lau from comment #31)
> I have attached a patch with which I see good results. After this patch, I
> don't see increasing consumption from pcpu in /proc/vmallocinfo.
>
> Can you give this patch a try?

Good day!

I tested 4.4.14 (32-bit and 64-bit versions) and 4.7-rc6 with this patch, and it looks like everything works fine. Thank you very much, Martin!

--pety