This is my production network:

IPv6-primary-uplink <--> RouterA <--> RouterB <--> IPv6-backup-uplink

I run quagga 0.99.24.1 (bgpd, ospf6d and zebra) on RouterA. Both uplinks send me an IPv6 full view (~30000 routes). Routes from the primary uplink have BGP localpref 100, routes from the backup have 90. RouterA (amd64) runs a 4.4.11 kernel.

For example, when the primary uplink flaps, I see multiple (~600) messages from zebra:

netlink-cmd error: Cannot allocate memory, type=RTM_NEWROUTE(24), seq=5603215, pid=0

and after this RouterA has an incomplete/incorrect IPv6 kernel routing table.

I tried to reproduce the bug on a test linux box. I set up quagga (bgpd and zebra) and send the full IPv6 view to this box (via an IPv4 multiprotocol BGP session).

This is the ip -6 addr output for two interfaces (eth1 has a real, working global IPv6 address; eth5 has "the first available", not globally connected IPv6 address, just an IP from a different subnet with lower link up):

2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
    inet6 2001:b08:2:280::4:2/120 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::25b:70ff:fe00:752a/64 scope link
       valid_lft forever preferred_lft forever
3: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
    inet6 2001:b08:8:280::4:2/120 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::25b:70ff:fe00:752b/64 scope link
       valid_lft forever preferred_lft forever

Routing table:

2001:b08:2:280::4:0/120 dev eth1  proto kernel  metric 256  pref medium
2001:b08:8:280::4:0/120 dev eth5  proto kernel  metric 256  pref medium
fe80::/64 dev eth1  proto kernel  metric 256  pref medium
fe80::/64 dev eth5  proto kernel  metric 256  pref medium
ff00::/8 dev eth1  metric 256  pref medium
ff00::/8 dev eth5  metric 256  pref medium

The quagga BGP scan interval is 5 seconds. And I run this simple script:

# while true ; do
    ip -6 route add 2000::/3 via 2001:b08:2:280::4:1 dev eth1
    sleep 3
    ip -6 route del 2000::/3 via 2001:b08:2:280::4:1 dev eth1
    ip -6 route add 2000::/3 via 2001:b08:8:280::4:4 dev eth5
    sleep 3
    ip -6 route del 2000::/3 via 2001:b08:8:280::4:4 dev eth5
  done
#

so quagga simply switches the nexthop for all ~30000 IPv6 routes from 2001:b08:2:280::4:1 to 2001:b08:8:280::4:4 and back.

In zebra.log I sometimes see "netlink-cmd error: Cannot allocate memory, type=RTM_NEWROUTE" errors (~20-30 records at a time), and after ~10 hours the 4G of RAM is exhausted. If I run, for example, a 4.1.13 kernel on RouterA or on my test linux box, there is no such problem: no memory leak, no "Cannot allocate memory" messages.
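For reference, this is a minimal monitoring sketch that can run alongside the flap script to correlate the zebra errors with memory use (the zebra.log path and the 60-second interval are assumptions, adjust for your setup):

# while true ; do
    date >> memlog.txt
    grep MemFree /proc/meminfo >> memlog.txt
    ip -6 route show | wc -l >> memlog.txt
    grep -c 'netlink-cmd error: Cannot allocate memory' /var/log/quagga/zebra.log >> memlog.txt
    sleep 60
  done
#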
Are you saying you could deplete the whole memory of the box? Did it recover if you killed zebra or flushed the routing tables? Is it a permanent error or a temporary one? I was not completely sure from your description.
Yes, I can deplete the whole memory of the box (the zebra process ends up in `D` state and is not accessible from the telnet interface, with 300-400 MB of swap used). The memory is not released if I kill zebra/bgpd or if I flush the IPv6 routing table; only the zebra/bgpd process memory is released after the kill. This is a permanent error.
Thanks, Petr. Trying to reproduce. Would it be possible for you to test a 4.6.1 kernel, or even 4.7-rc1, if you don't mind? In the meantime I will look at how I can reproduce this myself. Could you also paste a cat /proc/slabinfo after booting with the kernel command line option slab_nomerge? That would help a lot, thanks!
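For reference, a sketch of how to set that up on a GRUB2 / Debian-style system (the grub config path and the update-grub command are assumptions; other distros set the kernel command line differently):

# sed -i 's/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="slab_nomerge /' /etc/default/grub
# update-grub
# reboot

After the reboot, confirm the option is active and capture slabinfo:

# cat /proc/cmdline
# cat /proc/slabinfo > slab_after_boot.txt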
Petr, can you also do 'cat /proc/net/rt6_stats >> rt6_stats.log' in your script? The file has the number of routing entries in it. I am interested to see if the numbers in the rt6_stats file increase gradually. Thanks!
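For example, the flap loop from the description with one snapshot per iteration (a sketch; the log file name is arbitrary):

# while true ; do
    cat /proc/net/rt6_stats >> rt6_stats.log
    ip -6 route add 2000::/3 via 2001:b08:2:280::4:1 dev eth1
    sleep 3
    ip -6 route del 2000::/3 via 2001:b08:2:280::4:1 dev eth1
    ip -6 route add 2000::/3 via 2001:b08:8:280::4:4 dev eth5
    sleep 3
    ip -6 route del 2000::/3 via 2001:b08:8:280::4:4 dev eth5
  done
#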
A quick update. I cannot reproduce with a simple script that keeps adding and deleting thousands of routes in a loop, even after running for a few hours. I will try harder.
Having a zebra daemon running with 31'000 static routes while switching the nexthop for those routes between two interfaces doesn't show any kind of memory depletion for me. I will leave the script running overnight. Is there anything special in quagga that could trigger this?
Does this router at the same time handle forwarding traffic?
Okay, I could reproduce it. The trace looks somewhat random; it happened during pcpu_balance:

[13371.558844] [<ffffffff8144ba13>] dump_stack+0x85/0xc2
[13371.558848] [<ffffffff81289ca4>] dump_header+0x5b/0x227
[13371.558852] [<ffffffff8110d74d>] ? trace_hardirqs_on+0xd/0x10
[13371.558854] [<ffffffff811f16a0>] oom_kill_process+0x200/0x3e0
[13371.558855] [<ffffffff811f1aea>] out_of_memory+0x21a/0x5c0
[13371.558856] [<ffffffff811f1ba8>] ? out_of_memory+0x2d8/0x5c0
[13371.558858] [<ffffffff811f8e9b>] __alloc_pages_nodemask+0xd5b/0xec0
[13371.558860] [<ffffffff81219498>] pcpu_populate_chunk+0xc8/0x3b0
[13371.558861] [<ffffffff8121a389>] pcpu_balance_workfn+0x559/0x5a0
[13371.558864] [<ffffffff810d0192>] ? process_one_work+0x1a2/0x720
[13371.558865] [<ffffffff810d021c>] process_one_work+0x22c/0x720
[13371.558866] [<ffffffff810d0192>] ? process_one_work+0x1a2/0x720
[13371.558868] [<ffffffff810d075e>] worker_thread+0x4e/0x490
[13371.558869] [<ffffffff810d0710>] ? process_one_work+0x720/0x720
[13371.558870] [<ffffffff810d0710>] ? process_one_work+0x720/0x720
[13371.558872] [<ffffffff810d7a61>] kthread+0x101/0x120
[13371.558873] [<ffffffff8110d6b9>] ? trace_hardirqs_on_caller+0x129/0x1b0
[13371.558874] [<ffffffff810d7960>] ? kthread_create_on_node+0x250/0x250
[13371.558878] [<ffffffff818b61df>] ret_from_fork+0x3f/0x70
[13371.558879] [<ffffffff810d7960>] ? kthread_create_on_node+0x250/0x250
(In reply to hannes from comment #8)
> Okay, I could reproduce it. The trace looks somewhat random; it happened
> during pcpu_balance:
>
> [13371.558844] [<ffffffff8144ba13>] dump_stack+0x85/0xc2
> [...]

Maybe this is related to the recent use-after-free in the pcpu subsystem?

The first two commits here are not yet upstream:

https://git.kernel.org/cgit/linux/kernel/git/tj/percpu.git/log/?h=for-4.7-fixes
(In reply to Daniel Borkmann from comment #9)
> Maybe this is related to the recent use-after-free in the pcpu subsystem?
>
> The first two commits here are not yet upstream:
>
> https://git.kernel.org/cgit/linux/kernel/git/tj/percpu.git/log/?h=for-4.7-fixes

This was, by the way, triggered through KASAN from the eBPF side when large per-cpu array maps were allocated and freed again. In those maps each slot has a pcpu entry, so there is one pcpu alloc per entry, which constantly stresses the pcpu allocator. Maybe the situation here is similar with regard to pcpu allocations?
I don't think so. It looks more like a memory leak and not like a use-after-free.
(In reply to Martin KaFai Lau from comment #4)
> Petr, can you also do 'cat /proc/net/rt6_stats >> rt6_stats.log' in your
> script? The file has the number of routing entries in it. I am interested
> to see if the numbers in the rt6_stats file increase gradually.
>
> Thanks!

Yes, here it is after ~400 seconds of running (one cat per loop iteration):

http://ftp.neva.ru/rt6_stats.log

Is that enough?
(In reply to hannes from comment #7)
> Does this router at the same time handle forwarding traffic?

RouterA (in my example) really does forward traffic; my test box does not.
Hi Petr,

thanks for the log. It seems we cannot pin down the problem with that alone. Can you also provide a full splat (the dmesg output from the OOM)?

Thanks,
Hannes
(In reply to hannes from comment #14)
> Hi Petr,
>
> thanks for the log. It seems we cannot pin down the problem with that
> alone. Can you also provide a full splat (the dmesg output from the OOM)?
>
> Thanks,
> Hannes

OK, I am now testing 4.7.0-rc2 without swap space and with RAM limited to 1024M (via mem=1024M on the kernel command line), and waiting for the OOM.
(In reply to hannes from comment #14)
> Hi Petr,
>
> thanks for the log. It seems we cannot pin down the problem with that
> alone. Can you also provide a full splat (the dmesg output from the OOM)?
>
> Thanks,
> Hannes

Before the OOM the system is very slow:

# time uptime
 15:47:43 up 16 min,  2 users,  load average: 4.57, 2.13, 0.83

real    0m14.206s
user    0m0.006s
sys     0m0.241s
#

And this is dmesg after the OOM:

http://ftp.neva.ru/Tests/dmesg_aftrer_OOM

/proc/slabinfo after boot with slab_nomerge and after the BGP session is established:

http://ftp.neva.ru/Tests/slab_after_boot1.txt

/proc/slabinfo after the OOM:

http://ftp.neva.ru/Tests/slab1.txt

PS: I also have an IPv4 full view on this box, but I test only IPv6.
At first glance, this bug was introduced in 4.2.0-rc8: 4.2.0-rc7 runs without the memory leak and without the "Cannot allocate memory" errors.
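If it helps, this window could probably be narrowed to a single commit with git bisect (a sketch; each step needs a kernel build, install and reboot, and the reproducer has to run long enough to show the leak):

# git bisect start v4.2-rc8 v4.2-rc7

then, for each step, boot the bisect kernel, run the reproducer for a while, and mark the result with "git bisect good" (memory stays flat, no "Cannot allocate memory") or "git bisect bad" (pcpu/vmalloc usage keeps growing or ENOMEM appears), repeating until git bisect names the first bad commit.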
Hi Petr,

thank you, this is very valuable information. Did you have a chance to check a very recent kernel, too? I couldn't reproduce it with the recent -rc kernel yesterday; the test is still running. Do we see the same results?

Thanks,
Hannes
Hi Hannes!

My last test was on the latest rc kernel (4.7.0-rc2, with kernel config http://ftp.neva.ru/Tests/config-4.7.0-rc2). This is the dmesg output:

http://ftp.neva.ru/Tests/dmesg_aftrer_OOM

Petr
In 4.7-rc4 this bug still exists.
(In reply to Martin KaFai Lau from comment #5)
> A quick update. I cannot reproduce with a simple script that keeps adding
> and deleting thousands of routes in a loop, even after running for a few
> hours. I will try harder.

You can simply add and delete one static route in a loop, like this (..4:1 is my default gw):

while true ; do
  ip -6 route add 2001:b09::/32 via 2001:b08:2:280::4:1 dev eth0
  ip -6 route del 2001:b09::/32 via 2001:b08:2:280::4:1 dev eth0
done

and wait 1-2 or more days (if you use a C program with the same functionality, you can save time). With an x86_64 kernel you lose all memory (and sometimes see "RTNETLINK answers: Cannot allocate memory" before you run out of memory). With an x86 32-bit kernel you lose only a little memory, but you can see ENOMEM and ENOBUFS errors.

Before the test:

# free
              total        used        free      shared  buff/cache   available
Mem:        2048696       51432     1751952        5496      245312     1835492
Swap:       4120636           0     4120636

After the test:

# free
              total        used        free      shared  buff/cache   available
Mem:        2048696      125056     1674736        5496      248904     1761208
Swap:       4120636           0     4120636

# ip -6 route add 2000:b08:3::/48 via 2001:B08:2:100::1 dev eth0
RTNETLINK answers: Cannot allocate memory
# ip route add 10.0.0.0/24 via 195.208.113.126 dev eth0
RTNETLINK answers: No buffer space available
#

--pety
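PS: If the plain shell loop is too slow, one alternative to a C program is ip's batch mode, so a single ip process issues many netlink requests per pass (a sketch; the repetition count is arbitrary and the route/gateway are the same as in the loop above):

# : > routes.batch
# for i in $(seq 1 10000) ; do
    echo "route add 2001:b09::/32 via 2001:b08:2:280::4:1 dev eth0" >> routes.batch
    echo "route del 2001:b09::/32 via 2001:b08:2:280::4:1 dev eth0" >> routes.batch
  done
# while true ; do ip -6 -force -batch routes.batch ; done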
Hi Petr,

ip -6 route add and del was the first test I did before going into zebra/quagga. I added a few thousand routes and then deleted a few thousand in a loop with more than one thread, but could not reproduce after a few days. I can retry with exactly one add and one del, as you do.

btw, did you have a chance to capture the slabinfo in your latest test?

Thanks,
-- Martin
(In reply to Martin KaFai Lau from comment #22)
> I can retry with exactly one add and one del, as you do.
>
> btw, did you have a chance to capture the slabinfo in your latest test?

Hi, Martin!

Yes, this is the slabinfo from a 4.4.13 32-bit kernel after the test:

http://ftp.neva.ru/Tests/slab_bereza1.txt

Can you give me your kernel version and kernel config? (I reproduce this bug on 4.4.* and 4.7-rc* (but did not test 4.5.* and 4.6.*) with a simple add and remove of a static route, without using quagga/zebra.)

If you're interested, I can give you ssh access to my x86_64 test box.

--pety
Hi Petr,

your slabinfo indicates you are using ~100MB of slab; that shouldn't cause an OOM in any case.

How many CPUs does your system have? Is this a non-SMP kernel build? How much memory does the system have?

Thanks,
Hannes
And please also provide a dump of /proc/vmallocinfo after the OOM situation has happened. Thanks!
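In case it helps, a sketch for capturing that while the test runs rather than only after the OOM (the snapshot interval and file names are arbitrary); the diff should show whether the pcpu entries keep growing:

# i=0
# while true ; do
    cat /proc/vmallocinfo > vmalloc.$i
    if [ $i -gt 0 ] ; then
      diff vmalloc.$((i-1)) vmalloc.$i | grep pcpu >> vmalloc_pcpu.diff
    fi
    i=$((i+1))
    sleep 30
  done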
(In reply to hannes from comment #24)
> Hi Petr,
>
> your slabinfo indicates you are using ~100MB of slab; that shouldn't cause
> an OOM in any case.
>
> How many CPUs does your system have? Is this a non-SMP kernel build? How
> much memory does the system have?

Hi!

Yes, on the 32-bit system I get no OOM, only ENOMEM and ENOBUFS errors from netlink:

# ip -6 route add 2000:b08:3::/48 via 2001:B08:2:100::1 dev eth0
RTNETLINK answers: Cannot allocate memory
# ip route add 10.0.0.0/24 via 195.208.113.126 dev eth0
RTNETLINK answers: No buffer space available
#

I get the OOM only on x86_64 (64-bit) systems.

> How many CPUs does your system have? Is this a non-SMP kernel build? How
> much memory does the system have?

1. x86 (32-bit) system: SMP kernel, "Intel(R) Core(TM)2 CPU 4300 @ 1.80GHz" (2 cores), 2GB RAM
2. x86_64 (64-bit) system: SMP kernel, "Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz" (4 cores), 4GB RAM
3. x86_64 (64-bit) system: SMP kernel, "Intel(R) Xeon(R) CPU E5607 @ 2.27GHz" (8 cores), 48GB RAM
(In reply to hannes from comment #25)
> And please also provide a dump of /proc/vmallocinfo after the OOM situation
> has happened. Thanks!

http://ftp.neva.ru/Tests/vmallocinfo1.txt

--pety
(In reply to Petr Novopashenniy from comment #27)
> (In reply to hannes from comment #25)
> > And please also provide a dump of /proc/vmallocinfo after the OOM
> > situation has happened. Thanks!
>
> http://ftp.neva.ru/Tests/vmallocinfo1.txt

Clarification: in this case, for quicker testing, I set the memory size to only 1G.

--pety
(In reply to hannes from comment #25)
> And please also provide a dump of /proc/vmallocinfo after the OOM situation
> has happened. Thanks!

This is an example vmallocinfo diff while the test is running (20-30 seconds between the two snapshots):

53d52
< 0xffffc90000b21000-0xffffc90000b62000  266240 pcpu_mem_zalloc+0x2b/0x80 pages=64 vmalloc
58a58,60
> 0xffffc90001278000-0xffffc900012f9000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
> 0xffffc9000135b000-0xffffc900013dc000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
> 0xffffc90001458000-0xffffc900014d9000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
62a65
> 0xffffc9000176e000-0xffffc900017ef000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
65a69
> 0xffffc90001b44000-0xffffc90001bc5000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
68a73
> 0xffffc90001f44000-0xffffc90001fc5000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
71a77
> 0xffffc90002644000-0xffffc90002685000  266240 pcpu_mem_zalloc+0x2b/0x80 pages=64 vmalloc
72a79
> 0xffffc90002741000-0xffffc900027c2000  528384 pcpu_mem_zalloc+0x2b/0x80 pages=128 vmalloc
109a117,123
> 0xffffe8fff9600000-0xffffe8fff9800000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fff9800000-0xffffe8fff9a00000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fff9a00000-0xffffe8fff9c00000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fff9c00000-0xffffe8fff9e00000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fff9e00000-0xffffe8fffa000000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fffa000000-0xffffe8fffa200000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
> 0xffffe8fffa200000-0xffffe8fffa400000 2097152 pcpu_get_vm_areas+0x0/0x540 vmalloc
Hmm, this could be a fragmentation issue. Thanks, I will also try to reproduce with limited memory.
Created attachment 222071 [details]
Fix mem leak in rt6i_pcpu

Hi Petr,

Thanks for the info. It is very helpful.

I can reproduce what you have seen in /proc/vmallocinfo regarding the pcpu entries.

I have attached a patch with which I see good results. After this patch, I no longer see increasing consumption from pcpu in /proc/vmallocinfo. It is against the current net-next branch (4.7.0-rc5) and it is also applicable to the 4.4.y branch.

Can you give this patch a try?

Thanks,
-- Martin
(In reply to Martin KaFai Lau from comment #31)
> Created attachment 222071 [details]
> Fix mem leak in rt6i_pcpu
>
> I have attached a patch with which I see good results. After this patch, I
> no longer see increasing consumption from pcpu in /proc/vmallocinfo. It is
> against the current net-next branch (4.7.0-rc5) and it is also applicable
> to the 4.4.y branch.
>
> Can you give this patch a try?

Good day!

I tested 4.4.14 (32-bit and 64-bit) and 4.7-rc6 with this patch, and everything looks fine now. Thank you very much, Martin!

--pety