Bug 217777
| Summary: | kernel: WARNING: CPU: 0 PID: 16269 at net/netfilter/nf_conntrack_core.c:1210 | | |
|---|---|---|---|
| Product: | Networking | Reporter: | LimeTech (tomm) |
| Component: | Netfilter/Iptables | Assignee: | networking_netfilter-iptables (networking_netfilter-iptables) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | asmadeus, elessard, fw, marc, sebastian.held |
| Priority: | P3 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | Non-functional test case; Add bridge netfilter specific workaround for broadcast race bug | | |
Description
LimeTech
2023-08-08 17:44:11 UTC
Florian Westphal (comment #1)
Created attachment 304816 [details]
Non-functional test case
Conntrack has design assumptions that get broken by bridge netfilter.
However, I fail to reproduce this problem.
Attaching a non-functional reproducer. Can you please confirm if the setup is correct (macvlan on top of bridge)? I tried persistent flooding, but so far to no avail (the clones done by br_flood are processed in a serial fashion, so it's not surprising that this doesn't trigger any problem).
I can see a problematic skb_clone() in the bridge code (br_flood and friends), but not in the macvlan driver. By the time the packet makes it to macvlan, the conntrack entry is already confirmed, so skb_clone() is safe.
Is there anything else about the setup that makes a difference?
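For reference, a minimal sketch of the topology being asked about, using plain iproute2 commands; the interface names, addresses, and the broadcast ping are illustrative placeholders, not details taken from this report:

# bridge with a physical uplink port, plus a macvlan device stacked on the bridge
ip link add br0 type bridge
ip link set eth0 master br0
ip link set eth0 up
ip link set br0 up
ip link add mvl0 link br0 type macvlan mode bridge
ip link set mvl0 up
# from another host on the same segment, keep broadcast traffic arriving
ping -b -i 0.01 192.168.0.255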
(In reply to Florian Westphal from comment #1)
> Created attachment 304816 [details]
> Non-functional test case
>
> Conntrack has design assumptions that get broken by bridge netfilter.
> However, I fail to reproduce this problem.
>
> Attaching a non-functional reproducer. Can you please confirm if the setup
> is correct (macvlan on top of bridge)? I tried persistent flooding, but so
> far to no avail (the clones done by br_flood are processed in serial fashion
> so not surprising that this doesn't trigger any problem).
>
> I can see problematic skb_clone() in bridge (br_flood and friends), but not
> in macvlan driver. By the time packet makes it to macvlan the conntrack
> entry is already confirmed, so skb_clone() is safe.
>
> Is there anything else about the setup that makes a difference?

The only reliable way we can reproduce this issue is to start docker, setting up a macvlan network where the parent interface is a bridge on the host. You do not even need to start a container. See: https://docs.docker.com/network/network-tutorial-macvlan/

If the parent interface is not a bridge, the problem doesn't seem to happen. If you physically disconnect the ethernet cable, the problem doesn't seem to happen either.

As a side note: maybe the nf_nat_setup_info errors have the same source, as they can be solved by the same workarounds: https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/?tab=comments

Also assuming this is related to that unraid thread (I came here looking for nf_nat_setup_info but also see these warnings): I've hit a kernel panic, as attached below, on a similar setup to the initial post (eth0 > br0 > a couple of interfaces with macvlan, plus other classic bridge slaves, all on docker).

Looking at the exact RIP after loading up debug symbols, we're apparently here: nf_nat_setup_info > get_unique_tuple > find_appropriate_src > same_src (currently on a 6.1.69 kernel, so it's a bit old).

I ran out of time for a few days, but if that gives you an idea please feel free to send me a patch to try -- unfortunately it's a machine I can't reboot too often, but I hit warnings similar to the above every few days, and it's the second panic I've had in a couple of months, so at least the warning will be relatively easy to confirm. Now that I've confirmed it's a panic (I didn't have traces the last time it crashed) I'll set up kexec-kdump for the next time this happens, but if it's a race in RCU it probably won't help much -- I'll first take the time to decipher what exact field was accessed at the time of the general protection fault next week.
[1554289.142290] general protection fault, probably for non-canonical address 0xc79af08c657a8def: 0000 [#2] PREEMPT SMP NOPTI [1554289.142447] CPU: 21 PID: 841620 Comm: npm install Tainted: G D W 6.1.0-17-amd64 #1 Debian 6.1.69-1 [1554289.142585] Hardware name: ASUS System Product Name/PRIME B650-PLUS, BIOS 0823 11/21/2022 [1554289.142719] RIP: 0010:nf_nat_setup_info+0xa4d/0xd40 [nf_nat] [1554289.142865] Code: 89 ce 41 38 47 46 74 28 4d 8b bf 90 00 00 00 4d 85 ff 0f 84 41 fe ff ff 49 81 ef 90 00 00 00 0f 84 34 fe ff ff 0f b6 44 24 56 <41> 38 47 46 75 d8 49 8b 47 20 49 8b 57 28 48 33 44 24 30 48 33 54 [1554289.143155] RSP: 0018:ffffa085c061c550 EFLAGS: 00010286 [1554289.143302] RAX: 0000000000000006 RBX: ffff92b1b33c6500 RCX: ffffffffabc8ca40 [1554289.143448] RDX: ffff92ae51600000 RSI: 07e07d0650662282 RDI: 4e5012ca31d609fc [1554289.143598] RBP: ffffa085c061c628 R08: ffffa085c061c528 R09: 0000000000000000 [1554289.143744] R10: bbabb09d7b6a920d R11: d84ca9f0a96a01c4 R12: 0000000000000000 [1554289.143894] R13: ffffa085c061c63c R14: ffffffffabc8ca40 R15: c79af08c657a8da9 [1554289.144044] FS: 00007f58c3720800(0000) GS:ffff92bd58740000(0000) knlGS:0000000000000000 [1554289.144200] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1554289.144356] CR2: 0000000007351000 CR3: 00000007115ac000 CR4: 0000000000750ee0 [1554289.144513] PKRU: 55555554 [1554289.144670] Call Trace: [1554289.144817] <IRQ> [1554289.144966] ? __die_body.cold+0x1a/0x1f [1554289.145124] ? die_addr+0x38/0x60 [1554289.145273] ? exc_general_protection+0x221/0x4a0 [1554289.145421] ? asm_exc_general_protection+0x22/0x30 [1554289.145569] ? nf_nat_setup_info+0xa4d/0xd40 [nf_nat] [1554289.145717] ? krealloc+0x85/0xc0 [1554289.145862] nf_nat_masquerade_ipv4+0x142/0x190 [nf_nat] [1554289.146011] masquerade_tg+0x7f/0xa0 [xt_MASQUERADE] [1554289.146158] nft_target_eval_xt+0x67/0xa0 [nft_compat] [1554289.146306] nft_do_chain+0x198/0x5d0 [nf_tables] [1554289.146468] nft_nat_do_chain+0xa4/0x109 [nft_chain_nat] [1554289.146623] nf_nat_inet_fn+0x161/0x320 [nf_nat] [1554289.146773] nf_nat_ipv4_out+0x4b/0x100 [nf_nat] [1554289.146920] nf_hook_slow+0x41/0xc0 [1554289.147067] ip_output+0xe5/0x130 [1554289.147215] ? __ip_finish_output+0x190/0x190 [1554289.147361] ip_sabotage_in+0x64/0x70 [br_netfilter] [1554289.147508] nf_hook_slow+0x41/0xc0 [1554289.147651] ip_rcv+0x89/0x1d0 [1554289.147792] ? ip_sublist_rcv+0x240/0x240 [1554289.147927] __netif_receive_skb_one_core+0x89/0xa0 [1554289.148066] netif_receive_skb+0x115/0x160 [1554289.148201] br_handle_frame_finish+0x334/0x5b0 [bridge] [1554289.148348] ? br_handle_local_finish+0x20/0x20 [bridge] [1554289.148497] br_nf_hook_thresh+0xec/0x100 [br_netfilter] [1554289.148638] ? br_handle_local_finish+0x20/0x20 [bridge] [1554289.148787] br_nf_pre_routing_finish+0x15e/0x440 [br_netfilter] [1554289.148926] ? br_handle_local_finish+0x20/0x20 [bridge] [1554289.149076] ? nf_nat_ipv4_pre_routing+0x45/0xb0 [nf_nat] [1554289.149215] br_nf_pre_routing+0x47f/0x580 [br_netfilter] [1554289.149349] ? br_nf_hook_thresh+0x100/0x100 [br_netfilter] [1554289.149480] br_handle_frame+0x233/0x3b0 [bridge] [1554289.149614] ? br_handle_local_finish+0x20/0x20 [bridge] [1554289.149748] ? br_handle_frame_finish+0x5b0/0x5b0 [bridge] [1554289.149876] __netif_receive_skb_core.constprop.0+0x263/0xef0 [1554289.149986] ? update_load_avg+0x7e/0x780 [1554289.150098] ? 
sched_clock_cpu+0x9/0xb0 [1554289.150203] __netif_receive_skb_one_core+0x3c/0xa0 [1554289.150307] process_backlog+0x81/0x120 [1554289.150408] __napi_poll+0x2b/0x160 [1554289.150510] net_rx_action+0x29e/0x350 [1554289.150611] __do_softirq+0xc6/0x2ab [1554289.150712] do_softirq.part.0+0x5f/0x80 [1554289.150810] </IRQ> [1554289.150903] <TASK> [1554289.150998] __local_bh_enable_ip+0x64/0x70 [1554289.151091] __dev_queue_xmit+0x29f/0xd70 [1554289.151184] ? __local_bh_enable_ip+0x64/0x70 [1554289.151278] ? __dev_queue_xmit+0x29f/0xd70 [1554289.151372] ? apparmor_nf_unregister+0x20/0x20 [1554289.151466] ? nf_hook_slow+0x41/0xc0 [1554289.151560] ip_finish_output2+0x297/0x560 [1554289.151653] __ip_queue_xmit+0x171/0x450 [1554289.151748] __tcp_transmit_skb+0xaa4/0xc00 [1554289.151842] tcp_connect+0xb2a/0xe40 [1554289.151935] tcp_v4_connect+0x412/0x520 [1554289.152026] __inet_stream_connect+0xda/0x3e0 [1554289.152119] ? tomoyo_socket_connect_permission+0xaf/0xf0 [1554289.152215] inet_stream_connect+0x36/0x50 [1554289.152307] __sys_connect+0xa4/0xd0 [1554289.152399] __x64_sys_connect+0x14/0x20 [1554289.152488] do_syscall_64+0x5b/0xc0 [1554289.152580] ? exit_to_user_mode_prepare+0x40/0x1e0 [1554289.152672] entry_SYSCALL_64_after_hwframe+0x64/0xce [1554289.152763] RIP: 0033:0x7f58c390b25b [1554289.152855] Code: 83 ec 18 89 54 24 0c 48 89 34 24 89 7c 24 08 e8 bb fa ff ff 8b 54 24 0c 48 8b 34 24 41 89 c0 8b 7c 24 08 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 08 e8 f1 fa ff ff 8b 44 [1554289.153065] RSP: 002b:00007ffeaa34cad0 EFLAGS: 00000293 ORIG_RAX: 000000000000002a [1554289.153174] RAX: ffffffffffffffda RBX: 000000000734f130 RCX: 00007f58c390b25b [1554289.153284] RDX: 0000000000000010 RSI: 00007ffeaa34cb90 RDI: 0000000000000092 [1554289.153395] RBP: 00007ffeaa34cb60 R08: 0000000000000000 R09: 0000000006fffe80 [1554289.153505] R10: 00007ffeaa34caac R11: 0000000000000293 R12: 0000000006fffed8 [1554289.153616] R13: 00007ffeaa34cb90 R14: 0000000000000010 R15: 00007f58c3720690 [1554289.153725] </TASK> [1554289.153836] Modules linked in: tls qrtr netconsole bluetooth jitterentropy_rng drbg ansi_cprng ecdh_generic ecc joydev veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter overlay binfmt_misc macvlan bridge stp llc nls_ascii nls_cp437 vfat fat intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm amdgpu irqbypass ghash_clmulni_intel snd_hda_codec_realtek sha512_ssse3 sha512_generic snd_hda_codec_generic sha256_ssse3 sha1_ssse3 snd_hda_codec_hdmi snd_hda_intel gpu_sched snd_intel_dspcfg drm_buddy snd_intel_sdw_acpi drm_display_helper snd_hda_codec cec snd_hda_core aesni_intel rc_core drm_ttm_helper snd_hwdep ttm eeepc_wmi crypto_simd snd_pcm asus_wmi cryptd snd_timer drm_kms_helper battery platform_profile sparse_keymap snd ledtrig_audio rfkill i2c_algo_bit rapl sp5100_tco pcspkr soundcore wmi_bmof ccp k10temp watchdog sg [1554289.153864] acpi_cpufreq button evdev drm msr fuse loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c crc32c_generic hid_generic usbhid hid sd_mod dm_mod ahci libahci xhci_pci libata xhci_hcd nvme nvme_core usbcore scsi_mod r8169 t10_pi realtek mdio_devres libphy crc32_pclmul crc64_rocksoft crc32c_intel crc64 crc_t10dif i2c_piix4 crct10dif_generic usb_common scsi_common crct10dif_pclmul crct10dif_common video wmi 
gpio_amdpt gpio_generic [last unloaded: netconsole]
[1554289.155637] ---[ end trace 0000000000000000 ]---
[1554289.547937] RIP: 0010:refill_obj_stock+0x5b/0x180
[1554289.548123] Code: 57 49 8b 47 10 48 39 f8 0f 84 9c 00 00 00 4c 89 ff e8 59 f3 ff ff 49 89 c6 e8 f1 d2 de ff 48 8b 45 00 a8 03 0f 85 c9 00 00 00 <65> 48 ff 00 e8 0c 0c df ff 49 89 6f 10 8b 45 18 85 c0 74 07 31 c0
[1554289.548474] RSP: 0018:ffffa085c0a139d0 EFLAGS: 00010046
[1554289.548652] RAX: ffff92b6c701b290 RBX: 0000000000000070 RCX: 0000000000000020
[1554289.548830] RDX: ffff92ae43c54000 RSI: 0000000000000002 RDI: ffff92ae43c54000
[1554289.549012] RBP: ffff92ae5179cd00 R08: ffff92af4cb61e90 R09: 0000000000000238
[1554289.549193] R10: 0000000000000002 R11: ffffa085c0a139d0 R12: 0000000000000212
[1554289.549374] R13: ffffffffa8fa4df1 R14: ffff92ae4d97cd00 R15: ffff92bd582adb40
[1554289.549553] FS: 00007f58c3720800(0000) GS:ffff92bd58740000(0000) knlGS:0000000000000000
[1554289.549734] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1554289.549915] CR2: 0000000007351000 CR3: 00000007115ac000 CR4: 0000000000750ee0
[1554289.550098] PKRU: 55555554
[1554289.550285] Kernel panic - not syncing: Fatal exception in interrupt
[1554289.550972] Kernel Offset: 0x27c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[1554289.927344] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Thanks! So I was able to reproduce in a VM; I didn't try to simplify my setup much, I just reproduced what I had on the physical machine:

- gave it 20GB of RAM and 20 cores
- single virtio network interface
- br0 contains that interface
- a couple of macvlan interfaces on top of br0

** one directly:

# The primary network interface
auto br0
iface br0 inet static
    bridge_ports ens4
    bridge_stp no
    # need an address to please network scripts
    address 10.255.255.241/32
    # workaround to access macvlan containers...
    post-up ip link add macvlan-lan link br0 type macvlan mode bridge
    post-up ip a add x.y.z.99/16 dev macvlan-lan
    post-up ip link set macvlan-lan up
    post-up ip r add default via x.y.z.1 dev macvlan-lan
    # remove useless ip
    post-up ip a del 10.255.255.241/32 dev br0

** others through docker (probably don't need all 3, or actually even any at all -- I just happened to have these on the host):

docker network create -d macvlan -o parent=br0 --subnet=x.y.z.0/16 --gateway=x.y.z.1 dockernet
docker run --restart=always -d --network dockernet --name test --ip x.y.z.98 --network-alias test docker.io/alpine sleep infinity
docker run --restart=always -d --network dockernet --name test2 --ip x.y.z.97 --network-alias test2 docker.io/alpine sleep infinity
docker run --restart=always -d --network dockernet --name test3 --ip x.y.z.96 --network-alias test3 docker.io/alpine sleep infinity

- from outside, ping one of these IPs so there's at least a minimum of activity (pinged one of the containers with `ping -i 0.2 -s 1500 x.y.z.98` from another host)
- run docker containers in a loop; docker sets up its own default docker0 bridge interface that NATs through the default exit route (in our case, through the 'macvlan-lan' interface):

while true; do docker run --rm -ti docker.io/alpine ping -c 1 x.y.z.20; done

(the pinged address is something else on the network, probably doesn't matter)

After a few minutes I get the warning fairly reliably.
Which means I could get a core dump, this is just a guess but the culprit seems to be the interaction between these two tasks -- or at least something docker does when it starts a new container as I can't reproduce without spawning new containers: crash> bt 79591 PID: 79591 TASK: ffff917a46cb48c0 CPU: 11 COMMAND: "exe" #0 [ffffa2a0c5d3bb00] __schedule at ffffffffa342e60d #1 [ffffa2a0c5d3bb88] schedule at ffffffffa342ecfa #2 [ffffa2a0c5d3bba0] synchronize_rcu_expedited at ffffffffa2b32fae #3 [ffffa2a0c5d3bc58] ipv6_mc_down at ffffffffa33375b9 #4 [ffffa2a0c5d3bc78] addrconf_ifdown at ffffffffa330987a #5 [ffffa2a0c5d3bd18] addrconf_sysctl_disable at ffffffffa330e7e1 #6 [ffffa2a0c5d3bd98] proc_sys_call_handler at ffffffffa2e0429d #7 [ffffa2a0c5d3bdf0] vfs_write at ffffffffa2d59285 #8 [ffffa2a0c5d3be88] ksys_write at ffffffffa2d595cb #9 [ffffa2a0c5d3bec0] do_syscall_64 at ffffffffa342015b #10 [ffffa2a0c5d3bed8] __x64_sys_fcntl at ffffffffa2d717ed #11 [ffffa2a0c5d3bf18] do_syscall_64 at ffffffffa3420167 #12 [ffffa2a0c5d3bf50] entry_SYSCALL_64_after_hwframe at ffffffffa36000dc RIP: 000056378c84536e RSP: 000000c000cbfb40 RFLAGS: 00000206 RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 000056378c84536e RDX: 0000000000000002 RSI: 000000c000cbfdd6 RDI: 0000000000000008 RBP: 000000c000cbfb80 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000206 R12: 000000c000cbfcc0 R13: 0000000000000000 R14: 000000c0000061a0 R15: 00007f2f190c1818 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b and crash> bt PID: 128 TASK: ffff917ad384e100 CPU: 8 COMMAND: "kworker/u40:14" #0 [ffffa2a0c0298a80] machine_kexec at ffffffffa2a706af #1 [ffffa2a0c0298ad8] __crash_kexec at ffffffffa2b6bdf7 #2 [ffffa2a0c0298b98] panic at ffffffffa33dcdba #3 [ffffa2a0c0298c18] check_panic_on_warn.cold at ffffffffa33dcfab #4 [ffffa2a0c0298c20] __warn at ffffffffa2a9acd9 #5 [ffffa2a0c0298c58] report_bug at ffffffffa33b0a32 #6 [ffffa2a0c0298c98] handle_bug at ffffffffa3420721 #7 [ffffa2a0c0298ca8] exc_invalid_op at ffffffffa3420913 #8 [ffffa2a0c0298cc0] asm_exc_invalid_op at ffffffffa3600b36 [exception RIP: __nf_conntrack_confirm+1422] RIP: ffffffffc0c2232e RSP: ffffa2a0c0298d78 RFLAGS: 00010202 RAX: 0000000000000188 RBX: 0000000000002468 RCX: ffff917a40f78800 RDX: ffffffffc0c3f1a0 RSI: 0000000000000001 RDI: ffffffffc0c3f784 RBP: 00000000000381e1 R8: ffffa2a0c0298d40 R9: 0000000000000000 R10: a6667d640558100e R11: 56e02d92e5e345fa R12: 00000000000381e1 R13: 0000000000002468 R14: ffff917a4adbe300 R15: 0000000000000068 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #9 [ffffa2a0c0298de0] nf_confirm at ffffffffc0c27b30 [nf_conntrack] #10 [ffffa2a0c0298e08] nf_hook_slow at ffffffffa324cac1 #11 [ffffa2a0c0298e38] ip_local_deliver at ffffffffa32575de #12 [ffffa2a0c0298e90] __netif_receive_skb_one_core at ffffffffa31c53b9 #13 [ffffa2a0c0298eb8] process_backlog at ffffffffa31c5631 #14 [ffffa2a0c0298ee8] __napi_poll at ffffffffa31c615b #15 [ffffa2a0c0298f10] net_rx_action at ffffffffa31c668e #16 [ffffa2a0c0298f98] __softirqentry_text_start at ffffffffa38000c6 #17 [ffffa2a0c0298ff0] do_softirq at ffffffffa2aa19af --- <IRQ stack> --- #18 [ffffa2a0c04f3da0] do_softirq at ffffffffa2aa19af [exception RIP: unknown or invalid address] RIP: ffff917a42bac000 RSP: ffffffffc0a8a395 RFLAGS: ffff917a42bac838 RAX: ffffffffc0a8a06d RBX: 0000000000000000 RCX: ffff917a42bac000 RDX: 0000000000000000 RSI: ffff917a4526b800 RDI: ffffa2a0c04f3e88 RBP: 0000000000000202 R8: ffff917a4526b800 R9: ffff917a42bce980 R10: ffffffffa31bf05f R11: 
ffffffffa2aa1b24 R12: ffff917a544b720c R13: ffff917a4526b800 R14: ffffffffa31beda5 R15: 0000000000000000 ORIG_RAX: 0000000000000000 CS: ffffa2a0c04f3e40 SS: ffffa2a0c04f3e40
bt: WARNING: possibly bogus exception frame
#19 [ffffa2a0c04f3e90] process_one_work at ffffffffa2abd927
#20 [ffffa2a0c04f3ed8] worker_thread at ffffffffa2abdecd
#21 [ffffa2a0c04f3f18] kthread at ffffffffa2ac54fa
#22 [ffffa2a0c04f3f50] ret_from_fork at ffffffffa2a02062

(this is still on debian's 6.1.0-17-amd64, so 6.1.69)

Now that I have a setup I can mess with without disturbing all my colleagues, I'll first confirm whether that's still a problem on master, and depending on that look further into the crash (either simplify the reproducer or just look into it more directly).

I'm starting to be unsure whether this is related to the nf_nat_setup_info bug I got the other day... But at least the __nf_conntrack_confirm "Another skb with the same unconfirmed conntrack may win the race" warning still happens as of master (a few days old - on 021533194476 ("Kconfig: Disable -Wstringop-overflow for GCC globally")):

[ 5376.556356] [ C0] ------------[ cut here ]------------
[ 5376.556959] [ C0] WARNING: CPU: 0 PID: 1160242 at net/netfilter/nf_conntrack_core.c:1198 __nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.558158] [ C0] Modules linked in: veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter overlay macvlan bridge stp llc binfmt_misc ppdev bochs drm_vram_helper drm_kms_helper drm_ttm_helper ttm parport_pc parport evdev pcspkr joydev button sg serio_raw drm fuse dm_mod msr loop configfs efi_pstore qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq sd_mod t10_pi crc64_rocksoft crc64 ata_generic crc_t10dif crct10dif_generic crct10dif_common ata_piix virtio_net libata net_failover virtio_scsi failover scsi_mod virtio_pci psmouse virtio virtio_pci_legacy_dev virtio_pci_modern_dev floppy virtio_ring i2c_piix4 scsi_common
[ 5376.566250] [ C0] CPU: 0 PID: 1160242 Comm: kworker/u40:1 Kdump: loaded Not tainted 6.8.0-rc2+ #2
[ 5376.566984] [ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 5376.567829] [ C0] Workqueue: events_unbound macvlan_process_broadcast [macvlan]
[ 5376.568427] [ C0] RIP: 0010:__nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.569052] [ C0] Code: 8b 64 24 10 e9 6e fe ff ff 44 0f b7 62 0c 31 ed 85 c9 74 05 41 0f b7 6e 0c 66 44 39 e5 0f 85 2a ff ff ff eb b9 45 31 e4 eb ea <0f> 0b 4c 89 cf e8 8a 67 f7 db 8b 44 24 08 41 39 c7 75 18 be 00 02
[ 5376.570681] [ C0] RSP: 0018:ffffb62fc0003d70 EFLAGS: 00010202
[ 5376.571138] [ C0] RAX: 0000000000000188 RBX: 0000000000009731 RCX: 0000000000000102
[ 5376.571762] [ C0] RDX: 0000000000000331 RSI: 0000000000000001 RDI: ffff934ac20cde00
[ 5376.572384] [ C0] RBP: 000000000000c102 R08: 0111efe700000000 R09: ffffffffc0c27984
[ 5376.573030] [ C0] R10: dedebea46dac82d1 R11: fa59405d9ecc4042 R12: 000000000000c102
[ 5376.573654] [ C0] R13: 0000000000009731 R14: ffff934b294e3600 R15: 0000000000000331
[ 5376.574277] [ C0] FS: 0000000000000000(0000) GS:ffff934b3f800000(0000) knlGS:0000000000000000
[ 5376.575001] [ C0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5376.575505] [ C0] CR2: 00007f0484cb38d0 CR3: 0000000035048000 CR4: 00000000000006f0
[ 5376.576128] [ C0] Call Trace:
[ 5376.576352] [ C0] <IRQ>
[ 5376.576537] [ C0] ? __nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.577100] [ C0] ? __warn+0x7d/0x130
[ 5376.577394] [ C0] ? __nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.577929] [ C0] ? report_bug+0x16d/0x1a0
[ 5376.578259] [ C0] ? handle_bug+0x41/0x70
[ 5376.578571] [ C0] ? exc_invalid_op+0x13/0x60
[ 5376.578923] [ C0] ? asm_exc_invalid_op+0x16/0x20
[ 5376.579294] [ C0] ? __nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.579829] [ C0] nf_confirm+0x2d0/0x2f0 [nf_conntrack]
[ 5376.580262] [ C0] nf_hook_slow+0x41/0xf0
[ 5376.580582] [ C0] ip_local_deliver+0xce/0x110
[ 5376.580958] [ C0] ? __pfx_ip_local_deliver_finish+0x10/0x10
[ 5376.581412] [ C0] __netif_receive_skb_one_core+0x89/0xa0
[ 5376.581845] [ C0] process_backlog+0x81/0x120
[ 5376.582186] [ C0] __napi_poll+0x2b/0x1c0
[ 5376.582498] [ C0] net_rx_action+0x145/0x2c0
[ 5376.582843] [ C0] ? activate_task+0x61/0x120
[ 5376.583186] [ C0] ? wakeup_preempt+0x33/0x70
[ 5376.583526] [ C0] __do_softirq+0xbe/0x2aa
[ 5376.583848] [ C0] do_softirq.part.0+0x3d/0x60
[ 5376.584198] [ C0] </IRQ>
[ 5376.584390] [ C0] <TASK>
[ 5376.584583] [ C0] __local_bh_enable_ip+0x64/0x70
[ 5376.584974] [ C0] netif_rx+0xdf/0xf0
[ 5376.585260] [ C0] macvlan_broadcast+0xed/0x190 [macvlan]
[ 5376.585695] [ C0] macvlan_process_broadcast+0xe9/0x150 [macvlan]
[ 5376.586187] [ C0] process_one_work+0x174/0x340
[ 5376.586543] [ C0] worker_thread+0x277/0x390
[ 5376.586886] [ C0] ? __pfx_worker_thread+0x10/0x10
[ 5376.587265] [ C0] kthread+0xe4/0x110
[ 5376.587551] [ C0] ? __pfx_kthread+0x10/0x10
[ 5376.587885] [ C0] ret_from_fork+0x30/0x50
[ 5376.588205] [ C0] ? __pfx_kthread+0x10/0x10
[ 5376.588539] [ C0] ret_from_fork_asm+0x1b/0x30
[ 5376.588911] [ C0] </TASK>

I could reproduce the bug a few times, then had much more trouble reproducing once the work day was over, but looking at the trace this seems to be about broadcast (my previous trace wasn't so clear, but the first trace given in the original report also says broadcast) -- with that in mind my ping probably wasn't of any help, and I was relying on background noise from other folks on the same subnet. That'd also explain why it's hard to reproduce in a lab if something specific happens at that time... I've dumped all the broadcasts I could see on the hypervisor around the time of the crash but there isn't much odd; I'll try to isolate the VM and replay the pcap later to see if I can force it and not just "wait a few hours".

Now the reason I don't think this warning is related to the nf_nat_setup_info bug is that this warning can happen when containers all are idle, just from outside traffic; the last message referred to mcast teardown but it was just a fluke, I've reproduced on pure idling with containers up. OTOH the nat setup info and other crashes I've seen were all shortly after a container started, so something about network creation/teardown is probably related (which is why I had been running a docker run in a loop).

I've also started to think I don't really need macvlan for my setup, so the easiest way out of this mess will probably just be to convert to something simpler... But curiosity killed the cat, so I'll probably keep looking a bit further. I might not have much of a lead on the weird crash, but this warning can definitely be understood with a little more time -- if I find some trigger payload it'll be easy to simplify the setup one step at a time and have something you could look at.

> Now the reason I don't think this warning is related to the nf_nat_setup_info
> bug, is that this warning can happen when containers all are idle just from
> outside traffic, last message referred to mcast teardown but it was just a
> fluke, I've reproduced on pure idling with containers up. OTOH the nat setup
> info and other crashes I've seen were all shortly after a container started,
> so something about network creation/teardown is probably related (which is
> why I had been running a docker run in a loop)
This might just have been a wrong guess; I got a warn that's similar to the crash without starting containers (still on master):
[35150.155883] [ C14] ------------[ cut here ]------------
[35150.156167] [ C14] WARNING: CPU: 14 PID: 139 at net/netfilter/nf_nat_core.c:678 nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.156816] [ C14] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter overlay macvlan bridge stp llc binfmt_misc ppdev bochs drm_vram_helper drm_kms_helper drm_ttm_helper ttm parport_pc parport pcspkr joydev button evdev sg serio_raw drm dm_mod fuse msr loop efi_pstore configfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq sd_mod ata_generic t10_pi virtio_net crc64_rocksoft crc64 ata_piix crc_t10dif crct10dif_generic crct10dif_common libata net_failover failover virtio_scsi scsi_mod psmouse virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev floppy i2c_piix4 virtio_ring scsi_common
[35150.161299] [ C14] CPU: 14 PID: 139 Comm: kworker/u40:4 Kdump: loaded Not tainted 6.8.0-rc2+ #2
[35150.161818] [ C14] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[35150.162429] [ C14] Workqueue: events_unbound macvlan_process_broadcast [macvlan]
[35150.162879] [ C14] RIP: 0010:nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.163240] [ C14] Code: 89 87 80 00 00 00 48 85 ff 0f 84 c7 fa ff ff 0f b6 07 84 c0 0f 85 67 fa ff ff e9 b0 fa ff ff 0f 0b f6 c4 01 0f 84 22 f8 ff ff <0f> 0b 31 c0 e9 bb fa ff ff 48 8b 74 24 20 48 8d 54 24 38 4c 89 cf
[35150.164409] [ C14] RSP: 0018:ffffb13780420c70 EFLAGS: 00010282
[35150.164746] [ C14] RAX: 0000000000000180 RBX: 0000000000000000 RCX: ffff8c530e6f2ae0
[35150.165209] [ C14] RDX: 0000000000000000 RSI: ffffb13780420d6c RDI: ffff8c533ccd0f00
[35150.165672] [ C14] RBP: ffffb13780420d50 R08: 00000000850110ac R09: 0000000000000000
[35150.166125] [ C14] R10: 0000000000000000 R11: ffff8c53025be9c0 R12: 0000000000000000
[35150.166591] [ C14] R13: ffffb13780420d6c R14: ffff8c533ccd0f00 R15: ffff8c533ccd0f00
[35150.167074] [ C14] FS: 0000000000000000(0000) GS:ffff8c537fb80000(0000) knlGS:0000000000000000
[35150.167592] [ C14] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[35150.167952] [ C14] CR2: 00005580c7b6f008 CR3: 000000000d578000 CR4: 00000000000006f0
[35150.168418] [ C14] Call Trace:
[35150.168583] [ C14] <IRQ>
[35150.168715] [ C14] ? nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.169055] [ C14] ? __warn+0x7d/0x130
[35150.169254] [ C14] ? nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.169581] [ C14] ? report_bug+0x16d/0x1a0
[35150.169815] [ C14] ? handle_bug+0x41/0x70
[35150.170035] [ C14] ? exc_invalid_op+0x13/0x60
[35150.170283] [ C14] ? asm_exc_invalid_op+0x16/0x20
[35150.170564] [ C14] ? nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.170898] [ C14] ? rt_dst_alloc+0x2a/0xb0
[35150.171147] [ C14] __nf_nat_alloc_null_binding+0x5e/0x90 [nf_nat]
[35150.171510] [ C14] nf_nat_inet_fn+0x2c3/0x320 [nf_nat]
[35150.171822] [ C14] nf_nat_ipv4_local_in+0x54/0x160 [nf_nat]
[35150.172154] [ C14] nf_hook_slow+0x41/0xf0
[35150.172386] [ C14] ip_local_deliver+0xce/0x110
[35150.172645] [ C14] ? __pfx_ip_local_deliver_finish+0x10/0x10
[35150.173001] [ C14] __netif_receive_skb_one_core+0x89/0xa0
[35150.173312] [ C14] process_backlog+0x81/0x120
[35150.173564] [ C14] __napi_poll+0x2b/0x1c0
[35150.173796] [ C14] net_rx_action+0x145/0x2c0
[35150.174042] [ C14] ? activate_task+0x61/0x120
[35150.174287] [ C14] ? wakeup_preempt+0x33/0x70
[35150.174529] [ C14] __do_softirq+0xbe/0x2aa
[35150.174765] [ C14] do_softirq.part.0+0x3d/0x60
[35150.175018] [ C14] </IRQ>
[35150.175158] [ C14] <TASK>
[35150.175302] [ C14] __local_bh_enable_ip+0x64/0x70
[35150.175572] [ C14] netif_rx+0xdf/0xf0
[35150.175784] [ C14] macvlan_broadcast+0xed/0x190 [macvlan]
[35150.176098] [ C14] macvlan_process_broadcast+0xe9/0x150 [macvlan]
[35150.176461] [ C14] process_one_work+0x174/0x340
[35150.176727] [ C14] worker_thread+0x277/0x390
[35150.176964] [ C14] ? __pfx_worker_thread+0x10/0x10
[35150.177234] [ C14] kthread+0xe4/0x110
[35150.177447] [ C14] ? __pfx_kthread+0x10/0x10
[35150.177680] [ C14] ret_from_fork+0x30/0x50
[35150.177908] [ C14] ? __pfx_kthread+0x10/0x10
[35150.178149] [ C14] ret_from_fork_asm+0x1b/0x30
[35150.178412] [ C14] </TASK>
Looking at the panic-on-warn crash, we have:
- nf_nat_inet_fn()
- going into the !nf_nat_initialized() branch
- lpriv->entries is NULL at the time of the crash, so the sub-hooks are skipped
- nf_nat_alloc_null_binding -> nf_nat_setup_info
- hitting WARN_ON(nf_nat_initialized)
Unfortunately, by the time we got there, there was no trace of the other thread that would have processed this nf_conn. If the timing had been slightly different the WARN might not have fired, and I'm not surprised this can crash just after that, as I've seen.
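One way to dig a little further in a dump like this (just a suggestion, not something that was done here): if the nf_conn address can be recovered from the registers or the stack, crash(8) can print its status field, whose bits include the confirmed and NAT-done flags that these checks look at; the address and value below are placeholders:

crash> struct nf_conn.status <nf_conn-address>
crash> eval -b <status-value>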
So we have two independent threads operating on the same "ct"?
It comes from nf_nat_inet_fn's `ct = nf_ct_get(skb, &ctinfo)`, so I guess we have two threads processing two skbs that share the same "_nfct"...?
Snooping around with bpftrace (the check filters on IP_CT_NEW to reduce spam), I see that seems to be the case:
# bpftrace -e 'kfunc:nf_nat_inet_fn { $ct = args->skb->_nfct; if (($ct & 7) == 2) {
printf("%d/%s: %x: %x: %s\n", tid, comm, args->skb, args->skb->_nfct, kstack);
}}'
(output trimmed a bit for brevity: the format shows tid/comm (because the swapper threads all have pid 0...), followed by the skb, then our 'ct' field with its ctinfo bits, and a stack)
0/swapper/10: 42f37200: 44b82502:
nf_nat_inet_fn+5
nf_nat_ipv4_pre_routing+69
nf_hook_slow+65
br_nf_pre_routing+1288
br_handle_frame+543
__netif_receive_skb_core.constprop.0+521
__netif_receive_skb_list_core+314
netif_receive_skb_list_internal+461
napi_complete_done+110
virtnet_poll+1009
__napi_poll+43
net_rx_action+332
__do_softirq+232
0/swapper/10: 42f37200: 44b82502:
(same stack as above)
138/kworker/u40:3: 42e70300: 44b82502:
nf_nat_inet_fn+5
nf_nat_ipv4_pre_routing+69
nf_hook_slow+65
ip_rcv+302
__netif_receive_skb_one_core+137
process_backlog+129
__napi_poll+43
net_rx_action+332
__do_softirq+232
do_softirq.part.0+114
__local_bh_enable_ip+129
netif_rx+216
macvlan_broadcast+242
macvlan_process_broadcast+239
process_one_work+372
worker_thread+645
138/kworker/u40:3: 42e70300: 44b82502:
(same stack as above)
I'm not sure whether being listed twice in a row is an artifact of something, but that's not a problem; having two different workers working on the same ct, though, definitely looks like one.
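If it helps, a possible refinement of the probe above (my own sketch, assuming a bpftrace recent enough for kfunc args and integer casts) masks off the low ctinfo bits so the shared nf_conn pointer is printed directly:

bpftrace -e 'kfunc:nf_nat_inet_fn {
  $nfct = (uint64)args->skb->_nfct;
  if (($nfct & 7) == 2) {
    // IP_CT_NEW only; print the nf_conn pointer with the 3 low ctinfo bits stripped
    printf("%d/%s skb=%llx ct=%llx\n", tid, comm, (uint64)args->skb, $nfct & 0xfffffffffffffff8);
  }
}'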
@Florian (or anyone else), does that information ring a bell anywhere?
Unfortunately I've not been able to reproduce when disconnected from the real subnet yet; it looks like it's not just a matter of spamming broadcast packets into the VM. I'll try to spend a bit more time on a reproducer if you need one, but if this is enough to figure something out I'll be happy to just try any patch in my environment -- my VM hits the warning ~3-4 times a day, so it's good enough to validate something if you're not in a hurry.
Thanks!
Created attachment 305824 [details]
Add bridge netfilter specific workaround for broadcast race bug
Could someone with a problematic setup test the attached patch? It adds an (ugly) workaround for broad/multicast packets.
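For anyone else wanting to try it, a rough outline of what testing the attachment could look like; the file name and grep pattern are placeholders of mine, assuming a git tree the patch applies to:

cd linux
git apply --stat bridge-broadcast-workaround.patch   # saved copy of attachment 305824
git apply bridge-broadcast-workaround.patch
make -j"$(nproc)" && sudo make modules_install install
# after booting the patched kernel, watch for the two warnings discussed above:
journalctl -kf | grep -E 'nf_conntrack_core|nf_nat_setup_info'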
Thanks for the prompt reply! I've rebuilt a kernel with this patch and will let the VM sit for a while; I will report back in ~12h, as that should be enough time to assess whether the problems are gone -- I normally get 3-4 warnings during the work day. (At least feature-wise, I'm not sure I understand the implications of disabling NAT/conntrack helpers, but it doesn't look like anything I care about broke.)

The VM hasn't encountered any warning so far (~10 hours). I'll leave it up for the next few days and periodically check, and report if there's any trouble. I'm away this week so I can't upgrade the real machine, but I will make some time next week to upgrade (the patch backports almost cleanly to 6.1) and see if that helps with the mysterious crashes we get once a month...!

Sorry for the lack of updates... I've had zero warnings on the VM since the 6th -- the host crashed (unpatched) on the 9th and rebooted itself into my 6.1 backport of this patch, and hasn't exhibited any warning since either. I'd say this gets rid of this class of warnings :) I'll keep an eye out for further crashes -- these only occurred once a month or so -- but I have every reason to believe that's also fixed. Thanks a lot! I'd appreciate it if you cc me when submitting the patch, just for follow-up (I'll retest & send a Tested-by at least); not sure if we want to backport something this big to 6.1, but I can live with rebuilding security updates for a while.

Florian, any updates on this? You've not pushed the commit yet, it seems.

Something close to the patch I tested made it in as 62e7151ae3eb465e0ab52a20c941ff33bb6332e9 in v6.8, and has also been backported to semi-recent stable trees:
- cb734975b0ffa688ff6cc0eed463865bf07b6c01 in v6.7.9
- 80cd0487f630b5382734997c3e5e3003a77db315 in v6.6.21
- 2b1414d5e94e477edff1d2c79030f1d742625ea0 in v6.1.81
- 7c3f28599652acf431a2211168de4a583f30b6d5 in v5.15.151

I'll be switching to it later, but assuming it works until proven otherwise. Closing as resolved. Thanks Florian & Pablo!

After upgrading from 5.15.148 to 5.15.155, we saw a change in behavior in our docker + IP multicast network setup, and we've identified that commit 7c3f28599652acf431a2211168de4a583f30b6d5 was responsible for the change. I'm still a bit early in the investigation, but I wanted to share what I found (in case someone else ends up with the same issue), and because I'm also interested in any suggestions on how this could be worked around.

The problem is the following: on a host, we have a "docker bridge" network (composed of vethX interfaces + a bridge interface) and an eth0 NIC connected to an external network. The "docker bridge" network is on a 172.17.0.0/16 subnet, and eth0 has a 192.168.1.10/24 IP. IP forwarding is enabled, so a container connected to that "docker bridge" network can reach hosts on the external network by using the bridge interface's IP as the gateway, and the packets get masqueraded with eth0's IP before being sent on the external network (via the iptables rules added by docker). Nothing special here, and that part is still working fine.

Where it gets interesting: we are also doing IP multicast routing between the docker network and eth0. We're running smcrouted with a (*, G) "multicast route" that says to forward the traffic for the 239.255.255.250 multicast group from the bridge interface (IIF) to eth0 (OIF). Before this patch (e.g. 5.15.148), the forwarded multicast packet that was going out on eth0 had its source IP address NATed.
After this patch, the source IP address is not NATed anymore, so the source IP is instead the "IP of the docker container". I haven't tried on a mainline kernel... yet. Is this the expected behavior from now on? Any suggestions? Thanks
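For context, a minimal smcroute.conf sketch of the (*, G) forwarding rule described above; only the group address and the IIF-to-OIF direction come from the comment, and the bridge interface name (docker0 here) is an assumption:

# /etc/smcroute.conf -- forward the 239.255.255.250 group from the docker bridge to eth0
mgroup from docker0 group 239.255.255.250
mroute from docker0 group 239.255.255.250 to eth0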