Bug 217777

Summary: kernel: WARNING: CPU: 0 PID: 16269 at net/netfilter/nf_conntrack_core.c:1210
Product: Networking
Component: Netfilter/Iptables
Reporter: LimeTech (tomm)
Assignee: networking_netfilter-iptables (networking_netfilter-iptables)
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P3
CC: asmadeus, elessard, fw, marc, sebastian.held
Hardware: All
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Non-functional test case
             Add bridge netfilter specific workaround for broadcast race bug

Description LimeTech 2023-08-08 17:44:11 UTC
This call trace occurs consistently when using the docker macvlan driver where the macvlan network's parent interface is a Linux bridge.

Below is the typical configuration of br0 and the docker macvlan network:

# commands to configure eth0 / br0 (br0 uses dhcp to obtain IP address)
ip link add name br0 type bridge stp_state 0 forward_delay 0 nf_call_iptables 1 nf_call_ip6tables 1 nf_call_arptables 1
ip link set br0 up
ip link set eth0 down
ip -4 addr flush dev eth0
ip -6 addr flush dev eth0
ip link set eth0 promisc on master br0 up

# command to configure docker macvlan network on br0
docker network create -d macvlan --subnet=10.0.101.0/24 --gateway=10.0.101.1 --aux-address=server=10.0.101.13 -o parent=br0 br0
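
A quick sanity check of the bridge netfilter hookup (a sketch; it assumes br_netfilter is built as a module): with br_netfilter loaded, the global sysctls below default to 1 and, together with the per-bridge nf_call_iptables options set above, are what make bridged traffic traverse the iptables hooks -- and thus conntrack -- in the first place.

# verify bridge netfilter call-outs
modprobe br_netfilter
sysctl net.bridge.bridge-nf-call-iptables
sysctl net.bridge.bridge-nf-call-ip6tables
sysctl net.bridge.bridge-nf-call-arptables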

We are currently running the 6.1.43 kernel, but this issue has been happening with all previous kernels for at least a couple of years now.  For example:
https://www.spinics.net/lists/netfilter/msg59040.html

Syzbot also detected this same issue in 6.5.0-rc2:
https://www.spinics.net/lists/netfilter-devel/msg81831.html

Finally, we also tried creating a vhost interface attached to br0 and setting that as the docker macvlan "parent" - same result.


Aug  8 10:06:51 ODROID kernel: ------------[ cut here ]------------
Aug  8 10:06:51 ODROID kernel: WARNING: CPU: 0 PID: 16269 at net/netfilter/nf_conntrack_core.c:1210 __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Aug  8 10:06:51 ODROID kernel: Modules linked in: xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap xt_nat xt_tcpudp veth xt_conntrack nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xfs cmac cifs asn1_decoder cifs_arc4 cifs_md4 oid_registry dns_resolver md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag emc2103 iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs af_packet 8021q garp mrp macvlan bridge stp llc bonding tls r8125(O) intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel i915 kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 iosf_mbi drm_buddy
Aug  8 10:06:51 ODROID kernel: aesni_intel i2c_algo_bit ttm drm_display_helper crypto_simd drm_kms_helper cryptd mei_hdcp mei_pxp nvme rapl i2c_i801 sr_mod input_leds intel_cstate drm i2c_smbus cdrom nvme_core joydev led_class video wmi intel_gtt ahci agpgart backlight libahci mei_me intel_pmc_core i2c_core syscopyarea sysfillrect mei sysimgblt fb_sys_fops thermal fan button unix [last unloaded: r8125(O)]
Aug  8 10:06:51 ODROID kernel: CPU: 0 PID: 16269 Comm: kworker/u8:0 Tainted: P           O       6.1.43-Unraid #2
Aug  8 10:06:51 ODROID kernel: Hardware name: HARDKERNEL ODROID-H2/ODROID-H2, BIOS 5.13 04/27/2020
Aug  8 10:06:51 ODROID kernel: Workqueue: events_unbound macvlan_process_broadcast [macvlan]
Aug  8 10:06:51 ODROID kernel: RIP: 0010:__nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Aug  8 10:06:51 ODROID kernel: Code: 44 24 10 e8 e2 e1 ff ff 8b 7c 24 04 89 ea 89 c6 89 04 24 e8 7e e6 ff ff 84 c0 75 a2 48 89 df e8 9b e2 ff ff 85 c0 89 c5 74 18 <0f> 0b 8b 34 24 8b 7c 24 04 e8 18 dd ff ff e8 93 e3 ff ff e9 72 01
Aug  8 10:06:51 ODROID kernel: RSP: 0018:ffffc90000003d98 EFLAGS: 00010202
Aug  8 10:06:51 ODROID kernel: RAX: 0000000000000001 RBX: ffff88814a7ac400 RCX: 9c0c70d57470940d
Aug  8 10:06:51 ODROID kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88814a7ac400
Aug  8 10:06:51 ODROID kernel: RBP: 0000000000000001 R08: 2fcce5ef4761f5d5 R09: 28e40e9ae48c7a5f
Aug  8 10:06:51 ODROID kernel: R10: f5516b05dfc149e9 R11: ffffc90000003d60 R12: ffffffff82a11d00
Aug  8 10:06:51 ODROID kernel: R13: 0000000000011231 R14: ffff88814c989d00 R15: 0000000000000000
Aug  8 10:06:51 ODROID kernel: FS:  0000000000000000(0000) GS:ffff88846fc00000(0000) knlGS:0000000000000000
Aug  8 10:06:51 ODROID kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  8 10:06:51 ODROID kernel: CR2: 0000000000537d30 CR3: 000000018567a000 CR4: 0000000000350ef0
Aug  8 10:06:51 ODROID kernel: Call Trace:
Aug  8 10:06:51 ODROID kernel: <IRQ>
Aug  8 10:06:51 ODROID kernel: ? __warn+0xab/0x122
Aug  8 10:06:51 ODROID kernel: ? report_bug+0x109/0x17e
Aug  8 10:06:51 ODROID kernel: ? __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Aug  8 10:06:51 ODROID kernel: ? handle_bug+0x41/0x6f
Aug  8 10:06:51 ODROID kernel: ? exc_invalid_op+0x13/0x60
Aug  8 10:06:51 ODROID kernel: ? asm_exc_invalid_op+0x16/0x20
Aug  8 10:06:51 ODROID kernel: ? __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Aug  8 10:06:51 ODROID kernel: ? __nf_conntrack_confirm+0x9e/0x2b0 [nf_conntrack]
Aug  8 10:06:51 ODROID kernel: ? nf_nat_inet_fn+0x60/0x1a8 [nf_nat]
Aug  8 10:06:51 ODROID kernel: nf_conntrack_confirm+0x25/0x54 [nf_conntrack]
Aug  8 10:06:51 ODROID kernel: nf_hook_slow+0x3a/0x96
Aug  8 10:06:51 ODROID kernel: ? ip_protocol_deliver_rcu+0x164/0x164
Aug  8 10:06:51 ODROID kernel: NF_HOOK.constprop.0+0x79/0xd9
Aug  8 10:06:51 ODROID kernel: ? ip_protocol_deliver_rcu+0x164/0x164
Aug  8 10:06:51 ODROID kernel: __netif_receive_skb_one_core+0x77/0x9c
Aug  8 10:06:51 ODROID kernel: process_backlog+0x8c/0x116
Aug  8 10:06:51 ODROID kernel: __napi_poll.constprop.0+0x28/0x124
Aug  8 10:06:51 ODROID kernel: net_rx_action+0x159/0x24f
Aug  8 10:06:51 ODROID kernel: __do_softirq+0x126/0x288
Aug  8 10:06:51 ODROID kernel: do_softirq+0x7f/0xab
Aug  8 10:06:51 ODROID kernel: </IRQ>
Aug  8 10:06:51 ODROID kernel: <TASK>
Aug  8 10:06:51 ODROID kernel: __local_bh_enable_ip+0x4c/0x6b
Aug  8 10:06:51 ODROID kernel: netif_rx+0x52/0x5a
Aug  8 10:06:51 ODROID kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Aug  8 10:06:51 ODROID kernel: ? _raw_spin_unlock+0x14/0x29
Aug  8 10:06:51 ODROID kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]
Aug  8 10:06:51 ODROID kernel: process_one_work+0x1a8/0x295
Aug  8 10:06:51 ODROID kernel: worker_thread+0x18b/0x244
Aug  8 10:06:51 ODROID kernel: ? rescuer_thread+0x281/0x281
Aug  8 10:06:51 ODROID kernel: kthread+0xe4/0xef
Aug  8 10:06:51 ODROID kernel: ? kthread_complete_and_exit+0x1b/0x1b
Aug  8 10:06:51 ODROID kernel: ret_from_fork+0x1f/0x30
Aug  8 10:06:51 ODROID kernel: </TASK>
Aug  8 10:06:51 ODROID kernel: ---[ end trace 0000000000000000 ]---
Comment 1 Florian Westphal 2023-08-11 10:31:01 UTC
Created attachment 304816 [details]
Non-functional test case

Conntrack has design assumptions that get broken by bridge netfilter.
However, I fail to reproduce this problem.

Attaching a non-functional reproducer.  Can you please confirm whether the setup is correct (macvlan on top of a bridge)? I tried persistent flooding, but so far to no avail (the clones made by br_flood are processed serially, so it's not surprising that this doesn't trigger any problem).

I can see problematic skb_clone() calls in the bridge (br_flood and friends), but not in the macvlan driver. By the time the packet makes it to macvlan, the conntrack entry is already confirmed, so skb_clone() is safe.

Is there anything else about the setup that makes a difference?
Comment 2 LimeTech 2023-08-18 17:16:20 UTC
(In reply to Florian Westphal from comment #1)
> Created attachment 304816 [details]
> Non-functional test case
> 
> Conntrack has design assumptions that get broken by bridge netfilter.
> However, I fail to reproduce this problem.
> 
> Attaching a non-functional reproducer.  Can you please confirm whether the
> setup is correct (macvlan on top of a bridge)? I tried persistent flooding,
> but so far to no avail (the clones made by br_flood are processed serially,
> so it's not surprising that this doesn't trigger any problem).
> 
> I can see problematic skb_clone() calls in the bridge (br_flood and
> friends), but not in the macvlan driver. By the time the packet makes it to
> macvlan, the conntrack entry is already confirmed, so skb_clone() is safe.
> 
> Is there anything else about the setup that makes a difference?

The only reliable way we can reproduce this issue is to start docker and set up a macvlan network where the parent interface is a bridge on the host.  You do not even need to start a container.  See: https://docs.docker.com/network/network-tutorial-macvlan/

If the parent interface is not a bridge, the problem doesn't seem to happen.

If you physically disconnect the ethernet cable, the problem doesn't seem to happen either.
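
For reference, the minimal setup described above condenses to something like this (a sketch based on the commands in the description; interface names, addresses and the network name are placeholders):

# bridge with a physical port, plus a docker macvlan network on top of it
ip link add name br0 type bridge
ip link set eth0 master br0 up
ip link set br0 up
docker network create -d macvlan --subnet=10.0.101.0/24 --gateway=10.0.101.1 -o parent=br0 br0net
# no container needs to be started; the warnings appear with just the network defined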
Comment 3 mgutt 2023-09-12 18:11:40 UTC
As a side note: maybe the nf_nat_setup_info errors have the same root cause, since they can be solved by the same workarounds:
https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/?tab=comments
Comment 4 Dominique Martinet 2024-01-23 09:23:14 UTC
I'm also assuming this is related to that unraid thread (I came here looking for nf_nat_setup_info but also see these warnings). I've hit a kernel panic, attached below, on a setup similar to the initial post's (eth0 > br0 > a couple of macvlan interfaces, plus other classic bridge slaves, all on docker).

Looking at the exact RIP after loading up debug symbols, we're apparently in here:
nf_nat_setup_info > get_unique_tuple > find_appropriate_src > same_src
(currently on a 6.1.69 kernel, so it's a bit old)
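
(For anyone repeating the lookup: the kernel's scripts/faddr2line maps the func+offset from the oops to a source line, given an object built with debug info -- a sketch, the module path is assumed:)

./scripts/faddr2line /path/to/nf_nat.ko nf_nat_setup_info+0xa4d/0xd40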


I've run out of time for a few days, but if that gives you an idea please feel free to send me a patch to try -- unfortunately it's a machine I can't reboot too often, but I hit warnings similar to the above every few days, and this is the second panic I've gotten in a couple of months, so at least the warning will be relatively easy to confirm.
Now that I've confirmed it's a panic (I didn't have traces the last time it crashed) I'll set up kexec-kdump for the next time this happens, but if it's a race in RCU it probably won't help much -- next week I'll first take the time to decipher exactly which field was accessed at the time of the general protection fault.


[1554289.142290] general protection fault, probably for non-canonical address 0xc79af08c657a8def: 0000 [#2] PREEMPT SMP NOPTI
[1554289.142447] CPU: 21 PID: 841620 Comm: npm install Tainted: G      D W          6.1.0-17-amd64 #1  Debian 6.1.69-1
[1554289.142585] Hardware name: ASUS System Product Name/PRIME B650-PLUS, BIOS 0823 11/21/2022
[1554289.142719] RIP: 0010:nf_nat_setup_info+0xa4d/0xd40 [nf_nat]
[1554289.142865] Code: 89 ce 41 38 47 46 74 28 4d 8b bf 90 00 00 00 4d 85 ff 0f 84 41 fe ff ff 49 81 ef 90 00 00 00 0f 84 34 fe ff ff 0f b6 44 24 56 <41> 38 47 46 75 d8 49 8b 47 20 49 8b 57 28 48 33 44 24 30 48 33 54
[1554289.143155] RSP: 0018:ffffa085c061c550 EFLAGS: 00010286
[1554289.143302] RAX: 0000000000000006 RBX: ffff92b1b33c6500 RCX: ffffffffabc8ca40
[1554289.143448] RDX: ffff92ae51600000 RSI: 07e07d0650662282 RDI: 4e5012ca31d609fc
[1554289.143598] RBP: ffffa085c061c628 R08: ffffa085c061c528 R09: 0000000000000000
[1554289.143744] R10: bbabb09d7b6a920d R11: d84ca9f0a96a01c4 R12: 0000000000000000
[1554289.143894] R13: ffffa085c061c63c R14: ffffffffabc8ca40 R15: c79af08c657a8da9
[1554289.144044] FS:  00007f58c3720800(0000) GS:ffff92bd58740000(0000) knlGS:0000000000000000
[1554289.144200] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1554289.144356] CR2: 0000000007351000 CR3: 00000007115ac000 CR4: 0000000000750ee0
[1554289.144513] PKRU: 55555554
[1554289.144670] Call Trace:
[1554289.144817]  <IRQ>
[1554289.144966]  ? __die_body.cold+0x1a/0x1f
[1554289.145124]  ? die_addr+0x38/0x60
[1554289.145273]  ? exc_general_protection+0x221/0x4a0
[1554289.145421]  ? asm_exc_general_protection+0x22/0x30
[1554289.145569]  ? nf_nat_setup_info+0xa4d/0xd40 [nf_nat]
[1554289.145717]  ? krealloc+0x85/0xc0
[1554289.145862]  nf_nat_masquerade_ipv4+0x142/0x190 [nf_nat]
[1554289.146011]  masquerade_tg+0x7f/0xa0 [xt_MASQUERADE]
[1554289.146158]  nft_target_eval_xt+0x67/0xa0 [nft_compat]
[1554289.146306]  nft_do_chain+0x198/0x5d0 [nf_tables]
[1554289.146468]  nft_nat_do_chain+0xa4/0x109 [nft_chain_nat]
[1554289.146623]  nf_nat_inet_fn+0x161/0x320 [nf_nat]
[1554289.146773]  nf_nat_ipv4_out+0x4b/0x100 [nf_nat]
[1554289.146920]  nf_hook_slow+0x41/0xc0
[1554289.147067]  ip_output+0xe5/0x130
[1554289.147215]  ? __ip_finish_output+0x190/0x190
[1554289.147361]  ip_sabotage_in+0x64/0x70 [br_netfilter]
[1554289.147508]  nf_hook_slow+0x41/0xc0
[1554289.147651]  ip_rcv+0x89/0x1d0
[1554289.147792]  ? ip_sublist_rcv+0x240/0x240
[1554289.147927]  __netif_receive_skb_one_core+0x89/0xa0
[1554289.148066]  netif_receive_skb+0x115/0x160
[1554289.148201]  br_handle_frame_finish+0x334/0x5b0 [bridge]
[1554289.148348]  ? br_handle_local_finish+0x20/0x20 [bridge]
[1554289.148497]  br_nf_hook_thresh+0xec/0x100 [br_netfilter]
[1554289.148638]  ? br_handle_local_finish+0x20/0x20 [bridge]
[1554289.148787]  br_nf_pre_routing_finish+0x15e/0x440 [br_netfilter]
[1554289.148926]  ? br_handle_local_finish+0x20/0x20 [bridge]
[1554289.149076]  ? nf_nat_ipv4_pre_routing+0x45/0xb0 [nf_nat]
[1554289.149215]  br_nf_pre_routing+0x47f/0x580 [br_netfilter]
[1554289.149349]  ? br_nf_hook_thresh+0x100/0x100 [br_netfilter]
[1554289.149480]  br_handle_frame+0x233/0x3b0 [bridge]
[1554289.149614]  ? br_handle_local_finish+0x20/0x20 [bridge]
[1554289.149748]  ? br_handle_frame_finish+0x5b0/0x5b0 [bridge]
[1554289.149876]  __netif_receive_skb_core.constprop.0+0x263/0xef0
[1554289.149986]  ? update_load_avg+0x7e/0x780
[1554289.150098]  ? sched_clock_cpu+0x9/0xb0
[1554289.150203]  __netif_receive_skb_one_core+0x3c/0xa0
[1554289.150307]  process_backlog+0x81/0x120
[1554289.150408]  __napi_poll+0x2b/0x160
[1554289.150510]  net_rx_action+0x29e/0x350
[1554289.150611]  __do_softirq+0xc6/0x2ab
[1554289.150712]  do_softirq.part.0+0x5f/0x80
[1554289.150810]  </IRQ>
[1554289.150903]  <TASK>
[1554289.150998]  __local_bh_enable_ip+0x64/0x70
[1554289.151091]  __dev_queue_xmit+0x29f/0xd70
[1554289.151184]  ? __local_bh_enable_ip+0x64/0x70
[1554289.151278]  ? __dev_queue_xmit+0x29f/0xd70
[1554289.151372]  ? apparmor_nf_unregister+0x20/0x20
[1554289.151466]  ? nf_hook_slow+0x41/0xc0
[1554289.151560]  ip_finish_output2+0x297/0x560
[1554289.151653]  __ip_queue_xmit+0x171/0x450
[1554289.151748]  __tcp_transmit_skb+0xaa4/0xc00
[1554289.151842]  tcp_connect+0xb2a/0xe40
[1554289.151935]  tcp_v4_connect+0x412/0x520
[1554289.152026]  __inet_stream_connect+0xda/0x3e0
[1554289.152119]  ? tomoyo_socket_connect_permission+0xaf/0xf0
[1554289.152215]  inet_stream_connect+0x36/0x50
[1554289.152307]  __sys_connect+0xa4/0xd0
[1554289.152399]  __x64_sys_connect+0x14/0x20
[1554289.152488]  do_syscall_64+0x5b/0xc0
[1554289.152580]  ? exit_to_user_mode_prepare+0x40/0x1e0
[1554289.152672]  entry_SYSCALL_64_after_hwframe+0x64/0xce
[1554289.152763] RIP: 0033:0x7f58c390b25b
[1554289.152855] Code: 83 ec 18 89 54 24 0c 48 89 34 24 89 7c 24 08 e8 bb fa ff ff 8b 54 24 0c 48 8b 34 24 41 89 c0 8b 7c 24 08 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 08 e8 f1 fa ff ff 8b 44
[1554289.153065] RSP: 002b:00007ffeaa34cad0 EFLAGS: 00000293 ORIG_RAX: 000000000000002a
[1554289.153174] RAX: ffffffffffffffda RBX: 000000000734f130 RCX: 00007f58c390b25b
[1554289.153284] RDX: 0000000000000010 RSI: 00007ffeaa34cb90 RDI: 0000000000000092
[1554289.153395] RBP: 00007ffeaa34cb60 R08: 0000000000000000 R09: 0000000006fffe80
[1554289.153505] R10: 00007ffeaa34caac R11: 0000000000000293 R12: 0000000006fffed8
[1554289.153616] R13: 00007ffeaa34cb90 R14: 0000000000000010 R15: 00007f58c3720690
[1554289.153725]  </TASK>
[1554289.153836] Modules linked in: tls qrtr netconsole bluetooth jitterentropy_rng drbg ansi_cprng ecdh_generic ecc joydev veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter overlay binfmt_misc macvlan bridge stp llc nls_ascii nls_cp437 vfat fat intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm amdgpu irqbypass ghash_clmulni_intel snd_hda_codec_realtek sha512_ssse3 sha512_generic snd_hda_codec_generic sha256_ssse3 sha1_ssse3 snd_hda_codec_hdmi snd_hda_intel gpu_sched snd_intel_dspcfg drm_buddy snd_intel_sdw_acpi drm_display_helper snd_hda_codec cec snd_hda_core aesni_intel rc_core drm_ttm_helper snd_hwdep ttm eeepc_wmi crypto_simd snd_pcm asus_wmi cryptd snd_timer drm_kms_helper battery platform_profile sparse_keymap snd ledtrig_audio rfkill i2c_algo_bit rapl sp5100_tco pcspkr soundcore wmi_bmof ccp k10temp watchdog sg
[1554289.153864]  acpi_cpufreq button evdev drm msr fuse loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c crc32c_generic hid_generic usbhid hid sd_mod dm_mod ahci libahci xhci_pci libata xhci_hcd nvme nvme_core usbcore scsi_mod r8169 t10_pi realtek mdio_devres libphy crc32_pclmul crc64_rocksoft crc32c_intel crc64 crc_t10dif i2c_piix4 crct10dif_generic usb_common scsi_common crct10dif_pclmul crct10dif_common video wmi gpio_amdpt gpio_generic [last unloaded: netconsole]
[1554289.155637] ---[ end trace 0000000000000000 ]---
[1554289.547937] RIP: 0010:refill_obj_stock+0x5b/0x180
[1554289.548123] Code: 57 49 8b 47 10 48 39 f8 0f 84 9c 00 00 00 4c 89 ff e8 59 f3 ff ff 49 89 c6 e8 f1 d2 de ff 48 8b 45 00 a8 03 0f 85 c9 00 00 00 <65> 48 ff 00 e8 0c 0c df ff 49 89 6f 10 8b 45 18 85 c0 74 07 31 c0
[1554289.548474] RSP: 0018:ffffa085c0a139d0 EFLAGS: 00010046
[1554289.548652] RAX: ffff92b6c701b290 RBX: 0000000000000070 RCX: 0000000000000020
[1554289.548830] RDX: ffff92ae43c54000 RSI: 0000000000000002 RDI: ffff92ae43c54000
[1554289.549012] RBP: ffff92ae5179cd00 R08: ffff92af4cb61e90 R09: 0000000000000238
[1554289.549193] R10: 0000000000000002 R11: ffffa085c0a139d0 R12: 0000000000000212
[1554289.549374] R13: ffffffffa8fa4df1 R14: ffff92ae4d97cd00 R15: ffff92bd582adb40
[1554289.549553] FS:  00007f58c3720800(0000) GS:ffff92bd58740000(0000) knlGS:0000000000000000
[1554289.549734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1554289.549915] CR2: 0000000007351000 CR3: 00000007115ac000 CR4: 0000000000750ee0
[1554289.550098] PKRU: 55555554
[1554289.550285] Kernel panic - not syncing: Fatal exception in interrupt
[1554289.550972] Kernel Offset: 0x27c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[1554289.927344] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Thanks!
Comment 5 Dominique Martinet 2024-02-02 05:57:27 UTC
So I was able to reproduce this in a VM; I didn't try to simplify my setup much, I just reproduced what I had on the physical machine:
- gave it 20GB of ram and 20 cores
- single virtio network interface
- br0 contains that interface
- a couple of macvlan interfaces on top of br0
** one directly:
# The primary network interface
auto br0
iface br0 inet static
	bridge_ports ens4
	bridge_stp no
	# need an address to please network scripts
	address 10.255.255.241/32
	# workaround to access macvlan containers...
	post-up ip link add macvlan-lan link br0 type macvlan mode bridge
	post-up ip a add x.y.z.99/16 dev macvlan-lan
	post-up ip link set macvlan-lan up
	post-up ip r add default via x.y.z.1 dev macvlan-lan
	# remove useless ip
	post-up ip a del 10.255.255.241/32 dev br0
** others through docker (we probably don't need all 3, or actually even any at all -- I just happened to have these on the host):
docker network create -d macvlan -o parent=br0 --subnet=x.y.z.0/16 --gateway=x.y.z.1 dockernet
docker run --restart=always -d --network dockernet --name test --ip x.y.z.98 --network-alias test docker.io/alpine sleep infinity
docker run --restart=always -d --network dockernet --name test2 --ip x.y.z.97 --network-alias test2 docker.io/alpine sleep infinity
docker run --restart=always -d --network dockernet --name test3 --ip x.y.z.96 --network-alias test3 docker.io/alpine sleep infinity
- from outside, ping one of these IPs so there's at least a minimum of activity (I pinged one of the containers with `ping -i 0.2 -s 1500 x.y.z.98` from another host)
- run docker containers in a loop; docker sets up its own default docker0 bridge interface that NATs through the default exit route (in our case, through the 'macvlan-lan' interface):
while true; do docker run --rm -ti docker.io/alpine ping -c 1 x.y.z.20; done
(the pinged address is something else on the network, and probably doesn't matter)

After a few minutes I get the warning fairly reliably.
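
(To watch for the warning while the loop runs, following the kernel log is enough -- a sketch:)

dmesg -wT | grep -i -E 'nf_conntrack|nf_nat|cut here'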


Which means I could get a core dump. This is just a guess, but the culprit seems to be the interaction between these two tasks -- or at least something docker does when it starts a new container, as I can't reproduce without spawning new containers:
crash> bt 79591
PID: 79591    TASK: ffff917a46cb48c0  CPU: 11   COMMAND: "exe"
 #0 [ffffa2a0c5d3bb00] __schedule at ffffffffa342e60d
 #1 [ffffa2a0c5d3bb88] schedule at ffffffffa342ecfa
 #2 [ffffa2a0c5d3bba0] synchronize_rcu_expedited at ffffffffa2b32fae
 #3 [ffffa2a0c5d3bc58] ipv6_mc_down at ffffffffa33375b9
 #4 [ffffa2a0c5d3bc78] addrconf_ifdown at ffffffffa330987a
 #5 [ffffa2a0c5d3bd18] addrconf_sysctl_disable at ffffffffa330e7e1
 #6 [ffffa2a0c5d3bd98] proc_sys_call_handler at ffffffffa2e0429d
 #7 [ffffa2a0c5d3bdf0] vfs_write at ffffffffa2d59285
 #8 [ffffa2a0c5d3be88] ksys_write at ffffffffa2d595cb
 #9 [ffffa2a0c5d3bec0] do_syscall_64 at ffffffffa342015b
#10 [ffffa2a0c5d3bed8] __x64_sys_fcntl at ffffffffa2d717ed
#11 [ffffa2a0c5d3bf18] do_syscall_64 at ffffffffa3420167
#12 [ffffa2a0c5d3bf50] entry_SYSCALL_64_after_hwframe at ffffffffa36000dc
    RIP: 000056378c84536e  RSP: 000000c000cbfb40  RFLAGS: 00000206
    RAX: ffffffffffffffda  RBX: 0000000000000008  RCX: 000056378c84536e
    RDX: 0000000000000002  RSI: 000000c000cbfdd6  RDI: 0000000000000008
    RBP: 000000c000cbfb80   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000206  R12: 000000c000cbfcc0
    R13: 0000000000000000  R14: 000000c0000061a0  R15: 00007f2f190c1818
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

and

crash> bt
PID: 128      TASK: ffff917ad384e100  CPU: 8    COMMAND: "kworker/u40:14"
 #0 [ffffa2a0c0298a80] machine_kexec at ffffffffa2a706af
 #1 [ffffa2a0c0298ad8] __crash_kexec at ffffffffa2b6bdf7
 #2 [ffffa2a0c0298b98] panic at ffffffffa33dcdba
 #3 [ffffa2a0c0298c18] check_panic_on_warn.cold at ffffffffa33dcfab
 #4 [ffffa2a0c0298c20] __warn at ffffffffa2a9acd9
 #5 [ffffa2a0c0298c58] report_bug at ffffffffa33b0a32
 #6 [ffffa2a0c0298c98] handle_bug at ffffffffa3420721
 #7 [ffffa2a0c0298ca8] exc_invalid_op at ffffffffa3420913
 #8 [ffffa2a0c0298cc0] asm_exc_invalid_op at ffffffffa3600b36
    [exception RIP: __nf_conntrack_confirm+1422]
    RIP: ffffffffc0c2232e  RSP: ffffa2a0c0298d78  RFLAGS: 00010202
    RAX: 0000000000000188  RBX: 0000000000002468  RCX: ffff917a40f78800
    RDX: ffffffffc0c3f1a0  RSI: 0000000000000001  RDI: ffffffffc0c3f784
    RBP: 00000000000381e1   R8: ffffa2a0c0298d40   R9: 0000000000000000
    R10: a6667d640558100e  R11: 56e02d92e5e345fa  R12: 00000000000381e1
    R13: 0000000000002468  R14: ffff917a4adbe300  R15: 0000000000000068
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffffa2a0c0298de0] nf_confirm at ffffffffc0c27b30 [nf_conntrack]
#10 [ffffa2a0c0298e08] nf_hook_slow at ffffffffa324cac1
#11 [ffffa2a0c0298e38] ip_local_deliver at ffffffffa32575de
#12 [ffffa2a0c0298e90] __netif_receive_skb_one_core at ffffffffa31c53b9
#13 [ffffa2a0c0298eb8] process_backlog at ffffffffa31c5631
#14 [ffffa2a0c0298ee8] __napi_poll at ffffffffa31c615b
#15 [ffffa2a0c0298f10] net_rx_action at ffffffffa31c668e
#16 [ffffa2a0c0298f98] __softirqentry_text_start at ffffffffa38000c6
#17 [ffffa2a0c0298ff0] do_softirq at ffffffffa2aa19af
--- <IRQ stack> ---
#18 [ffffa2a0c04f3da0] do_softirq at ffffffffa2aa19af
    [exception RIP: unknown or invalid address]
    RIP: ffff917a42bac000  RSP: ffffffffc0a8a395  RFLAGS: ffff917a42bac838
    RAX: ffffffffc0a8a06d  RBX: 0000000000000000  RCX: ffff917a42bac000
    RDX: 0000000000000000  RSI: ffff917a4526b800  RDI: ffffa2a0c04f3e88
    RBP: 0000000000000202   R8: ffff917a4526b800   R9: ffff917a42bce980
    R10: ffffffffa31bf05f  R11: ffffffffa2aa1b24  R12: ffff917a544b720c
    R13: ffff917a4526b800  R14: ffffffffa31beda5  R15: 0000000000000000
    ORIG_RAX: 0000000000000000  CS: ffffa2a0c04f3e40  SS: ffffa2a0c04f3e40
bt: WARNING: possibly bogus exception frame
#19 [ffffa2a0c04f3e90] process_one_work at ffffffffa2abd927
#20 [ffffa2a0c04f3ed8] worker_thread at ffffffffa2abdecd
#21 [ffffa2a0c04f3f18] kthread at ffffffffa2ac54fa
#22 [ffffa2a0c04f3f50] ret_from_fork at ffffffffa2a02062


(this is still on debian's 6.1.0-17-amd64, so 6.1.69)

Now that I have a setup I can mess with without disturbing all my colleagues, I'll first confirm whether this is still a problem on master, and depending on that, look further into the crash (either simplify the reproducer or just dig into it more directly).
Comment 6 Dominique Martinet 2024-02-04 05:27:02 UTC
I'm starting to be unsure whether this is related to the nf_nat_setup_info bug I got the other day... But at least the __nf_conntrack_confirm "Another skb with the same unconfirmed conntrack may win the race" warning still happens as of master (a few days old - on 021533194476 ("Kconfig: Disable -Wstringop-overflow for GCC globally"))

[ 5376.556356] [      C0] ------------[ cut here ]------------
[ 5376.556959] [      C0] WARNING: CPU: 0 PID: 1160242 at net/netfilter/nf_conntrack_core.c:1198 __nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.558158] [      C0] Modules linked in: veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter overlay macvlan bridge stp llc binfmt_misc ppdev bochs drm_vram_helper drm_kms_helper drm_ttm_helper ttm parport_pc parport evdev pcspkr joydev button sg serio_raw drm fuse dm_mod msr loop configfs efi_pstore qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq sd_mod t10_pi crc64_rocksoft crc64 ata_generic crc_t10dif crct10dif_generic crct10dif_common ata_piix virtio_net libata net_failover virtio_scsi failover scsi_mod virtio_pci psmouse virtio virtio_pci_legacy_dev virtio_pci_modern_dev floppy virtio_ring i2c_piix4 scsi_common
[ 5376.566250] [      C0] CPU: 0 PID: 1160242 Comm: kworker/u40:1 Kdump: loaded Not tainted 6.8.0-rc2+ #2
[ 5376.566984] [      C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 5376.567829] [      C0] Workqueue: events_unbound macvlan_process_broadcast [macvlan]
[ 5376.568427] [      C0] RIP: 0010:__nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.569052] [      C0] Code: 8b 64 24 10 e9 6e fe ff ff 44 0f b7 62 0c 31 ed 85 c9 74 05 41 0f b7 6e 0c 66 44 39 e5 0f 85 2a ff ff ff eb b9 45 31 e4 eb ea <0f> 0b 4c 89 cf e8 8a 67 f7 db 8b 44 24 08 41 39 c7 75 18 be 00 02
[ 5376.570681] [      C0] RSP: 0018:ffffb62fc0003d70 EFLAGS: 00010202
[ 5376.571138] [      C0] RAX: 0000000000000188 RBX: 0000000000009731 RCX: 0000000000000102
[ 5376.571762] [      C0] RDX: 0000000000000331 RSI: 0000000000000001 RDI: ffff934ac20cde00
[ 5376.572384] [      C0] RBP: 000000000000c102 R08: 0111efe700000000 R09: ffffffffc0c27984
[ 5376.573030] [      C0] R10: dedebea46dac82d1 R11: fa59405d9ecc4042 R12: 000000000000c102
[ 5376.573654] [      C0] R13: 0000000000009731 R14: ffff934b294e3600 R15: 0000000000000331
[ 5376.574277] [      C0] FS:  0000000000000000(0000) GS:ffff934b3f800000(0000) knlGS:0000000000000000
[ 5376.575001] [      C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5376.575505] [      C0] CR2: 00007f0484cb38d0 CR3: 0000000035048000 CR4: 00000000000006f0
[ 5376.576128] [      C0] Call Trace:
[ 5376.576352] [      C0]  <IRQ>
[ 5376.576537] [      C0]  ? __nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.577100] [      C0]  ? __warn+0x7d/0x130
[ 5376.577394] [      C0]  ? __nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.577929] [      C0]  ? report_bug+0x16d/0x1a0
[ 5376.578259] [      C0]  ? handle_bug+0x41/0x70
[ 5376.578571] [      C0]  ? exc_invalid_op+0x13/0x60
[ 5376.578923] [      C0]  ? asm_exc_invalid_op+0x16/0x20
[ 5376.579294] [      C0]  ? __nf_conntrack_confirm+0x3fc/0x610 [nf_conntrack]
[ 5376.579829] [      C0]  nf_confirm+0x2d0/0x2f0 [nf_conntrack]
[ 5376.580262] [      C0]  nf_hook_slow+0x41/0xf0
[ 5376.580582] [      C0]  ip_local_deliver+0xce/0x110
[ 5376.580958] [      C0]  ? __pfx_ip_local_deliver_finish+0x10/0x10
[ 5376.581412] [      C0]  __netif_receive_skb_one_core+0x89/0xa0
[ 5376.581845] [      C0]  process_backlog+0x81/0x120
[ 5376.582186] [      C0]  __napi_poll+0x2b/0x1c0
[ 5376.582498] [      C0]  net_rx_action+0x145/0x2c0
[ 5376.582843] [      C0]  ? activate_task+0x61/0x120
[ 5376.583186] [      C0]  ? wakeup_preempt+0x33/0x70
[ 5376.583526] [      C0]  __do_softirq+0xbe/0x2aa
[ 5376.583848] [      C0]  do_softirq.part.0+0x3d/0x60
[ 5376.584198] [      C0]  </IRQ>
[ 5376.584390] [      C0]  <TASK>
[ 5376.584583] [      C0]  __local_bh_enable_ip+0x64/0x70
[ 5376.584974] [      C0]  netif_rx+0xdf/0xf0
[ 5376.585260] [      C0]  macvlan_broadcast+0xed/0x190 [macvlan]
[ 5376.585695] [      C0]  macvlan_process_broadcast+0xe9/0x150 [macvlan]
[ 5376.586187] [      C0]  process_one_work+0x174/0x340
[ 5376.586543] [      C0]  worker_thread+0x277/0x390
[ 5376.586886] [      C0]  ? __pfx_worker_thread+0x10/0x10
[ 5376.587265] [      C0]  kthread+0xe4/0x110
[ 5376.587551] [      C0]  ? __pfx_kthread+0x10/0x10
[ 5376.587885] [      C0]  ret_from_fork+0x30/0x50
[ 5376.588205] [      C0]  ? __pfx_kthread+0x10/0x10
[ 5376.588539] [      C0]  ret_from_fork_asm+0x1b/0x30
[ 5376.588911] [      C0]  </TASK>


I could reproduce the bug a few times, then had much more trouble reproducing once the work day was over. Looking at the trace, this seems to be about broadcast (my previous trace wasn't so clear, but the first trace given in the original report also mentions broadcast) -- with that in mind, my ping probably wasn't of any help and I was relying on background noise from the other folks on the same subnet. That would also explain why it's hard to reproduce in the lab if something specific happens at that time... I've dumped all the broadcasts I could see on the hypervisor around the time of the crash, but there isn't anything odd; I'll try to isolate the VM and replay the pcap later to see if I can force it rather than just "wait a few hours".
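
(The capture/replay plan is along these lines -- a sketch, with interface and file names assumed:)

# on the hypervisor: record broadcast/multicast traffic around the warning
tcpdump -i br0 -w bcast.pcap 'broadcast or multicast'
# later: replay it into the isolated VM's uplink to try to retrigger
tcpreplay --intf1=ens4 bcast.pcap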


Now, the reason I don't think this warning is related to the nf_nat_setup_info bug is that this warning can happen from outside traffic alone while all the containers are idle; the last message referred to mcast teardown, but that was just a fluke -- I've reproduced it with the containers purely idling. OTOH, the nat setup info and other crashes I've seen all happened shortly after a container started, so something about network creation/teardown is probably involved (which is why I had been running docker run in a loop).

I've also started to think I don't really need macvlan for my setup, so the easiest way out of this mess will probably just be to convert to something simpler... But curiosity killed the cat, so I'll probably keep looking a bit further. I might not have much of a lead on the weird crash, but this warning can definitely be understood with a little more time -- if I find some trigger payload, it'll be easy to simplify the setup one step at a time and have something you could look at.
Comment 7 Dominique Martinet 2024-02-05 07:19:33 UTC
> Now the reason I don't think this warning is related to the nf_nat_setup_info
> bug, is that this warning can happen when containers all are idle just from
> outside traffic, last message referred to mcast teardown but it was just a
> fluke, I've reproduced on pure idling with containers up. OTOH the nat setup
> info and other crashes I've seen were all shortly after a container started,
> so something about network creation/teardown is probably related (which is
> why I had been running a docker run in a loop)

This might just have been a wrong guess; I got a warning similar to the crash without starting any containers (still on master):

[35150.155883] [     C14] ------------[ cut here ]------------
[35150.156167] [     C14] WARNING: CPU: 14 PID: 139 at net/netfilter/nf_nat_core.c:678 nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.156816] [     C14] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter overlay macvlan bridge stp llc binfmt_misc ppdev bochs drm_vram_helper drm_kms_helper drm_ttm_helper ttm parport_pc parport pcspkr joydev button evdev sg serio_raw drm dm_mod fuse msr loop efi_pstore configfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq sd_mod ata_generic t10_pi virtio_net crc64_rocksoft crc64 ata_piix crc_t10dif crct10dif_generic crct10dif_common libata net_failover failover virtio_scsi scsi_mod psmouse virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev floppy i2c_piix4 virtio_ring scsi_common
[35150.161299] [     C14] CPU: 14 PID: 139 Comm: kworker/u40:4 Kdump: loaded Not tainted 6.8.0-rc2+ #2
[35150.161818] [     C14] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[35150.162429] [     C14] Workqueue: events_unbound macvlan_process_broadcast [macvlan]
[35150.162879] [     C14] RIP: 0010:nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.163240] [     C14] Code: 89 87 80 00 00 00 48 85 ff 0f 84 c7 fa ff ff 0f b6 07 84 c0 0f 85 67 fa ff ff e9 b0 fa ff ff 0f 0b f6 c4 01 0f 84 22 f8 ff ff <0f> 0b 31 c0 e9 bb fa ff ff 48 8b 74 24 20 48 8d 54 24 38 4c 89 cf
[35150.164409] [     C14] RSP: 0018:ffffb13780420c70 EFLAGS: 00010282
[35150.164746] [     C14] RAX: 0000000000000180 RBX: 0000000000000000 RCX: ffff8c530e6f2ae0
[35150.165209] [     C14] RDX: 0000000000000000 RSI: ffffb13780420d6c RDI: ffff8c533ccd0f00
[35150.165672] [     C14] RBP: ffffb13780420d50 R08: 00000000850110ac R09: 0000000000000000
[35150.166125] [     C14] R10: 0000000000000000 R11: ffff8c53025be9c0 R12: 0000000000000000
[35150.166591] [     C14] R13: ffffb13780420d6c R14: ffff8c533ccd0f00 R15: ffff8c533ccd0f00
[35150.167074] [     C14] FS:  0000000000000000(0000) GS:ffff8c537fb80000(0000) knlGS:0000000000000000
[35150.167592] [     C14] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[35150.167952] [     C14] CR2: 00005580c7b6f008 CR3: 000000000d578000 CR4: 00000000000006f0
[35150.168418] [     C14] Call Trace:
[35150.168583] [     C14]  <IRQ>
[35150.168715] [     C14]  ? nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.169055] [     C14]  ? __warn+0x7d/0x130
[35150.169254] [     C14]  ? nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.169581] [     C14]  ? report_bug+0x16d/0x1a0
[35150.169815] [     C14]  ? handle_bug+0x41/0x70
[35150.170035] [     C14]  ? exc_invalid_op+0x13/0x60
[35150.170283] [     C14]  ? asm_exc_invalid_op+0x16/0x20
[35150.170564] [     C14]  ? nf_nat_setup_info+0xb10/0xd90 [nf_nat]
[35150.170898] [     C14]  ? rt_dst_alloc+0x2a/0xb0
[35150.171147] [     C14]  __nf_nat_alloc_null_binding+0x5e/0x90 [nf_nat]
[35150.171510] [     C14]  nf_nat_inet_fn+0x2c3/0x320 [nf_nat]
[35150.171822] [     C14]  nf_nat_ipv4_local_in+0x54/0x160 [nf_nat]
[35150.172154] [     C14]  nf_hook_slow+0x41/0xf0
[35150.172386] [     C14]  ip_local_deliver+0xce/0x110
[35150.172645] [     C14]  ? __pfx_ip_local_deliver_finish+0x10/0x10
[35150.173001] [     C14]  __netif_receive_skb_one_core+0x89/0xa0
[35150.173312] [     C14]  process_backlog+0x81/0x120
[35150.173564] [     C14]  __napi_poll+0x2b/0x1c0
[35150.173796] [     C14]  net_rx_action+0x145/0x2c0
[35150.174042] [     C14]  ? activate_task+0x61/0x120
[35150.174287] [     C14]  ? wakeup_preempt+0x33/0x70
[35150.174529] [     C14]  __do_softirq+0xbe/0x2aa
[35150.174765] [     C14]  do_softirq.part.0+0x3d/0x60
[35150.175018] [     C14]  </IRQ>
[35150.175158] [     C14]  <TASK>
[35150.175302] [     C14]  __local_bh_enable_ip+0x64/0x70
[35150.175572] [     C14]  netif_rx+0xdf/0xf0
[35150.175784] [     C14]  macvlan_broadcast+0xed/0x190 [macvlan]
[35150.176098] [     C14]  macvlan_process_broadcast+0xe9/0x150 [macvlan]
[35150.176461] [     C14]  process_one_work+0x174/0x340
[35150.176727] [     C14]  worker_thread+0x277/0x390
[35150.176964] [     C14]  ? __pfx_worker_thread+0x10/0x10
[35150.177234] [     C14]  kthread+0xe4/0x110
[35150.177447] [     C14]  ? __pfx_kthread+0x10/0x10
[35150.177680] [     C14]  ret_from_fork+0x30/0x50
[35150.177908] [     C14]  ? __pfx_kthread+0x10/0x10
[35150.178149] [     C14]  ret_from_fork_asm+0x1b/0x30
[35150.178412] [     C14]  </TASK>


Looking at the panic-on-warn crash, we have:
nf_nat_inet_fn()
  going into the !nf_nat_initialized() check
  lpriv->entries is NULL at the time of the crash, so the sub-hooks are skipped
  nf_nat_alloc_null_binding -> nf_nat_setup_info
  hitting the WARN_ON(nf_nat_initialized)

Unfortunately, by the time we got there, there was no trace of the other thread that would have processed this nf_conn; but if the timing had been slightly different the warning might not have hit, and I'm not surprised this can crash just after that, as I've seen.

So we have two independent threads operating on the same "ct"?
It comes from nf_nat_inet_fn's `ct = nf_ct_get(skb, &ctinfo)`, so I guess we have two threads processing two skbs sharing the same "_nfct"...?
Snooping around with bpftrace (the check filters on IP_CT_NEW to reduce spam), I see that this seems to be the case:
# bpftrace -e 'kfunc:nf_nat_inet_fn { $ct = args->skb->_nfct; if (($ct & 7) == 2) {
   printf("%d/%s: %x: %x: %s\n", tid, comm, args->skb, args->skb->_nfct, kstack);
}}'
(trimmed the output a bit for brevity: the format shows pid/comm (because swappers all have pid 0...), followed by the skb, then our 'ct' field with the ctinfo bits, and a stack; the `($ct & 7) == 2` filter matches because the low three bits of skb->_nfct hold the ctinfo and IP_CT_NEW is 2)

0/swapper/10: 42f37200: 44b82502: 
        nf_nat_inet_fn+5
        nf_nat_ipv4_pre_routing+69
        nf_hook_slow+65
        br_nf_pre_routing+1288
        br_handle_frame+543
        __netif_receive_skb_core.constprop.0+521
        __netif_receive_skb_list_core+314
        netif_receive_skb_list_internal+461
        napi_complete_done+110
        virtnet_poll+1009
        __napi_poll+43
        net_rx_action+332
        __do_softirq+232

0/swapper/10: 42f37200: 44b82502: 
        (same stack as above)

138/kworker/u40:3: 42e70300: 44b82502:
        nf_nat_inet_fn+5
        nf_nat_ipv4_pre_routing+69
        nf_hook_slow+65
        ip_rcv+302
        __netif_receive_skb_one_core+137
        process_backlog+129
        __napi_poll+43
        net_rx_action+332
        __do_softirq+232
        do_softirq.part.0+114
        __local_bh_enable_ip+129
        netif_rx+216
        macvlan_broadcast+242
        macvlan_process_broadcast+239
        process_one_work+372
        worker_thread+645

138/kworker/u40:3: 42e70300: 44b82502: 
        (same stack as above)

I'm not sure if being listed twice in a row is an artifact of something, but that's not a problem; two different workers working on the same ct, though, definitely looks like one.

@Florian (or anyone else), does that information ring a bell anywhere?

I've unfortunately not been able to reproduce it when disconnected from the real subnet yet; it looks like it's not just a matter of spamming broadcast packets into the VM. I'll try to spend a bit more time on a reproducer if you need one, but if this is enough to figure something out I'll be happy to just try any patch in my environment -- my VM hits the warning ~3-4 times/day, so it's good enough to validate something if we're not in a hurry.

Thanks!
Comment 8 Florian Westphal 2024-02-05 13:53:13 UTC
Created attachment 305824 [details]
Add bridge netfilter specific workaround for broadcast race bug

Could someone with a problematic setup test the attached patch?  It adds an (ugly) workaround for broad/multicast packets.
Comment 9 Dominique Martinet 2024-02-05 23:49:37 UTC
Thanks for the prompt reply!

I've rebuilt a kernel with this patch and will let the VM sit for a while; I'll report back in ~12h, as that should be enough time to assess whether the problems are gone, since I normally get 3-4 warnings during the work day.

(at least feature-wise, I'm not sure I understand the implications of disabling NAT/conntrack helpers, but it doesn't look like anything I care about broke)
Comment 10 Dominique Martinet 2024-02-06 08:52:55 UTC
The VM hasn't encountered any warnings so far (~10 hours).

I'll leave it up for the next few days and check periodically, reporting if there's any trouble.
I'm away this week so I can't upgrade the real machine, but I'll make some time next week to upgrade (the patch backports almost cleanly to 6.1) and see if that helps with the mysterious crashes we get about once a month...!
Comment 11 Dominique Martinet 2024-02-19 04:49:04 UTC
Sorry for the lack of updates... I've had zero warnings on the VM since the 6th -- the host crashed (unpatched) on the 9th, rebooted itself into my 6.1 backport of this patch, and hasn't exhibited any warnings since either.

I'd say this gets rid of this class of warnings :)


I'll keep an eye out for further crashes -- these only occurred once a month or so -- but I have every reason to believe that's also fixed. Thanks a lot!

I'd appreciate it if you'd cc me when submitting the patch, just for follow-up (I'll retest & send a Tested-by at least); I'm not sure we want to backport something this big to 6.1, but I can live with rebuilding security updates for a while.
Comment 12 Artem S. Tashkinov 2024-02-21 08:02:30 UTC
Florian, any updates on this? You've not pushed the commit yet, it seems.
Comment 13 Dominique Martinet 2024-04-19 04:43:33 UTC
Something close to the patch I tested made it upstream as 62e7151ae3eb465e0ab52a20c941ff33bb6332e9 in v6.8, and it has also been backported to semi-recent stable trees:
cb734975b0ffa688ff6cc0eed463865bf07b6c01 in v6.7.9
80cd0487f630b5382734997c3e5e3003a77db315 in v6.6.21
2b1414d5e94e477edff1d2c79030f1d742625ea0 in v6.1.81
7c3f28599652acf431a2211168de4a583f30b6d5 in v5.15.151
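
(A quick way to check whether a given tree already contains the fix -- a sketch using the v6.8 commit id; substitute the relevant stable id:)

git merge-base --is-ancestor 62e7151ae3eb465e0ab52a20c941ff33bb6332e9 HEAD && echo 'fix present'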

I'll be switching to it later, but I'll assume it works until proven otherwise.

Closing as resolved. Thanks Florian & Pablo!
Comment 14 Etienne Lessard 2024-05-22 21:26:42 UTC
After upgrading from 5.15.148 to 5.15.155, we saw a change in behavior in our docker + IP multicast network setup, and we've identified that commit 7c3f28599652acf431a2211168de4a583f30b6d5 was responsible for the change.

I'm still a bit early in the investigation, but I wanted to share what I've found (in case someone else ends up with the same issue), and because I'm also interested in any suggestions on how this could be worked around.

The problem is the following: on a host, we have a "docker bridge" network (composed of vethX interfaces + a bridge interface) and an eth0 NIC connected to an external network. The "docker bridge" network is on a 172.17.0.0/16 subnet, and eth0 has a 192.168.1.10/24 IP. IP forwarding is enabled, so a container connected to that "docker bridge" network can reach hosts on the external network by using the bridge interface's IP as the gateway, and the packets get masqueraded with eth0's IP before being sent on the external network (via the iptables rules added by docker). Nothing special here, and that part is still working fine.
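
(For context, the masquerade rule docker typically installs for such a bridge network looks roughly like the following -- the exact chain layout varies across docker versions, and the bridge name docker0 is an assumption:)

iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE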

Where it gets interesting: we are also doing IP multicast routing between the docker network and eth0. We're running smcrouted, with a (*, G) "multicast route" that says to forward the traffic for the 239.255.255.250 multicast group from the bridge interface (IIF) to eth0 (OIF).
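
(A sketch of the smcroute.conf equivalent of that route; the bridge interface name docker0 stands in for our actual bridge:)

# forward the 239.255.255.250 group from the docker bridge (IIF) to eth0 (OIF)
mgroup from docker0 group 239.255.255.250
mroute from docker0 group 239.255.255.250 to eth0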

Before this patch (e.g. on 5.15.148), the forwarded multicast packets going out on eth0 had their source IP address NATed. After this patch, the source IP address is no longer NATed, so the source IP is instead the IP of the docker container.
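
(An easy way to observe the difference is to watch whether the source address on the wire is eth0's 192.168.1.10 or the container's 172.17.x.x address:)

tcpdump -ni eth0 host 239.255.255.250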

I haven't tried on a mainline kernel... yet.

Is this the expected behavior from now on? Any suggestions?

Thanks