Every time I start IPoIB with any 3.1-rcX git version I have tested so far, I get a kernel panic. This did not happen with 3.0.

fslab2 login: [ 114.392408] EXT4-fs (sdc): barriers disabled
[ 114.449737] EXT4-fs (sdc): mounted filesystem with writeback data mode. Opts: journal_async_commit,barrier=0,data=writeback
[ 240.944030] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[ 240.948007] IP: [<ffffffffa0366ce9>] ipoib_start_xmit+0x39/0x280 [ib_ipoib]
[ 240.948007] PGD 1f964f067 PUD 1f9bf2067 PMD 0
[ 240.948007] Oops: 0000 [#1] SMP
[ 240.948007] CPU 1
[ 240.948007] Modules linked in: ext4 mbcache jbd2 crc16 nfsd ib_umad rdma_ucm rdma_cm iw_cm ib_addr ib_uverbs ib_ipoib sg ib_cm ib_sa ipv6 sd_mod crc_t10dif loop arcmsr md_mod pcspkr ib_mthca ib_mad ib_core 8250_pnp fuse af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc btrfs lzo_decompress lzo_compress zlib_deflate crc32c libcrc32c crypto_hash crypto_algapi ata_generic pata_acpi e1000 pata_amd sata_nv libata scsi_mod unix [last unloaded: scsi_wait_scan]
[ 240.948007]
[ 240.948007] Pid: 0, comm: kworker/0:0 Not tainted 3.1.0-rc2+ #29 Supermicro H8DCE/H8DCE
[ 240.948007] RIP: 0010:[<ffffffffa0366ce9>] [<ffffffffa0366ce9>] ipoib_start_xmit+0x39/0x280 [ib_ipoib]
[ 240.948007] RSP: 0018:ffff8801ffc03c10 EFLAGS: 00010246
[ 240.948007] RAX: 0000000000000000 RBX: ffff8801f99ea000 RCX: 0000000000004420
[ 240.948007] RDX: 0000000000000000 RSI: ffff8801f99ea000 RDI: ffff8801f9afd500
[ 240.948007] RBP: ffff8801ffc03c40 R08: ffff8801f940d49c R09: ffff8801f9852240
[ 240.948007] R10: 0000000000000000 R11: 0000000000000020 R12: ffff8801f9afd500
[ 240.948007] R13: 0000000000000050 R14: ffff8801f99ea600 R15: ffff8801f9852280
[ 240.948007] FS: 00007f0b66016700(0000) GS:ffff8801ffc00000(0000) knlGS:0000000000000000
[ 240.948007] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 240.948007] CR2: 0000000000000040 CR3: 00000001f9a65000 CR4: 00000000000006e0
[ 240.948007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 240.948007] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 240.948007] Process kworker/0:0 (pid: 0, threadinfo ffff8800bfe82000, task ffff8800bfe6f1a0)
[ 240.948007] Stack:
[ 240.948007] 0000000000000010 ffff8801f9afd500 0000000000004420 0000000000000050
[ 240.948007] ffff8801f99ea000 ffff8801f9852280 ffff8801ffc03ca0 ffffffff812cd5e0
[ 240.948007] 0000000000000001 ffffffffa03721a0 ffffffff8131f680 ffff8801fa110540
[ 240.948007] Call Trace:
[ 240.948007] <IRQ>
[ 240.948007] [<ffffffff812cd5e0>] dev_hard_start_xmit+0x2a0/0x590
[ 240.948007] [<ffffffff8131f680>] ? arp_create+0x70/0x200
[ 240.948007] [<ffffffff812e8e1f>] sch_direct_xmit+0xef/0x1c0
[ 240.948007] [<ffffffff812cd9f9>] dev_queue_xmit+0x129/0x3b0
[ 240.948007] [<ffffffff8131f853>] arp_send+0x43/0x50
[ 240.948007] [<ffffffff8131f96b>] arp_solicit+0x10b/0x240
(gdb) l *(ipoib_start_xmit+0x39)
0x1d19 is in ipoib_start_xmit (include/net/dst.h:91).
86	};
87	};
88
89	static inline struct neighbour *dst_get_neighbour(struct dst_entry *dst)
90	{
91		return rcu_dereference(dst->_neighbour);
92	}
93
94	static inline struct neighbour *dst_get_neighbour_raw(struct dst_entry *dst)
95	{
This seems to be caused by commit 69cce1d1404968f78b177a0314f5822d5afdbbfb. After resolving dev_hard_start_xmit+0x2a0 and then checking where .ndo_start_xmit is set in ipoib, I see that the crash happens in ipoib_start_xmit(). While I'm not familiar with that code at all, the right fix seems to be to test for likely(skb_dst(skb)) and only then call n = dst_get_neighbour(skb_dst(skb)).

Btw, would it be possible not to use single-letter variable names? Just checking where 'n' is used is horrible, as searching the code for a single letter does not work well.

Thanks,
Bernd
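For readers following along, the guard proposed above can be sketched in userspace roughly as follows. This is only an illustration, not the attached patch: the three structs are hypothetical stand-ins and do not match the real kernel layouts, lookup_neighbour() is an invented helper name, rcu_dereference() is left out, and likely() is assumed to expand to the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; NOT the real kernel struct layouts. */
struct neighbour { int id; };
struct dst_entry { struct neighbour *_neighbour; };
struct sk_buff   { struct dst_entry *dst; };

/* Branch hint as used in the kernel (assumes GCC or Clang). */
#define likely(x) __builtin_expect(!!(x), 1)

/* Simplified accessor: returns the dst attached to the skb, possibly NULL. */
static struct dst_entry *skb_dst(const struct sk_buff *skb)
{
    return skb->dst;
}

/* Like the helper in the gdb listing (minus RCU): dereferences dst unconditionally. */
static struct neighbour *dst_get_neighbour(struct dst_entry *dst)
{
    return dst->_neighbour;
}

/* Hypothetical helper: only look up the neighbour when a dst is present. */
static struct neighbour *lookup_neighbour(const struct sk_buff *skb)
{
    struct neighbour *neigh = NULL;

    assert(skb != NULL);
    if (likely(skb_dst(skb)))
        neigh = dst_get_neighbour(skb_dst(skb));

    return neigh; /* NULL means "no dst or no neighbour"; the caller must handle it */
}
```

Without the check, an skb that carries no dst would reach dst_get_neighbour() with a NULL pointer, which matches the location gdb reports for the oops.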
Created attachment 68942: Patch to fix the problem.
Created attachment 68952: Rename 'n' into something more sane.
Could someone please also check the usage of likely()? As I'm running into an unlikely() condition, maybe it is not that unlikely?

Thanks,
Bernd
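As background to the likely() question, the sketch below shows the usual shape of the hint macros, assuming GCC or Clang. The two helper functions are hypothetical and exist only to show that the hint does not change which branch is taken; a wrong hint affects code layout and branch prediction, not the result.

```c
#include <stddef.h>

/* Usual shape of the kernel branch hints (assumes GCC or Clang). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: same test, hinted as the common case. */
static int has_value_hint_likely(const void *ptr)
{
    if (likely(ptr))
        return 1;
    return 0;
}

/* Hypothetical helper: same test, hinted as the rare case. */
static int has_value_hint_unlikely(const void *ptr)
{
    if (unlikely(ptr))
        return 1;
    return 0;
}
```

Both helpers return the same value for any input, so whether a given likely() in the driver is well chosen is a performance question, separate from the NULL dereference itself.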
Can you submit that patch from comment #3 for review to netdev@vger.kernel.org please?
And Cc the relevant maintainers and mailing lists (scripts/get_maintainer.pl will help you there).
I already sent the patch to netdev and linux-rdma today.

Cheers,
Bernd
Fixed by commit 22cfb0bf6721bb1f865f67bc21e3c36c272faf36.
Thx.