Bug 41212

Summary: [regression] [3.1-git] ipoib causes kernel panic (NULL pointer dereference)
Product: Networking Reporter: Bernd Schubert (bernd.schubert)
Component: Other Assignee: Arnaldo Carvalho de Melo (acme)
Status: CLOSED CODE_FIX    
Severity: normal CC: davem, florian, linux-rdma, maciej.rutecki, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.1-git Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 40982    
Attachments: Patch to fix the problem.
Rename 'n' into something more sane.

Description Bernd Schubert 2011-08-15 15:40:52 UTC
Every time I start IPoIB with any 3.1-rcX git version I have tested so far, I get a kernel panic. This did not happen with 3.0.


fslab2 login: [  114.392408] EXT4-fs (sdc): barriers disabled
[  114.449737] EXT4-fs (sdc): mounted filesystem with writeback data mode. Opts: journal_async_commit,barrier=0,data=writeback
[  240.944030] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[  240.948007] IP: [<ffffffffa0366ce9>] ipoib_start_xmit+0x39/0x280 [ib_ipoib]
[  240.948007] PGD 1f964f067 PUD 1f9bf2067 PMD 0
[  240.948007] Oops: 0000 [#1] SMP
[  240.948007] CPU 1
[  240.948007] Modules linked in: ext4 mbcache jbd2 crc16 nfsd ib_umad rdma_ucm rdma_cm iw_cm ib_addr ib_uverbs ib_ipoib sg ib_cm ib_sa ipv6 sd_mod crc_t10dif loop arcmsr md_mod pcspkr ib_mthca ib_mad ib_core 8250_pnp fuse af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc btrfs lzo_decompress lzo_compress zlib_deflate crc32c libcrc32c crypto_hash crypto_algapi ata_generic pata_acpi e1000 pata_amd sata_nv libata scsi_mod unix [last unloaded: scsi_wait_scan]
[  240.948007]
[  240.948007] Pid: 0, comm: kworker/0:0 Not tainted 3.1.0-rc2+ #29 Supermicro H8DCE/H8DCE
[  240.948007] RIP: 0010:[<ffffffffa0366ce9>]  [<ffffffffa0366ce9>] ipoib_start_xmit+0x39/0x280 [ib_ipoib]
[  240.948007] RSP: 0018:ffff8801ffc03c10  EFLAGS: 00010246
[  240.948007] RAX: 0000000000000000 RBX: ffff8801f99ea000 RCX: 0000000000004420
[  240.948007] RDX: 0000000000000000 RSI: ffff8801f99ea000 RDI: ffff8801f9afd500
[  240.948007] RBP: ffff8801ffc03c40 R08: ffff8801f940d49c R09: ffff8801f9852240
[  240.948007] R10: 0000000000000000 R11: 0000000000000020 R12: ffff8801f9afd500
[  240.948007] R13: 0000000000000050 R14: ffff8801f99ea600 R15: ffff8801f9852280
[  240.948007] FS:  00007f0b66016700(0000) GS:ffff8801ffc00000(0000) knlGS:0000000000000000
[  240.948007] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  240.948007] CR2: 0000000000000040 CR3: 00000001f9a65000 CR4: 00000000000006e0
[  240.948007] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  240.948007] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  240.948007] Process kworker/0:0 (pid: 0, threadinfo ffff8800bfe82000, task ffff8800bfe6f1a0)
[  240.948007] Stack:
[  240.948007]  0000000000000010 ffff8801f9afd500 0000000000004420 0000000000000050
[  240.948007]  ffff8801f99ea000 ffff8801f9852280 ffff8801ffc03ca0 ffffffff812cd5e0
[  240.948007]  0000000000000001 ffffffffa03721a0 ffffffff8131f680 ffff8801fa110540
[  240.948007] Call Trace:
[  240.948007]  <IRQ>
[  240.948007]  [<ffffffff812cd5e0>] dev_hard_start_xmit+0x2a0/0x590
[  240.948007]  [<ffffffff8131f680>] ? arp_create+0x70/0x200
[  240.948007]  [<ffffffff812e8e1f>] sch_direct_xmit+0xef/0x1c0
[  240.948007]  [<ffffffff812cd9f9>] dev_queue_xmit+0x129/0x3b0
[  240.948007]  [<ffffffff8131f853>] arp_send+0x43/0x50
[  240.948007]  [<ffffffff8131f96b>] arp_solicit+0x10b/0x240
Comment 1 Bernd Schubert 2011-08-15 15:43:49 UTC
(gdb) l *(ipoib_start_xmit+0x39)
0x1d19 is in ipoib_start_xmit (include/net/dst.h:91).
86              };
87      };
88
89      static inline struct neighbour *dst_get_neighbour(struct dst_entry *dst)
90      {
91              return rcu_dereference(dst->_neighbour);
92      }
93
94      static inline struct neighbour *dst_get_neighbour_raw(struct dst_entry *dst)
95      {
Comment 2 Bernd Schubert 2011-08-15 15:58:38 UTC
Seems to be caused by commit 69cce1d1404968f78b177a0314f5822d5afdbbfb.

After resolving dev_hard_start_xmit+0x2a0 and then checking where .ndo_start_xmit is set in ipoib, I see that the crash happens in ipoib_start_xmit(). While I'm not familiar with that code at all, the right fix seems to be to test for

likely(skb_dst(skb))

and only then to 

n = dst_get_neighbour(skb_dst(skb))
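
To illustrate the idea, here is a minimal userspace sketch, not the actual patch. The struct layouts, the mock_ helpers and lookup_neighbour_checked() are hypothetical stand-ins for the kernel types and for skb_dst()/dst_get_neighbour(); the only point is that the dst pointer gets tested before it is dereferenced.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins; NOT the real kernel definitions. */
struct neighbour { int id; };
struct dst_entry { struct neighbour *_neighbour; };
struct sk_buff   { struct dst_entry *dst; };

/* Stand-in for skb_dst(): may legitimately return NULL. */
static inline struct dst_entry *mock_skb_dst(const struct sk_buff *skb)
{
    return skb->dst;
}

/* Stand-in for dst_get_neighbour(): dereferences dst, so dst must not be NULL. */
static inline struct neighbour *mock_dst_get_neighbour(struct dst_entry *dst)
{
    return dst->_neighbour;
}

/*
 * Hypothetical helper showing the proposed ordering: check the dst first,
 * and only then look up the neighbour. Returns NULL if the skb has no dst.
 */
struct neighbour *lookup_neighbour_checked(const struct sk_buff *skb)
{
    struct dst_entry *dst = mock_skb_dst(skb);

    if (dst == NULL)
        return NULL;

    return mock_dst_get_neighbour(dst);
}
```

With this ordering, an skb without a dst simply yields no neighbour, and the caller can take its "no neighbour" path instead of faulting.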


Btw, would it be possible not to use single-letter variable names? Just checking where 'n' is used is horrible, as searching the code for a single letter does not work well.


Thanks,
Bernd
Comment 3 Bernd Schubert 2011-08-15 16:27:43 UTC
Created attachment 68942 [details]
Patch to fix the problem.
Comment 4 Bernd Schubert 2011-08-15 16:28:12 UTC
Created attachment 68952 [details]
Rename 'n' into something more sane.
Comment 5 Bernd Schubert 2011-08-15 16:29:06 UTC
Could someone please also check the usage of likely()? As I'm running into an unlikely() condition, maybe it is not that unlikely?
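
For context, a small userspace sketch of what the hint does. The two macros below re-create the usual kernel definitions on top of the GCC/Clang builtin __builtin_expect(); classify_dst() is a hypothetical helper, not code from the driver. The hint only tells the compiler which branch to optimise for, so a wrong guess costs performance but does not change the result.

```c
#include <stddef.h>

/* Userspace re-creation of the branch-hint macros (needs GCC or Clang). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical helper: returns 1 if a dst is present, 0 otherwise.
 * The likely() annotation does not alter which branch is taken.
 */
int classify_dst(const void *dst)
{
    if (likely(dst != NULL))
        return 1;   /* path the compiler is told to favour */

    return 0;       /* still handled correctly, just assumed to be rare */
}
```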


Thanks,
Bernd
Comment 6 Florian Mickler 2011-08-16 14:55:35 UTC
Can you submit that patch from comment #3 for review to netdev@vger.kernel.org please?
Comment 7 Florian Mickler 2011-08-16 15:07:59 UTC
and cc the relevant maintainers/mailing lists (scripts/get_maintainer.pl will help you there)
Comment 8 Bernd Schubert 2011-08-16 15:09:51 UTC
I already sent the patch today to netdev and linux-rdma


Cheers,
Bernd
Comment 9 Bernd Schubert 2011-08-19 14:54:32 UTC
Fixed by commit 22cfb0bf6721bb1f865f67bc21e3c36c272faf36.
Comment 10 Florian Mickler 2011-08-19 16:40:40 UTC
Thx.