Latest working kernel version: (not known)
Earliest failing kernel version: 2.6.24
Distribution: Ubuntu 7.10 server
Hardware Environment: Intel S5000PAL motherboard + Mellanox MT25204 InfiniBand HCA
Software Environment: Ubuntu 7.10 server + OFED 1.2.5.5 user space components.
Problem Description: ib_write_bw triggers kernel bug in ib_mthca

Steps to reproduce:
1) Compile the 2.6.24.2 kernel with kernel debugging enabled.
2) Download, compile and install the OFED 1.2.5.5 userspace components.
3) Run the following commands:

modprobe ib_uverbs
modprobe rdma_ucm
/bin/mkdir -p /dev/infiniband
if [ ! -e /dev/infiniband/uverbs0 ]; then
  /bin/mknod /dev/infiniband/uverbs0 c $(cat /sys/class/infiniband_verbs/uverbs0/dev | sed 's/:/ /g')
fi
if [ ! -e /dev/infiniband/rdma_cm ]; then
  /bin/mknod /dev/infiniband/rdma_cm c $(cat /sys/class/misc/rdma_cm/dev | sed 's/:/ /g')
fi
ib_write_bw

Result:

------------[ cut here ]------------
kernel BUG at include/linux/scatterlist.h:59!
invalid opcode: 0000 [1] SMP
CPU 1
Modules linked in: ib_iser iscsi_tcp libiscsi scsi_transport_iscsi rdma_ucm rdma_cm iw_cm ib_addr ib_uverbs ib_ipoib ib_cm ib_sa ipv6 parport_pc lp parport loop af_packet ib_mthca ib_mad ib_core pcspkr psmouse serio_raw iTCO_wdt iTCO_vendor_support shpchp pci_hotplug evdev ext3 jbd mbcache sg sd_mod sr_mod cdrom ata_piix ata_generic libata scsi_mod ehci_hcd uhci_hcd usbcore e1000 fuse
Pid: 4343, comm: ib_write_bw Not tainted 2.6.24 #3
RIP: 0010:[<ffffffff881a0191>] [<ffffffff881a0191>] :ib_mthca:mthca_map_user_db+0x2c1/0x2f0
RSP: 0018:ffff81007b76dd58 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff81007b7e9e58 RCX: ffff81007b7e9e48
RDX: ffff810003de3100 RSI: 0000000087654321 RDI: ffff810003e5d270
RBP: ffff81007c8c0000 R08: 800000007a5d8067 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 000000000060c000 R14: ffff81007b7e9000 R15: ffff81007dcc3980
FS: 00002b21dcc8bc60(0000) GS:ffff81007f40d580(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000060e000 CR3: 000000007b751000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ib_write_bw (pid: 4343, threadinfo ffff81007b76c000, task ffff81007b7f0000)
Stack: ffff81007b76dd98 0000000000000000 ffffffff80332e4d 0000000000000002
 ffff81007b7e9e48 ffff81007b7e9e78 ffff81007b7e9dc8 0000003f00000046
 ffff810003de3100 0000000000000046 ffff81007d3ec750 000000000000007f
Call Trace:
 [<ffffffff80332e4d>] __down_write_trylock+0x1d/0x60
 [<ffffffff8819e2b0>] :ib_mthca:mthca_create_cq+0xb0/0x200
 [<ffffffff8825a334>] :ib_uverbs:ib_uverbs_create_cq+0x1b4/0x2d0
 [<ffffffff8825639e>] :ib_uverbs:ib_uverbs_write+0xbe/0xe0
 [<ffffffff8029d27d>] vfs_write+0xdd/0x190
 [<ffffffff8029d983>] sys_write+0x53/0x90
 [<ffffffff8020c2ee>] system_call+0x7e/0x83

Code: 0f 0b eb fe 0f 0b eb fe 48 8b 74 24 30 4c 89 ae 88 00 00 00
RIP [<ffffffff881a0191>] :ib_mthca:mthca_map_user_db+0x2c1/0x2f0
 RSP <ffff81007b76dd58>
---[ end trace 28761e105a0be450 ]---
Created attachment 14778: Kernel config (.config)
Looks like the various sg-list conversions left out some needed calls to sg_init_table(). I'll fix this up.
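To illustrate the failure mode described above, here is a minimal userspace sketch in C. It is not the kernel code and not the patch: `struct model_sg`, `model_sg_init_table()` and `model_sg_set_page()` are hypothetical stand-ins, assuming that a debug-enabled kernel stamps each scatterlist entry with a magic value in sg_init_table() and checks it when a page is assigned. The constant 0x87654321 is the value visible in RSI in the oops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Magic value; the same number shows up in RSI in the oops. */
#define SG_MAGIC 0x87654321UL

/* Hypothetical, simplified stand-in for struct scatterlist. */
struct model_sg {
    unsigned long sg_magic;
    void *page;
};

/* Models sg_init_table(): zero the entries, then stamp each with the magic. */
static void model_sg_init_table(struct model_sg *sgl, unsigned int nents)
{
    assert(sgl != NULL);
    memset(sgl, 0, sizeof(*sgl) * nents);
    for (unsigned int i = 0; i < nents; i++)
        sgl[i].sg_magic = SG_MAGIC;
}

/*
 * Models the debug check made when a page is assigned to an entry.
 * Returns false where this sketch assumes the kernel would hit BUG().
 */
static bool model_sg_set_page(struct model_sg *sg, void *page)
{
    if (sg->sg_magic != SG_MAGIC)
        return false;
    sg->page = page;
    return true;
}
```

In this model, calling model_sg_set_page() on an entry that never went through model_sg_init_table() fails the magic check, which corresponds to the BUG reported at include/linux/scatterlist.h:59; initializing the table first makes the same call succeed.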
Created attachment 14782: add missing sg_init_table() call

Please try this patch and let me know if this fixes things.
The attached patch solves this issue. Thanks for the quick fix!
Roland, I can't see this commit in either Linus's tree or the IB tree. Is it planned to be submitted?
Sorry, the upstream fix is in a slightly different spot. It is fe174357 ("IB/mthca: Add missing sg_init_table() in mthca_map_user_db()"), which went into Linus's tree just after 2.6.25-rc1.