Bug 118031 - kernel panic in ipmi driver
Summary: kernel panic in ipmi driver
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-05-11 20:25 UTC by NUXI
Modified: 2016-05-11 20:25 UTC

See Also:
Kernel Version: 3.19 to 4.5.4
Subsystem:
Regression: No
Bisected commit-id:



Description NUXI 2016-05-11 20:25:01 UTC
After an upgrade from 3.2 to 4.1, some of my equipment started kernel panicking in the IPMI handler (often within a few minutes of booting, but sometimes only after a few hours).

[  337.167974] general protection fault: 0000 [#1] PREEMPT SMP 
[  337.235887] Modules linked in:
[  337.272453] CPU: 6 PID: 40 Comm: ksoftirqd/6 Not tainted 4.5.4 #3
[  337.345555] Hardware name: RadiSys Corp. ATCA-4600/ATCA-4600           , BIOS A4600 0x1.0x0.00.00-0x3 03/27/2012
[  337.467720] task: ffff8806719555c0 ti: ffff880671b14000 task.ti: ffff880671b14000
[  337.557516] RIP: 0010:[<ffffffffbe396f56>]  [<ffffffffbe396f56>] handle_new_recv_msgs+0x98/0x14a
[  337.662989] RSP: 0018:ffff880671b17cb8  EFLAGS: 00010046
[  337.727735] RAX: dead000000000100 RBX: ffff880670803000 RCX: 0000000000000007
[  337.813364] RDX: dead000000000200 RSI: 0000000000000246 RDI: dead000000000200
[  337.898984] RBP: ffff88066e7e3000 R08: dead000000000100 R09: 0000000000000430
[  337.984617] R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000246
[  338.070253] R13: 0000000000000000 R14: ffff880670803cb4 R15: ffff880670803cb8
[  338.155889] FS:  0000000000000000(0000) GS:ffff88067fa00000(0000) knlGS:0000000000000000
[  338.253000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  338.321914] CR2: 00007fcf12c87945 CR3: 000000003f00b000 CR4: 00000000000406e0
[  338.407551] Stack:
[  338.431577]  ffff880600000000 000000007fa16300 ffff88066e7e3000 ffff8806719555c0
[  338.520356]  ffff880670803d00 ffff8806719555c0 0000000000000000 ffff88037016c280
[  338.609143]  ffff88007915fe78 ffffffffbe940726 0000000000000000 ffff880670720c40
[  338.697922] Call Trace:
[  338.727177]  [<ffffffffbe940726>] ? __schedule+0x8a4/0x91b
[  338.792962]  [<ffffffffbe3970ef>] ? smi_recv_tasklet+0xe7/0xf0
[  338.862936]  [<ffffffffbe32d698>] ? blk_done_softirq+0x88/0x9b
[  338.932909]  [<ffffffffbe3f7ff0>] ? kbd_bh+0x79/0x85
[  338.992441]  [<ffffffffbe05f1e3>] ? tasklet_action+0x72/0xc5
[  339.060317]  [<ffffffffbe05f851>] ? __do_softirq+0x122/0x28d
[  339.128207]  [<ffffffffbe076e89>] ? smpboot_create_threads+0x5c/0x5c
[  339.204432]  [<ffffffffbe05f9d7>] ? run_ksoftirqd+0x1b/0x40
[  339.271269]  [<ffffffffbe07701e>] ? smpboot_thread_fn+0x195/0x19a
[  339.344374]  [<ffffffffbe07457e>] ? kthread+0xc3/0xcb
[  339.404944]  [<ffffffffbe0744bb>] ? kthread_freezable_should_stop+0x5c/0x5c
[  339.488503]  [<ffffffffbe943ccf>] ? ret_from_fork+0x3f/0x70
[  339.555336]  [<ffffffffbe0744bb>] ? kthread_freezable_should_stop+0x5c/0x5c
[  339.638874] Code: e8 9a c6 5a 00 49 89 c4 83 7c 24 0c 00 7f 4c 48 8b 45 00 48 8b 55 08 49 b8 00 01 00 00 00 00 ad de 48 bf 00 02 00 00 00 00 ad de <48> 89 50 08 48 89 44 24 20 48 89 02 4c 89 45 00 48 89 7d 08 75 
[  339.865890] RIP  [<ffffffffbe396f56>] handle_new_recv_msgs+0x98/0x14a
[  339.943166]  RSP <ffff880671b17cb8>
[  339.984952] ---[ end trace ef78791815fa859c ]---
[  340.040306] Kernel panic - not syncing: Fatal exception in interrupt
[  341.143442] Shutting down cpus with NMI
[  341.195113] IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 34, got netfn 7 cmd 33
[  341.315195] IPMI message received with no owner. This
[  341.315195] could be because of a malformed message, or
[  341.315195] because of a hardware error.  Contact your
[  341.315195] hardware vender for assistance
[  341.549096] general protection fault: 0000 [#2] PREEMPT SMP 
[  341.616984] Modules linked in:
[  341.653542] CPU: 6 PID: 40 Comm: ksoftirqd/6 Tainted: G      D         4.5.4 #3
[  341.741257] Hardware name: RadiSys Corp. ATCA-4600/ATCA-4600           , BIOS A4600 0x1.0x0.00.00-0x3 03/27/2012
[  341.863421] task: ffff8806719555c0 ti: ffff880671b14000 task.ti: ffff880671b14000
[  341.953220] RIP: 0010:[<ffffffffbe396f56>]  [<ffffffffbe396f56>] handle_new_recv_msgs+0x98/0x14a
[  342.058691] RSP: 0018:ffff880671b17858  EFLAGS: 00010046
[  342.122393] RAX: dead000000000100 RBX: ffff880670803000 RCX: 0000000000000007
[  342.208021] RDX: dead000000000200 RSI: 0000000000000046 RDI: dead000000000200
[  342.293649] RBP: ffff88037009f400 R08: dead000000000100 R09: 0000000000000459
[  342.379277] R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000000
[  342.464898] R13: 0000000000000001 R14: ffff880670803cb4 R15: ffff880670803cb8
[  342.550526] FS:  0000000000000000(0000) GS:ffff88067fa00000(0000) knlGS:0000000000000000
[  342.647631] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  342.716547] CR2: 00007fcf12c87945 CR3: 000000003f00b000 CR4: 00000000000406e0
[  342.802176] Stack:
[  342.826196]  ffff313433343432 00000000be34f1c4 ffff88037009f400 ffffffffbf295137
[  342.914975]  ffff880670803d00 ffff880671b17918 ffffffffbf295137 ffffffffbecff089
[  343.003754]  ffffffffbecff087 ffff880671b17918 ffffffffbf29513d ffffffffbe350e54
[  343.092540] Call Trace:
[  343.121794]  [<ffffffffbe350e54>] ? vsnprintf+0x83/0x3d1
[  343.185492]  [<ffffffffbe3970ef>] ? smi_recv_tasklet+0xe7/0xf0
[  343.255460]  [<ffffffffbe0a3c97>] ? mod_timer+0x184/0x196
[  343.320214]  [<ffffffffbe407a6f>] ? wait_for_xmitr+0x1a/0x7d
[  343.388088]  [<ffffffffbe397365>] ? ipmi_smi_msg_received+0x26d/0x28a
[  343.465366]  [<ffffffffbe39c02b>] ? smi_event_handler+0x3f9/0x54e
[  343.538472]  [<ffffffffbe093ecf>] ? console_unlock+0x3d5/0x40e
[  343.608436]  [<ffffffffbe39c19f>] ? flush_messages+0x1f/0x26
[  343.676315]  [<ffffffffbe395b83>] ? panic_event+0xe5/0x10c
[  343.742105]  [<ffffffffbe09481c>] ? vprintk_emit+0x3b2/0x3b4
[  343.809992]  [<ffffffffbe074fa2>] ? notifier_call_chain+0x3e/0x6d
[  343.883090]  [<ffffffffbe075447>] ? __atomic_notifier_call_chain+0x3a/0x4d
[  343.965603]  [<ffffffffbe101088>] ? panic+0xe9/0x1fe
[  344.025131]  [<ffffffffbe016944>] ? oops_end+0x8a/0x99
[  344.086752]  [<ffffffffbe9459d8>] ? general_protection+0x28/0x30
[  344.158805]  [<ffffffffbe396f56>] ? handle_new_recv_msgs+0x98/0x14a
[  344.233991]  [<ffffffffbe396f30>] ? handle_new_recv_msgs+0x72/0x14a
[  344.309172]  [<ffffffffbe940726>] ? __schedule+0x8a4/0x91b
[  344.374955]  [<ffffffffbe3970ef>] ? smi_recv_tasklet+0xe7/0xf0
[  344.444915]  [<ffffffffbe32d698>] ? blk_done_softirq+0x88/0x9b
[  344.514875]  [<ffffffffbe3f7ff0>] ? kbd_bh+0x79/0x85
[  344.574394]  [<ffffffffbe05f1e3>] ? tasklet_action+0x72/0xc5
[  344.642265]  [<ffffffffbe05f851>] ? __do_softirq+0x122/0x28d
[  344.710138]  [<ffffffffbe076e89>] ? smpboot_create_threads+0x5c/0x5c
[  344.786363]  [<ffffffffbe05f9d7>] ? run_ksoftirqd+0x1b/0x40
[  344.853191]  [<ffffffffbe07701e>] ? smpboot_thread_fn+0x195/0x19a
[  344.926282]  [<ffffffffbe07457e>] ? kthread+0xc3/0xcb
[  344.986846]  [<ffffffffbe0744bb>] ? kthread_freezable_should_stop+0x5c/0x5c
[  345.070379]  [<ffffffffbe943ccf>] ? ret_from_fork+0x3f/0x70
[  345.137206]  [<ffffffffbe0744bb>] ? kthread_freezable_should_stop+0x5c/0x5c
[  345.220738] Code: e8 9a c6 5a 00 49 89 c4 83 7c 24 0c 00 7f 4c 48 8b 45 00 48 8b 55 08 49 b8 00 01 00 00 00 00 ad de 48 bf 00 02 00 00 00 00 ad de <48> 89 50 08 48 89 44 24 20 48 89 02 4c 89 45 00 48 89 7d 08 75 
[  345.447735] RIP  [<ffffffffbe396f56>] handle_new_recv_msgs+0x98/0x14a
[  345.525009]  RSP <ffff880671b17858>
[  345.566787] ---[ end trace ef78791815fa859d ]---
[  345.622137] Kernel panic - not syncing: Fatal exception in interrupt
[  345.698377] Kernel Offset: 0x3d000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)


I managed to bisect it to this commit: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7ea0ed2b5be81781ba976bc03414ef5da76270b9. Before that commit the machines are perfectly stable; after it they crash.

Here is the crash with DEBUG_LIST enabled. After 2 minutes the IPMI watchdog kicks in and reboots the system because the system is no longer processing IPMI messages correctly.

[ 3450.354706] IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 34, got netfn 7 cmd 33
[ 3450.354712] IPMI message received with no owner. This
[ 3450.354712] could be because of a malformed message, or
[ 3450.354712] because of a hardware error.  Contact your
[ 3450.354712] hardware vender for assistance
[ 3450.354716] ------------[ cut here ]------------
[ 3450.354726] WARNING: CPU: 11 PID: 66 at lib/list_debug.c:53 __list_del_entry+0x8d/0x9b()
[ 3450.354728] list_del corruption, ffff88066e2ba800->next is LIST_POISON1 (dead000000000100)
[ 3450.354730] Modules linked in:
[ 3450.354735] CPU: 11 PID: 66 Comm: ksoftirqd/11 Not tainted 4.5.4 #2
[ 3450.354737] Hardware name: RadiSys Corp. ATCA-4600/ATCA-4600           , BIOS A4600 0x1.0x0.00.00-0x3 03/27/2012
[ 3450.354739]  0000000000000000 ffffffff84d3009f ffffffff84341f8b ffffffff84d3009f
[ 3450.354743]  0000000000000082 ffff88067104fc38 ffffffff84d3009f ffff88067104fc38
[ 3450.354746]  ffffffff8405b4f0 0000000000000000 ffffffff84357b98 ffff88065156e6d0
[ 3450.354749] Call Trace:
[ 3450.354757]  [<ffffffff84341f8b>] ? dump_stack+0x63/0x8c
[ 3450.354763]  [<ffffffff8405b4f0>] ? warn_slowpath_common+0x99/0xb2
[ 3450.354766]  [<ffffffff84357b98>] ? __list_del_entry+0x8d/0x9b
[ 3450.354769]  [<ffffffff8405b5aa>] ? warn_slowpath_fmt+0x45/0x4d
[ 3450.354775]  [<ffffffff84084b56>] ? dequeue_task_fair+0x6d2/0x6e1
[ 3450.354781]  [<ffffffff840132de>] ? __switch_to+0x406/0x47f
[ 3450.354783]  [<ffffffff84357b98>] ? __list_del_entry+0x8d/0x9b
[ 3450.354786]  [<ffffffff84357baf>] ? list_del+0x9/0x26
[ 3450.354792]  [<ffffffff84391322>] ? handle_new_recv_msgs+0x83/0x123
[ 3450.354800]  [<ffffffff849324e6>] ? __schedule+0x8a4/0x91b
[ 3450.354803]  [<ffffffff843914a4>] ? smi_recv_tasklet+0xc1/0xcc
[ 3450.354807]  [<ffffffff843283ec>] ? blk_done_softirq+0x81/0x99
[ 3450.354811]  [<ffffffff8405ed5b>] ? tasklet_action+0x72/0xc5
[ 3450.354813]  [<ffffffff8405f3c9>] ? __do_softirq+0x122/0x28d
[ 3450.354819]  [<ffffffff84076567>] ? smpboot_create_threads+0x5c/0x5c
[ 3450.354822]  [<ffffffff8405f54f>] ? run_ksoftirqd+0x1b/0x40
[ 3450.354825]  [<ffffffff840766fc>] ? smpboot_thread_fn+0x195/0x19a
[ 3450.354828]  [<ffffffff84073dad>] ? kthread+0xc3/0xcb
[ 3450.354831]  [<ffffffff84073cea>] ? kthread_freezable_should_stop+0x5c/0x5c
[ 3450.354835]  [<ffffffff849356cf>] ? ret_from_fork+0x3f/0x70
[ 3450.354838]  [<ffffffff84073cea>] ? kthread_freezable_should_stop+0x5c/0x5c
[ 3450.354840] ---[ end trace 823e65229bb291df ]---
[ 3450.354842] IPMI message received with no owner. This
[ 3450.354842] could be because of a malformed message, or
[ 3450.354842] because of a hardware error.  Contact your
[ 3450.354842] hardware vender for assistance
[ 3450.354846] ------------[ cut here ]------------
[ 3450.354849] WARNING: CPU: 11 PID: 66 at lib/list_debug.c:56 __list_del_entry+0x8d/0x9b()
[ 3450.354851] list_del corruption, ffff88066e2ba800->prev is LIST_POISON2 (dead000000000200)
[ 3450.354852] Modules linked in:
[ 3450.354855] CPU: 11 PID: 66 Comm: ksoftirqd/11 Tainted: G        W       4.5.4 #2
[ 3450.354856] Hardware name: RadiSys Corp. ATCA-4600/ATCA-4600           , BIOS A4600 0x1.0x0.00.00-0x3 03/27/2012
[ 3450.354858]  0000000000000000 ffffffff84d3009f ffffffff84341f8b ffffffff84d3009f
[ 3450.354861]  0000000000000082 ffff88067104fc38 ffffffff84d3009f ffff88067104fc38
[ 3450.354864]  ffffffff8405b4f0 0000000000000000 ffffffff84357b98 ffff88065156e6d0
[ 3450.354867] Call Trace:
[ 3450.354869]  [<ffffffff84341f8b>] ? dump_stack+0x63/0x8c
[ 3450.354873]  [<ffffffff8405b4f0>] ? warn_slowpath_common+0x99/0xb2
[ 3450.354876]  [<ffffffff84357b98>] ? __list_del_entry+0x8d/0x9b
[ 3450.354879]  [<ffffffff8405b5aa>] ? warn_slowpath_fmt+0x45/0x4d
[ 3450.354881]  [<ffffffff84084b56>] ? dequeue_task_fair+0x6d2/0x6e1
[ 3450.354884]  [<ffffffff840132de>] ? __switch_to+0x406/0x47f
[ 3450.354887]  [<ffffffff84357b98>] ? __list_del_entry+0x8d/0x9b
[ 3450.354890]  [<ffffffff84357baf>] ? list_del+0x9/0x26
[ 3450.354892]  [<ffffffff84391322>] ? handle_new_recv_msgs+0x83/0x123
[ 3450.354896]  [<ffffffff849324e6>] ? __schedule+0x8a4/0x91b
[ 3450.354899]  [<ffffffff843914a4>] ? smi_recv_tasklet+0xc1/0xcc
[ 3450.354901]  [<ffffffff843283ec>] ? blk_done_softirq+0x81/0x99
[ 3450.354903]  [<ffffffff8405ed5b>] ? tasklet_action+0x72/0xc5
[ 3450.354906]  [<ffffffff8405f3c9>] ? __do_softirq+0x122/0x28d
[ 3450.354909]  [<ffffffff84076567>] ? smpboot_create_threads+0x5c/0x5c
[ 3450.354912]  [<ffffffff8405f54f>] ? run_ksoftirqd+0x1b/0x40
[ 3450.354915]  [<ffffffff840766fc>] ? smpboot_thread_fn+0x195/0x19a
[ 3450.354917]  [<ffffffff84073dad>] ? kthread+0xc3/0xcb
[ 3450.354920]  [<ffffffff84073cea>] ? kthread_freezable_should_stop+0x5c/0x5c
[ 3450.354923]  [<ffffffff849356cf>] ? ret_from_fork+0x3f/0x70
[ 3450.354926]  [<ffffffff84073cea>] ? kthread_freezable_should_stop+0x5c/0x5c
[ 3450.354928] ---[ end trace 823e65229bb291e0 ]---
