Bug 204071 - Dell PowerEdge R630 / 02C2CP - BUG: soft lockup - CPU stuck - Due to IPMI Driver dmi_add_platform_ipmi() patch
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_other
Reported: 2019-07-05 15:16 UTC by Torsten Rabold
Modified: 2020-01-16 16:36 UTC
Kernel Version: v5.1-rc6+
Regression: No


Attachments
dmesg, lshw, dmidecode, firmware and strace of ipmitool (739.38 KB, text/plain)
2019-07-05 15:16 UTC, Torsten Rabold

Description Torsten Rabold 2019-07-05 15:16:29 UTC
Created attachment 283555
dmesg, lshw, dmidecode, firmware and strace of ipmitool

Linux kernels from v5.1-rc6 up to the latest 5.2-rc7 have an issue in the
IPMI driver that leads to CPU lockups on the Dell PowerEdge R630/02C2CP (BIOS 2.9.1, 12/04/2018).

Systems become completely unusable.

watchdog: BUG: soft lockup - CPU#5 stuck
-> more dmesg, hw/fw, and strace info attached

The issue persists up to the latest 5.2-rc kernels.
The issue was introduced with commit bd2e98b351b668fa91
   ipmi: Fix failure on SMBIOS specified devices

CPU stall can be triggered by:
$ ipmitool sensor list

Issue does not show up on DELL640 servers.

5.1.16 kernels with bd2e98b35 reverted run flawlessly.

---

Anyway, the patch seems to remove only one of the hacks in drivers/char/ipmi/ipmi_dmi.c:
// SPDX-License-Identifier: GPL-2.0+
/*
 * A hack to create a platform device from a DMI entry.  This will
 * allow autoloading of the IPMI drive based on SMBIOS entries.
 */
 ...
Comment 1 Dan Poltawski 2019-07-10 13:31:29 UTC
I believe I'm seeing a similar issue on 5.1.16 on a DELL620
Comment 2 Dan Poltawski 2019-07-10 13:31:46 UTC
*PowerEdge R620
Comment 3 April 2019-08-06 23:57:40 UTC
If this is a blocker, you can add the following kernel command-line parameters as a mitigation:

ipmi_si.trydmi=0 ipmi_si.tryacpi=0 ipmi_si.tryplatform=0
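
Whether all three parameters actually reached the running kernel can be checked against /proc/cmdline. The following is only a minimal sketch: the helper name `missing_ipmi_params` is made up for illustration, and how the parameters get onto the command line (GRUB, systemd-boot, PXE config, ...) depends on the system and is not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): print each mitigation parameter
# from this comment that is absent from a kernel command line.
# With no argument it inspects the running kernel's /proc/cmdline.
missing_ipmi_params() {
    cmdline=${1-$(cat /proc/cmdline)}
    for p in ipmi_si.trydmi=0 ipmi_si.tryacpi=0 ipmi_si.tryplatform=0; do
        case " $cmdline " in
            *" $p "*) ;;               # present as a whole word
            *) printf '%s\n' "$p" ;;   # missing: report it
        esac
    done
}
```

Calling `missing_ipmi_params` after a reboot prints nothing when all three parameters are in effect, and otherwise lists the ones that are missing.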
Comment 4 Torsten Rabold 2019-08-08 15:07:43 UTC
Mitigation boot cmdline parameters applied to a 5.2.7 kernel.
Outcome:
The CPU lockups do not occur but the ipmi_si kernel module is not loading.
IPMI sensor data can't be read.
Comment 5 Torsten Rabold 2019-08-08 15:59:57 UTC
The last released kernel with working IPMI and no CPU lockup on the R630 is 5.0.9.

Regression starts with 5.1.0-rc1. With this version the ipmi_si module is already not loading. Lockups do not occur.

The CPU lockups/stalls appear with v5.1-rc6.

5.3-rc3 still has the CPU stalls/lockups.
dmesg:
[  150.878229] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  150.884891] rcu:     5-...0: (2 ticks this GP) idle=092/0/0x1 softirq=3032/3032 fqs=2578 
[  150.893768] rcu:     12-...0: (2 GPs behind) idle=686/0/0x1 softirq=4613/4613 fqs=2579 
[  150.902450]  (detected by 0, t=5258 jiffies, g=21757, q=1886)
[  150.908891] Sending NMI from CPU 0 to CPUs 5:
[  150.914759] NMI backtrace for cpu 5
[  150.914760] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.3.0-rc3-dpx #1
[  150.914761] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.9.1 12/04/2018
[  150.914761] RIP: 0010:queued_spin_lock_slowpath+0x17c/0x1d0
[  150.914763] Code: 48 03 34 c5 80 07 de 81 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 18 0e eb 02 f3 90 <8b> 07 66 85 c0 75 f7 41 89 c0 66 31 c0 39 c1 74 2a 48 85 f6 c6 07
[  150.914763] RSP: 0018:ffffc900065c0e80 EFLAGS: 00000002
[  150.914765] RAX: 0000000000180101 RBX: 0000000000000206 RCX: 0000000000180000
[  150.914765] RDX: ffff88a03cd297c0 RSI: 0000000000000000 RDI: ffff88903b73c620
[  150.914766] RBP: ffff88903b73c600 R08: 0000000000180000 R09: 0000000000000000
[  150.914766] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88903b73c6e8
[  150.914767] R13: ffff88903b73c620 R14: 00000000ffff5609 R15: 0000000000000000
[  150.914768] FS:  0000000000000000(0000) GS:ffff88a03cd00000(0000) knlGS:0000000000000000
[  150.914768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  150.914769] CR2: 00007fe252887ef8 CR3: 000000207f60a004 CR4: 00000000001606e0
[  150.914769] Call Trace:
[  150.914770]  <IRQ>
[  150.914770]  _raw_spin_lock_irqsave+0x27/0x30
[  150.914771]  smi_timeout+0x24/0xc0 [ipmi_si]
[  150.914771]  ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
[  150.914772]  call_timer_fn+0x2d/0x140
[  150.914772]  run_timer_softirq+0x1e5/0x430
[  150.914772]  ? tick_sched_handle+0x25/0x60
[  150.914773]  ? tick_sched_timer+0x37/0x70
[  150.914773]  ? __hrtimer_run_queues+0x10c/0x270
[  150.914774]  __do_softirq+0x117/0x2d0
[  150.914774]  irq_exit+0x92/0xa0
[  150.914775]  smp_apic_timer_interrupt+0x6c/0x130
[  150.914775]  apic_timer_interrupt+0xf/0x20
[  150.914775]  </IRQ>
[  150.914776] RIP: 0010:cpuidle_enter_state+0xc5/0x400
[  150.914777] Code: c7 0f 1f 44 00 00 31 ff e8 08 8a bc ff 80 7c 24 0f 00 74 12 9c 58 f6 c4 02 0f 85 15 03 00 00 31 ff e8 9f 9d c1 ff fb 45 85 f6 <0f> 88 7d 02 00 00 4c 2b 7c 24 10 49 63 ce 48 ba cf f7 53 e3 a5 9b
[  150.914778] RSP: 0018:ffffc90006463e68 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[  150.914779] RAX: ffff88a03cd28bc0 RBX: ffffe8ffff9118c0 RCX: 000000000000001f
[  150.914779] RDX: 0000000000000000 RSI: 000e5a451fa0e6ac RDI: 0000000000000000
[  150.914780] RBP: ffffffff820bebc0 R08: fffa7aef6f49d076 R09: 0000000000001f19
[  150.914780] R10: ffff88a03cd27c24 R11: 0000000000000018 R12: 0000000000000005
[  150.914781] R13: 0000000000000005 R14: 0000000000000004 R15: 0000001e3af927cd
[  150.914781]  ? cpuidle_enter_state+0xa8/0x400
[  150.914782]  cpuidle_enter+0x29/0x40
[  150.914782]  do_idle+0x1e2/0x220
[  150.914783]  cpu_startup_entry+0x19/0x20
[  150.914783]  start_secondary+0x153/0x1a0
[  150.914784]  secondary_startup_64+0xa4/0xb0
[  150.914791] Sending NMI from CPU 0 to CPUs 12:
[  151.208741] NMI backtrace for cpu 12
[  151.208742] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 5.3.0-rc3-dpx #1
[  151.208742] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.9.1 12/04/2018
[  151.208743] RIP: 0010:queued_spin_lock_slowpath+0x61/0x1d0
[  151.208744] Code: f0 0f ba 2f 08 0f 82 6b 01 00 00 8b 37 81 e6 ff 00 ff ff 09 f0 a9 00 ff ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 ff 75 04 c6 47 01 00
[  151.208745] RSP: 0018:ffffc900066f4e38 EFLAGS: 00000002
[  151.208746] RAX: 0000000000180101 RBX: 0000000000000002 RCX: 0000000000000000
[  151.208746] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88903b73c620
[  151.208747] RBP: ffff88903b73c620 R08: 0000000000000001 R09: 0000000000000000
[  151.208747] R10: 0000000000000000 R11: 0000000000000000 R12: ffff889039fcdec8
[  151.208748] R13: 0000000000000086 R14: ffff889039fcd1f0 R15: 00000000000000cc
[  151.208748] FS:  0000000000000000(0000) GS:ffff88903f900000(0000) knlGS:0000000000000000
[  151.208749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  151.208750] CR2: 00007fe1923b0050 CR3: 000000207f60a002 CR4: 00000000001606e0
[  151.208750] Call Trace:
[  151.208750]  <IRQ>
[  151.208751]  _raw_spin_lock_irqsave+0x27/0x30
[  151.208751]  set_need_watch+0x27/0x60 [ipmi_si]
[  151.208752]  smi_remove_watch+0x9e/0x100 [ipmi_msghandler]
[  151.208752]  ipmi_smi_msg_received+0x1ed/0x300 [ipmi_msghandler]
[  151.208753]  smi_event_handler+0x13f/0x5f0 [ipmi_si]
[  151.208753]  ipmi_si_irq_handler+0x35/0x70 [ipmi_si]
[  151.208754]  __handle_irq_event_percpu+0x81/0x190
[  151.208754]  handle_irq_event_percpu+0x30/0x80
[  151.208755]  handle_irq_event+0x2d/0x50
[  151.208755]  handle_edge_irq+0x93/0x200
[  151.208756]  handle_irq+0x1f/0x30
[  151.208756]  do_IRQ+0x41/0xd0
[  151.208756]  common_interrupt+0xf/0xf
[  151.208757]  </IRQ>
[  151.208757] RIP: 0010:cpuidle_enter_state+0xc5/0x400
[  151.208758] Code: c7 0f 1f 44 00 00 31 ff e8 08 8a bc ff 80 7c 24 0f 00 74 12 9c 58 f6 c4 02 0f 85 15 03 00 00 31 ff e8 9f 9d c1 ff fb 45 85 f6 <0f> 88 7d 02 00 00 4c 2b 7c 24 10 49 63 ce 48 ba cf f7 53 e3 a5 9b
[  151.208759] RSP: 0018:ffffc9000649be68 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdc
[  151.208760] RAX: ffff88903f928bc0 RBX: ffffe8f0025118c0 RCX: 000000000000001f
[  151.208760] RDX: 0000000000000000 RSI: 000e5a451e816b09 RDI: 0000000000000000
[  151.208761] RBP: ffffffff820bebc0 R08: fffa7aef6f49d076 R09: 0000001deb083ab6
[  151.208762] R10: ffff88903f927c24 R11: 0000000000000007 R12: 000000000000000c
[  151.208762] R13: 000000000000000c R14: 0000000000000004 R15: 0000001e3a8a95bf
[  151.208763]  ? cpuidle_enter_state+0xa8/0x400
[  151.208763]  cpuidle_enter+0x29/0x40
[  151.208764]  do_idle+0x1e2/0x220
[  151.208764]  cpu_startup_entry+0x19/0x20
[  151.208765]  start_secondary+0x153/0x1a0
[  151.208765]  secondary_startup_64+0xa4/0xb0
[  184.258082] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:1:189]
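
To see at a glance which IPMI functions appear in backtraces like the two above, the log can be filtered down to the frames tagged with the IPMI modules. This is a rough sketch, not an established tool: the helper name `ipmi_frames` is hypothetical, and it assumes the usual dmesg layout of a bracketed timestamp followed by the frame.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel log on stdin and print only the stack
# frames that belong to ipmi_si or ipmi_msghandler.
ipmi_frames() {
    # 1. keep lines tagged with one of the two IPMI modules
    # 2. strip the leading "[  seconds.micros]" timestamp
    # 3. drop "? ..." entries, which the unwinder marks as unreliable
    grep -E '\[ipmi_(si|msghandler)\]' |
        sed -E 's/^\[ *[0-9]+\.[0-9]+\] +//' |
        grep -v '^?'
}
```

Typical use would be `dmesg | ipmi_frames`. On the log above, this leaves smi_timeout on CPU 5 and the set_need_watch / smi_remove_watch / ipmi_smi_msg_received / smi_event_handler / ipmi_si_irq_handler chain on CPU 12, both of which end in _raw_spin_lock_irqsave.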
Comment 6 Daniel Suchy 2019-10-18 08:54:37 UTC
Similar issue observed on Dell R410 after loading ipmi_si module. When module is blacklisted, no problems are observed; lockup appears shortly after ipmi_si is loaded.
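
For reference, blacklisting the module is usually done with a modprobe.d fragment. A minimal sketch follows; the helper name and the default file name are assumptions chosen for illustration, and writing under /etc/modprobe.d requires root. Note that a `blacklist` line only stops alias-based autoloading, so an explicit `modprobe ipmi_si` would still load the module.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe.d fragment that blacklists ipmi_si.
# The default path is an assumption; any *.conf file in /etc/modprobe.d is read.
blacklist_ipmi_si() {
    conf=${1:-/etc/modprobe.d/blacklist-ipmi_si.conf}
    printf 'blacklist ipmi_si\n' > "$conf"
}
```

If the module is also included in the initramfs, the initramfs may need to be regenerated with the distribution's own tool before the blacklist takes effect at boot.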
Comment 7 Torsten Rabold 2020-01-16 16:36:36 UTC
I tried again with v5.4.8 on the Dell R630.

The CPU lockup did not show up.

$ ipmitool sensor list
runs without error and shows the expected results.

I can't say in which version the issue disappeared. The last version I know
did not work was v5.3.1.

The issue is gone as of v5.4.8 at the latest.
