Bug 204071

Summary: Dell PowerEdge R630 / 02C2CP - BUG: soft lockup - CPU stuck - Due to IPMI Driver dmi_add_platform_ipmi() patch
Product: Drivers Reporter: Torsten Rabold (torsten.rabold)
Component: Other Assignee: drivers_other
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: april, dan.poltawski, danny, walecha99
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: v5.1-rc6+ Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg, lshw, dmidecode, firmware and strace of ipmitool

Description Torsten Rabold 2019-07-05 15:16:29 UTC
Created attachment 283555
dmesg, lshw, dmidecode, firmware and strace of ipmitool

Linux kernels from v5.1-rc6 up to the latest 5.2-rc7 have an issue in the
ipmi driver that leads to a CPU lockup on Dell PowerEdge R630/02C2CP, BIOS 2.9.1 12/04/2018.

Systems become completely unusable.

watchdog: BUG: soft lockup - CPU#5 stuck
-> more dmesg, hw/fw , strace info attached

The issue persists up to latest 5.2-rc kernels.
The issue was introduced with commit bd2e98b351b668fa91
   ipmi: Fix failure on SMBIOS specified devices

CPU stall can be triggered by:
$ ipmitool sensor list

Issue does not show up on DELL640 servers.

5.1.16 kernels with bd2e98b35 reverted run flawlessly.

---

Anyway, the patch seems to only remove one of the hacks in drivers/char/ipmi/ipmi_dmi.c:
// SPDX-License-Identifier: GPL-2.0+
/*
 * A hack to create a platform device from a DMI entry.  This will
 * allow autoloading of the IPMI drive based on SMBIOS entries.
 */
 ...
Comment 1 Dan Poltawski 2019-07-10 13:31:29 UTC
I believe I'm seeing a similar issue on 5.1.16 on a DELL620.
Comment 2 Dan Poltawski 2019-07-10 13:31:46 UTC
*Poweredge R620
Comment 3 April 2019-08-06 23:57:40 UTC
If you are desperate and this is a blocker, you can add the following kernel command-line parameters as a mitigation:

ipmi_si.trydmi=0 ipmi_si.tryacpi=0 ipmi_si.tryplatform=0
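
For anyone applying this, here is a minimal Python sketch for checking that all three parameters actually made it onto a kernel command line. The helper name `missing_mitigations` and the sample command line are made up for illustration; on a running system the string would typically be read from /proc/cmdline.

```python
# Parameters from the mitigation above, with the value each one should have.
MITIGATION_PARAMS = {
    "ipmi_si.trydmi": "0",
    "ipmi_si.tryacpi": "0",
    "ipmi_si.tryplatform": "0",
}


def missing_mitigations(cmdline):
    """Return the mitigation parameters that are absent or not set to 0."""
    seen = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:  # only key=value tokens are of interest here
            seen[key] = value
    return [k for k, v in MITIGATION_PARAMS.items() if seen.get(k) != v]


# Hypothetical command line with only one of the three parameters set.
sample = "root=/dev/sda1 ro ipmi_si.trydmi=0"
print(missing_mitigations(sample))
```

An empty list means all three parameters are present with the value 0.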
Comment 4 Torsten Rabold 2019-08-08 15:07:43 UTC
Mitigation boot cmdline parameters applied to a 5.2.7 kernel.
Outcome:
The CPU lockups do not occur but the ipmi_si kernel module is not loading.
IPMI sensor data can't be read.
Comment 5 Torsten Rabold 2019-08-08 15:59:57 UTC
The last released kernel with working IPMI and no CPU lockup on the R630 is 5.0.9.

The regression starts with 5.1.0-rc1: with this version the ipmi_si module already fails to load, but lockups do not occur.

The CPU lockups/stalls appear with v5.1-rc6.

5.3-rc3 still has the CPU stalls/lockups.
dmesg:
[  150.878229] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  150.884891] rcu:     5-...0: (2 ticks this GP) idle=092/0/0x1 softirq=3032/3032 fqs=2578 
[  150.893768] rcu:     12-...0: (2 GPs behind) idle=686/0/0x1 softirq=4613/4613 fqs=2579 
[  150.902450]  (detected by 0, t=5258 jiffies, g=21757, q=1886)
[  150.908891] Sending NMI from CPU 0 to CPUs 5:
[  150.914759] NMI backtrace for cpu 5
[  150.914760] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.3.0-rc3-dpx #1
[  150.914761] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.9.1 12/04/2018
[  150.914761] RIP: 0010:queued_spin_lock_slowpath+0x17c/0x1d0
[  150.914763] Code: 48 03 34 c5 80 07 de 81 48 89 16 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 07 0f 18 0e eb 02 f3 90 <8b> 07 66 85 c0 75 f7 41 89 c0 66 31 c0 39 c1 74 2a 48 85 f6 c6 07
[  150.914763] RSP: 0018:ffffc900065c0e80 EFLAGS: 00000002
[  150.914765] RAX: 0000000000180101 RBX: 0000000000000206 RCX: 0000000000180000
[  150.914765] RDX: ffff88a03cd297c0 RSI: 0000000000000000 RDI: ffff88903b73c620
[  150.914766] RBP: ffff88903b73c600 R08: 0000000000180000 R09: 0000000000000000
[  150.914766] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88903b73c6e8
[  150.914767] R13: ffff88903b73c620 R14: 00000000ffff5609 R15: 0000000000000000
[  150.914768] FS:  0000000000000000(0000) GS:ffff88a03cd00000(0000) knlGS:0000000000000000
[  150.914768] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  150.914769] CR2: 00007fe252887ef8 CR3: 000000207f60a004 CR4: 00000000001606e0
[  150.914769] Call Trace:
[  150.914770]  <IRQ>
[  150.914770]  _raw_spin_lock_irqsave+0x27/0x30
[  150.914771]  smi_timeout+0x24/0xc0 [ipmi_si]
[  150.914771]  ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
[  150.914772]  call_timer_fn+0x2d/0x140
[  150.914772]  run_timer_softirq+0x1e5/0x430
[  150.914772]  ? tick_sched_handle+0x25/0x60
[  150.914773]  ? tick_sched_timer+0x37/0x70
[  150.914773]  ? __hrtimer_run_queues+0x10c/0x270
[  150.914774]  __do_softirq+0x117/0x2d0
[  150.914774]  irq_exit+0x92/0xa0
[  150.914775]  smp_apic_timer_interrupt+0x6c/0x130
[  150.914775]  apic_timer_interrupt+0xf/0x20
[  150.914775]  </IRQ>
[  150.914776] RIP: 0010:cpuidle_enter_state+0xc5/0x400
[  150.914777] Code: c7 0f 1f 44 00 00 31 ff e8 08 8a bc ff 80 7c 24 0f 00 74 12 9c 58 f6 c4 02 0f 85 15 03 00 00 31 ff e8 9f 9d c1 ff fb 45 85 f6 <0f> 88 7d 02 00 00 4c 2b 7c 24 10 49 63 ce 48 ba cf f7 53 e3 a5 9b
[  150.914778] RSP: 0018:ffffc90006463e68 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[  150.914779] RAX: ffff88a03cd28bc0 RBX: ffffe8ffff9118c0 RCX: 000000000000001f
[  150.914779] RDX: 0000000000000000 RSI: 000e5a451fa0e6ac RDI: 0000000000000000
[  150.914780] RBP: ffffffff820bebc0 R08: fffa7aef6f49d076 R09: 0000000000001f19
[  150.914780] R10: ffff88a03cd27c24 R11: 0000000000000018 R12: 0000000000000005
[  150.914781] R13: 0000000000000005 R14: 0000000000000004 R15: 0000001e3af927cd
[  150.914781]  ? cpuidle_enter_state+0xa8/0x400
[  150.914782]  cpuidle_enter+0x29/0x40
[  150.914782]  do_idle+0x1e2/0x220
[  150.914783]  cpu_startup_entry+0x19/0x20
[  150.914783]  start_secondary+0x153/0x1a0
[  150.914784]  secondary_startup_64+0xa4/0xb0
[  150.914791] Sending NMI from CPU 0 to CPUs 12:
[  151.208741] NMI backtrace for cpu 12
[  151.208742] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 5.3.0-rc3-dpx #1
[  151.208742] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.9.1 12/04/2018
[  151.208743] RIP: 0010:queued_spin_lock_slowpath+0x61/0x1d0
[  151.208744] Code: f0 0f ba 2f 08 0f 82 6b 01 00 00 8b 37 81 e6 ff 00 ff ff 09 f0 a9 00 ff ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 ff 75 04 c6 47 01 00
[  151.208745] RSP: 0018:ffffc900066f4e38 EFLAGS: 00000002
[  151.208746] RAX: 0000000000180101 RBX: 0000000000000002 RCX: 0000000000000000
[  151.208746] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88903b73c620
[  151.208747] RBP: ffff88903b73c620 R08: 0000000000000001 R09: 0000000000000000
[  151.208747] R10: 0000000000000000 R11: 0000000000000000 R12: ffff889039fcdec8
[  151.208748] R13: 0000000000000086 R14: ffff889039fcd1f0 R15: 00000000000000cc
[  151.208748] FS:  0000000000000000(0000) GS:ffff88903f900000(0000) knlGS:0000000000000000
[  151.208749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  151.208750] CR2: 00007fe1923b0050 CR3: 000000207f60a002 CR4: 00000000001606e0
[  151.208750] Call Trace:
[  151.208750]  <IRQ>
[  151.208751]  _raw_spin_lock_irqsave+0x27/0x30
[  151.208751]  set_need_watch+0x27/0x60 [ipmi_si]
[  151.208752]  smi_remove_watch+0x9e/0x100 [ipmi_msghandler]
[  151.208752]  ipmi_smi_msg_received+0x1ed/0x300 [ipmi_msghandler]
[  151.208753]  smi_event_handler+0x13f/0x5f0 [ipmi_si]
[  151.208753]  ipmi_si_irq_handler+0x35/0x70 [ipmi_si]
[  151.208754]  __handle_irq_event_percpu+0x81/0x190
[  151.208754]  handle_irq_event_percpu+0x30/0x80
[  151.208755]  handle_irq_event+0x2d/0x50
[  151.208755]  handle_edge_irq+0x93/0x200
[  151.208756]  handle_irq+0x1f/0x30
[  151.208756]  do_IRQ+0x41/0xd0
[  151.208756]  common_interrupt+0xf/0xf
[  151.208757]  </IRQ>
[  151.208757] RIP: 0010:cpuidle_enter_state+0xc5/0x400
[  151.208758] Code: c7 0f 1f 44 00 00 31 ff e8 08 8a bc ff 80 7c 24 0f 00 74 12 9c 58 f6 c4 02 0f 85 15 03 00 00 31 ff e8 9f 9d c1 ff fb 45 85 f6 <0f> 88 7d 02 00 00 4c 2b 7c 24 10 49 63 ce 48 ba cf f7 53 e3 a5 9b
[  151.208759] RSP: 0018:ffffc9000649be68 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdc
[  151.208760] RAX: ffff88903f928bc0 RBX: ffffe8f0025118c0 RCX: 000000000000001f
[  151.208760] RDX: 0000000000000000 RSI: 000e5a451e816b09 RDI: 0000000000000000
[  151.208761] RBP: ffffffff820bebc0 R08: fffa7aef6f49d076 R09: 0000001deb083ab6
[  151.208762] R10: ffff88903f927c24 R11: 0000000000000007 R12: 000000000000000c
[  151.208762] R13: 000000000000000c R14: 0000000000000004 R15: 0000001e3a8a95bf
[  151.208763]  ? cpuidle_enter_state+0xa8/0x400
[  151.208763]  cpuidle_enter+0x29/0x40
[  151.208764]  do_idle+0x1e2/0x220
[  151.208764]  cpu_startup_entry+0x19/0x20
[  151.208765]  start_secondary+0x153/0x1a0
[  151.208765]  secondary_startup_64+0xa4/0xb0
[  184.258082] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [kworker/6:1:189]
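
To make the two NMI backtraces above easier to compare, here is a rough Python sketch that lists the IPMI driver frames per CPU. The function name `ipmi_frames_by_cpu` and the regular expressions are illustrative assumptions about this particular log layout, not part of any kernel tooling. Frames marked with `?` (unreliable) are skipped.

```python
import re

# "NMI backtrace for cpu N" starts a new per-CPU section in the log.
CPU_HEADER = re.compile(r"NMI backtrace for cpu (\d+)")
# A stack frame that belongs to an ipmi_* module, e.g.
# "[  150.914771]  smi_timeout+0x24/0xc0 [ipmi_si]"
FRAME = re.compile(
    r"\]\s+(?P<unreliable>\?\s+)?(?P<func>\w+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+\s+\[ipmi_\w+\]"
)


def ipmi_frames_by_cpu(dmesg_text):
    """Map each CPU with an NMI backtrace to its reliable ipmi_* frames."""
    frames = {}
    cpu = None
    for line in dmesg_text.splitlines():
        header = CPU_HEADER.search(line)
        if header:
            cpu = int(header.group(1))
            frames[cpu] = []
            continue
        frame = FRAME.search(line)
        if frame and cpu is not None and not frame.group("unreliable"):
            frames[cpu].append(frame.group("func"))
    return frames


# A few lines copied from the dmesg output above.
SAMPLE = """\
[  150.914759] NMI backtrace for cpu 5
[  150.914770]  _raw_spin_lock_irqsave+0x27/0x30
[  150.914771]  smi_timeout+0x24/0xc0 [ipmi_si]
[  150.914771]  ? ipmi_si_irq_handler+0x70/0x70 [ipmi_si]
[  151.208741] NMI backtrace for cpu 12
[  151.208751]  set_need_watch+0x27/0x60 [ipmi_si]
[  151.208752]  smi_remove_watch+0x9e/0x100 [ipmi_msghandler]
"""
print(ipmi_frames_by_cpu(SAMPLE))
```

On the full trace above, this lists smi_timeout for CPU 5 and the interrupt path ending in set_need_watch for CPU 12. In both backtraces those frames sit directly below _raw_spin_lock_irqsave.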
Comment 6 Daniel Suchy 2019-10-18 08:54:37 UTC
A similar issue is observed on a Dell R410 after loading the ipmi_si module. When the module is blacklisted, no problems are observed; the lockup appears shortly after ipmi_si is loaded.
Comment 7 Torsten Rabold 2020-01-16 16:36:36 UTC
I tried again with v5.4.8 on the Dell R630.

The CPU lockup did not show up.

$ ipmitool sensor list
runs without error and shows the expected results.

I can't say in which version the issue disappeared. The last version
known to me that did not work was v5.3.1.

The issue is gone with at least v5.4.8 and higher.