73181 – full system hang when resetting HP DL380 Gen 7 ILO

Status:	NEEDINFO

Alias:	None

Product:	Drivers
Classification:	Unclassified
Component:	Other
Hardware:	All Linux

Importance:	P1 normal
Assignee:	drivers_other

URL:
Keywords:

Depends on:
Blocks:

Reported:	2014-03-30 13:12 UTC by Patrick Schaaf
Modified:	2022-01-06 23:34 UTC
CC List:	2 users

See Also:
Kernel Version:	3.10.34, 3.10.40
Subsystem:
Regression:	No
Bisected commit-id:

Description Patrick Schaaf 2014-03-30 13:12:26 UTC

Context: vanilla kernel 3.10.34, ILO driver compiled in but not consciously used in any way. HP DL380 Gen 7 servers (system ROM P67) with ILO 3 (firmware 1.28)

Trigger: changing the NTP server setting through the ILO web interface, then confirming the ILO reboot that appears necessary for such a deep change...

Result: ILO resets, as seen by pings, and comes back up; about 5-10 seconds after ILO comes back up, the server freezes completely. During this freeze I can access the ILO console again and see the machine sitting at the login prompt, but it doesn't take any keystrokes. No kernel messages are visible (OS is openSUSE 11.4, no systemd; I have seen kernel crash dumps show up in the past, but not here).

Repeatable: yes. Happened on one system; reproduced it twice on a second, identical system, once with the slightly older 3.10.32 kernel we ran previously.

I'm pretty sure that this did not happen back when we ran 3.0.x kernels.

Comment 1 Patrick Schaaf 2014-03-30 13:27:44 UTC

Tested on a third system, a DL360p Gen8 with ILO 4, same kernel - NO ISSUE THERE

Comment 2 Alan 2014-04-08 10:18:31 UTC

Is it reproducible without the ILO driver loaded?

Comment 3 Patrick Schaaf 2014-05-20 10:12:18 UTC

(In reply to Alan from comment #2)
> Is it reproducible without the ILO driver loaded?

Got around to testing that now. Kernel 3.10.40, this time without the ILO driver and without the ILO Watchdog driver.

Result: kernel does NOT hang anymore.

However, at about the time of the previous hang, a ping test shows 2 + 1 second latency before returning to normal, and the kernel gives the following NMI error:

[582464.058346] NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
[582464.058355] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G          I  3.10.40-eightball #1
[582464.058360] Hardware name: HP ProLiant DL380 G7, BIOS P67 05/05/2011
[582464.058364] task: ffffffff81a10440 ti: ffffffff81a00000 task.ti: ffffffff81a00000
[582464.058368] RIP: 0010:[<ffffffff8128ddc3>]  [<ffffffff8128ddc3>] intel_idle+0xd3/0x140
[582464.058380] RSP: 0018:ffffffff81a01e88  EFLAGS: 00000046
[582464.058383] RAX: 0000000000000020 RBX: 0000000000000008 RCX: 0000000000000001
[582464.058388] RDX: 0000000000000000 RSI: ffffffff81a342e0 RDI: 0000000000000000
[582464.058392] RBP: ffffffff81a01eb0 R08: 0000000000049390 R09: 0000000000000018
[582464.058396] R10: 00000000000075b8 R11: 0000000000000000 R12: 0000000000000004
[582464.058401] R13: 0000000000000020 R14: 0000000000000003 R15: ffffffff81a34458
[582464.058405] FS:  0000000000000000(0000) GS:ffff88481fa00000(0000) knlGS:0000000000000000
[582464.058410] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[582464.058414] CR2: 00007fe5e8304000 CR3: 0000000001a0b000 CR4: 00000000000007e0
[582464.058418] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[582464.058423] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[582464.058426] Stack:
[582464.058429]  0000000081a01eb0 ffff88481fa17500 ffffffff81a342e0 0002118109a35048
[582464.058436]  0000000000000004 ffffffff81a01ee8 ffffffff8145462b ffff88481fa17500
[582464.058442]  0000000000000004 ffffffff81a342e0 0000000000000000 ffffffff81a01fd8
[582464.058449] Call Trace:
[582464.058460]  [<ffffffff8145462b>] cpuidle_enter_state+0x3b/0xc0
[582464.058463]  [<ffffffff8145475d>] cpuidle_idle_call+0xad/0x150
[582464.058467]  [<ffffffff8103f689>] arch_cpu_idle+0x9/0x20
[582464.058472]  [<ffffffff810b5a6d>] cpu_startup_entry+0x8d/0x170
[582464.058476]  [<ffffffff81572a3d>] rest_init+0x6d/0x70
[582464.058481]  [<ffffffff81a88df1>] start_kernel+0x386/0x392
[582464.058485]  [<ffffffff81a88874>] ? repair_env_string+0x5c/0x5c
[582464.058488]  [<ffffffff81a885ad>] x86_64_start_reservations+0x2a/0x2c
[582464.058491]  [<ffffffff81a8867b>] x86_64_start_kernel+0xcc/0xcf
[582464.058493] Code: 48 89 d1 48 2d c8 1f 00 00 0f 01 c8 0f ae f0 65 48 8b 04 25 b0 b7 00 00 48 8b 80 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <85> 1d 07 65 7a 00 75 0e 48 8d 75 dc bf 05 00 00 00 e8 87 e6 e2 
[582466.063969] dmar: DRHD: handling fault status reg 2
[582466.063974] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000 
[582466.063974] DMAR:[fault reason 06] PTE Read access is not set
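
For anyone reading the "for reason 61" part of the NMI line: the value is the byte the x86 kernel reads from the legacy NMI status port 0x61, where bit 0x80 flags a PCI SERR and bit 0x40 flags an I/O channel check (IOCHK). The following is only a minimal sketch for decoding such a byte by hand; the helper name decode_nmi_reason is made up for illustration and is not part of any kernel tooling.

```python
# Bit masks as the x86 kernel names them (NMI_REASON_SERR / NMI_REASON_IOCHK).
# The helper below is a hypothetical convenience, not an existing tool.
NMI_REASON_SERR = 0x80   # PCI system error
NMI_REASON_IOCHK = 0x40  # I/O channel check


def decode_nmi_reason(reason_hex):
    """Return the error sources flagged in a hex NMI reason byte."""
    value = int(reason_hex, 16)
    sources = []
    if value & NMI_REASON_SERR:
        sources.append("SERR")
    if value & NMI_REASON_IOCHK:
        sources.append("IOCHK")
    return sources


if __name__ == "__main__":
    # "61" is the reason byte from the NMI line above.
    print("61", decode_nmi_reason("61"))
```

Under that reading, reason 61 has only the IOCHK bit set, which is consistent with the kernel labelling it an "IOCK error".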

Comment 4 Patrick Schaaf 2014-05-20 10:14:36 UTC

The three "dmar" messages then continue in an endless loop (while the system appears to be usable; ssh connections at least survived).

I don't trust it to start its VMs again, though - will reboot now.

[582466.063969] dmar: DRHD: handling fault status reg 2
[582466.063974] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000 
[582466.063974] DMAR:[fault reason 06] PTE Read access is not set
[582584.557186] dmar: DRHD: handling fault status reg 102
[582584.557199] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000 
[582584.557199] DMAR:[fault reason 06] PTE Read access is not set
[582584.557787] dmar: DRHD: handling fault status reg 202
[582584.557807] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000 
[582584.557807] DMAR:[fault reason 06] PTE Read access is not set
[582584.560458] dmar: DRHD: handling fault status reg 302
[582584.560468] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000 
[582584.560468] DMAR:[fault reason 06] PTE Read access is not set
[582584.560966] dmar: DRHD: handling fault status reg 402
[582584.560995] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000 
[582584.560995] DMAR:[fault reason 06] PTE Read access is not set
[582584.561416] dmar: DRHD: handling fault status reg 502
[582584.561427] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000 
[582584.561427] DMAR:[fault reason 06] PTE Read access is not set
[582584.561711] dmar: DRHD: handling fault status reg 602
[582584.561721] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000 
[582584.561721] DMAR:[fault reason 06] PTE Read access is not set
[582584.562514] dmar: DRHD: handling fault status reg 702
[582584.562543] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000 
[582584.562543] DMAR:[fault reason 06] PTE Read access is not set
... and so on
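
A rough sketch for summarising a loop like this from saved dmesg output, in case it helps with comparing machines. It counts faults per (device, access type, address) and splits the "fault status reg" value into fields. The field layout is an assumption taken from my reading of the VT-d Fault Status Register (bit 1 = primary pending fault, bits 15:8 = fault record index); the function names are hypothetical.

```python
import re
from collections import Counter

# Matches the "Request device [bb:dd.f] fault addr X" lines as logged above.
FAULT_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device "
    r"\[([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] fault addr ([0-9a-f]+)"
)


def tally_dmar_faults(log_text):
    """Count DMAR faults per (device, access, fault address)."""
    counts = Counter()
    for match in FAULT_RE.finditer(log_text):
        access, device, addr = match.groups()
        counts[(device, access, addr)] += 1
    return counts


def decode_fault_status(reg_hex):
    """Split a fault status value (assumed VT-d FSTS layout) into fields."""
    value = int(reg_hex, 16)
    return {
        "primary_pending_fault": bool(value & 0x2),
        "fault_record_index": (value >> 8) & 0xFF,
    }


if __name__ == "__main__":
    import sys
    # Usage: dmesg | python3 this_script.py
    for key, count in tally_dmar_faults(sys.stdin.read()).items():
        print(key, count)
```

If the assumed layout is right, the sequence 2, 102, 202, ... would just be the fault record index advancing by one per fault while the same device keeps reading the same address.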

Comment 5 Patrick Schaaf 2014-05-20 13:46:44 UTC

Retested with that 3.10.40 kernel and ILO4 / DL360 Gen8 :
no system hang
that DMAR message (three lines) appears _once_

Also tested with the same kernel and an ILO2 / DL360 Gen6 :
no issues at all

Comment 6 Patrick Schaaf 2014-05-20 14:09:57 UTC

Sorry, error in previous message: an ILO2 / DL360 Gen5 (not Gen6) shows no symptoms

Final data point: ILO2 / DL380 Gen6 (now for real) :

no system hang, but messages:

[594728.833980] dmar: DRHD: handling fault status reg 2
[594728.834050] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834050] DMAR:[fault reason 06] PTE Read access is not set
[594728.834060] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834060] DMAR:[fault reason 06] PTE Read access is not set
[594728.834069] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834069] DMAR:[fault reason 06] PTE Read access is not set
[594728.834078] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[594728.834080] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834080] DMAR:[fault reason 06] PTE Read access is not set
[594728.834086] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834086] DMAR:[fault reason 06] PTE Read access is not set
[594728.834090] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834090] DMAR:[fault reason 06] PTE Read access is not set
[594728.834094] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834094] DMAR:[fault reason 06] PTE Read access is not set
[594728.834099] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834099] DMAR:[fault reason 06] PTE Read access is not set
[594728.834104] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834104] DMAR:[fault reason 06] PTE Read access is not set
... repeats a bit more ...
[594728.834427] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000 
[594728.834427] DMAR:[fault reason 06] PTE Read access is not set
[594728.834430] dmar: DRHD: handling fault status reg 200
[594728.834451] Dazed and confused, but trying to continue
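
Since the faulting requester differs between machines ([02:00.2] on the Gen7, [00:1e.0] here), a small helper for turning the bracketed bus:device.function string into numbers and a sysfs path may be handy when looking the device up. This is only a sketch; it assumes PCI domain 0000, which the log lines do not state, and the function names are hypothetical.

```python
def parse_bdf(bdf):
    """Split a 'bb:dd.f' string into (bus, device, function) integers."""
    bus, rest = bdf.split(":")
    dev, fn = rest.split(".")
    return int(bus, 16), int(dev, 16), int(fn, 16)


def sysfs_path(bdf, domain="0000"):
    """Build the sysfs directory for a device; domain 0000 is an assumption."""
    return "/sys/bus/pci/devices/{}:{}".format(domain, bdf)


if __name__ == "__main__":
    for bdf in ("02:00.2", "00:1e.0"):
        print(bdf, parse_bdf(bdf), sysfs_path(bdf))
```

The same string can be passed to lspci directly, e.g. "lspci -s 02:00.2", to see which device the DMAR unit is complaining about.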

Comment 7 Roland Kletzing 2022-01-06 23:34:00 UTC

https://bugzilla.kernel.org/show_bug.cgi?id=70371
