Bug 73181
Summary: | full system hang when resetting HP DL380 Gen 7 ILO | ||
---|---|---|---|
Product: | Drivers | Reporter: | Patrick Schaaf (kernelorg) |
Component: | Other | Assignee: | drivers_other |
Status: | NEEDINFO --- | ||
Severity: | normal | CC: | alan, devzero |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.10.34, 3.10.40 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Description
Patrick Schaaf
2014-03-30 13:12:26 UTC
Tested on a third system, a DL360p Gen8 with ILO 4, same kernel - NO ISSUE THERE

Is it reproducible without the ILO driver loaded?

(In reply to Alan from comment #2)
> Is it reproducible without the ILO driver loaded?

Got around to testing that now. Kernel 3.10.40, this time without the ILO driver and without the ILO Watchdog driver.

Result: the kernel does NOT hang anymore. However, at about the time of the previous hang, a ping test shows 2 + 1 second latency before returning to normal, and the kernel gives the following NMI error:

[582464.058346] NMI: IOCK error (debug interrupt?) for reason 61 on CPU 0.
[582464.058355] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G I 3.10.40-eightball #1
[582464.058360] Hardware name: HP ProLiant DL380 G7, BIOS P67 05/05/2011
[582464.058364] task: ffffffff81a10440 ti: ffffffff81a00000 task.ti: ffffffff81a00000
[582464.058368] RIP: 0010:[<ffffffff8128ddc3>] [<ffffffff8128ddc3>] intel_idle+0xd3/0x140
[582464.058380] RSP: 0018:ffffffff81a01e88 EFLAGS: 00000046
[582464.058383] RAX: 0000000000000020 RBX: 0000000000000008 RCX: 0000000000000001
[582464.058388] RDX: 0000000000000000 RSI: ffffffff81a342e0 RDI: 0000000000000000
[582464.058392] RBP: ffffffff81a01eb0 R08: 0000000000049390 R09: 0000000000000018
[582464.058396] R10: 00000000000075b8 R11: 0000000000000000 R12: 0000000000000004
[582464.058401] R13: 0000000000000020 R14: 0000000000000003 R15: ffffffff81a34458
[582464.058405] FS: 0000000000000000(0000) GS:ffff88481fa00000(0000) knlGS:0000000000000000
[582464.058410] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[582464.058414] CR2: 00007fe5e8304000 CR3: 0000000001a0b000 CR4: 00000000000007e0
[582464.058418] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[582464.058423] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[582464.058426] Stack:
[582464.058429] 0000000081a01eb0 ffff88481fa17500 ffffffff81a342e0 0002118109a35048
[582464.058436] 0000000000000004 ffffffff81a01ee8 ffffffff8145462b ffff88481fa17500
[582464.058442] 0000000000000004 ffffffff81a342e0 0000000000000000 ffffffff81a01fd8
[582464.058449] Call Trace:
[582464.058460] [<ffffffff8145462b>] cpuidle_enter_state+0x3b/0xc0
[582464.058463] [<ffffffff8145475d>] cpuidle_idle_call+0xad/0x150
[582464.058467] [<ffffffff8103f689>] arch_cpu_idle+0x9/0x20
[582464.058472] [<ffffffff810b5a6d>] cpu_startup_entry+0x8d/0x170
[582464.058476] [<ffffffff81572a3d>] rest_init+0x6d/0x70
[582464.058481] [<ffffffff81a88df1>] start_kernel+0x386/0x392
[582464.058485] [<ffffffff81a88874>] ? repair_env_string+0x5c/0x5c
[582464.058488] [<ffffffff81a885ad>] x86_64_start_reservations+0x2a/0x2c
[582464.058491] [<ffffffff81a8867b>] x86_64_start_kernel+0xcc/0xcf
[582464.058493] Code: 48 89 d1 48 2d c8 1f 00 00 0f 01 c8 0f ae f0 65 48 8b 04 25 b0 b7 00 00 48 8b 80 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <85> 1d 07 65 7a 00 75 0e 48 8d 75 dc bf 05 00 00 00 e8 87 e6 e2
[582466.063969] dmar: DRHD: handling fault status reg 2
[582466.063974] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582466.063974] DMAR:[fault reason 06] PTE Read access is not set

The three "dmar" messages then continue in an endless loop (while the system appears to be usable; ssh connections at least survived).

I don't trust it to start its VMs again, though, so I will reboot now.
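
For anyone reading the "for reason 61" part of the NMI line above: the value is the status byte the kernel reads from I/O port 0x61. The following is a minimal sketch of decoding it, assuming the bit masks used by the Linux x86 headers (`NMI_REASON_SERR` = 0x80, `NMI_REASON_IOCHK` = 0x40). The function name `decode_nmi_reason` is made up for illustration and is not part of any kernel or library API.

```python
# Assumed bit masks for the NMI status byte (I/O port 0x61), mirroring the
# names used in the Linux x86 headers. Treat them as assumptions to verify
# against the kernel source for the version in question.
NMI_REASON_SERR = 0x80   # PCI system error (SERR)
NMI_REASON_IOCHK = 0x40  # I/O channel check (printed as "IOCK error")


def decode_nmi_reason(reason: int) -> list[str]:
    """Return the NMI causes flagged in a port-0x61 status byte."""
    causes = []
    if reason & NMI_REASON_SERR:
        causes.append("SERR")
    if reason & NMI_REASON_IOCHK:
        causes.append("IOCHK")
    return causes


if __name__ == "__main__":
    # Reason value taken from the NMI line in the log above.
    print(hex(0x61), decode_nmi_reason(0x61))
```

Under these assumptions, reason 61 has only the I/O check bit set among the two cause bits, which is consistent with the kernel reporting it as an IOCK error rather than a PCI system error.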
[582466.063969] dmar: DRHD: handling fault status reg 2
[582466.063974] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582466.063974] DMAR:[fault reason 06] PTE Read access is not set
[582584.557186] dmar: DRHD: handling fault status reg 102
[582584.557199] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582584.557199] DMAR:[fault reason 06] PTE Read access is not set
[582584.557787] dmar: DRHD: handling fault status reg 202
[582584.557807] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582584.557807] DMAR:[fault reason 06] PTE Read access is not set
[582584.560458] dmar: DRHD: handling fault status reg 302
[582584.560468] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582584.560468] DMAR:[fault reason 06] PTE Read access is not set
[582584.560966] dmar: DRHD: handling fault status reg 402
[582584.560995] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582584.560995] DMAR:[fault reason 06] PTE Read access is not set
[582584.561416] dmar: DRHD: handling fault status reg 502
[582584.561427] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582584.561427] DMAR:[fault reason 06] PTE Read access is not set
[582584.561711] dmar: DRHD: handling fault status reg 602
[582584.561721] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582584.561721] DMAR:[fault reason 06] PTE Read access is not set
[582584.562514] dmar: DRHD: handling fault status reg 702
[582584.562543] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582584.562543] DMAR:[fault reason 06] PTE Read access is not set
... and so on

Retested with that 3.10.40 kernel and ILO4 / DL360 Gen8: no system hang; that DMAR message (three lines) appears _once_.

Also tested with the same kernel and an ILO2 / DL360 Gen6: no issues at all.

Sorry, error in previous message: an ILO2 / DL360 Gen5 (not Gen6) shows no symptoms.

Final data point: ILO2 / DL380 Gen6 (now for real): no system hang, but messages:

[594728.833980] dmar: DRHD: handling fault status reg 2
[594728.834050] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834050] DMAR:[fault reason 06] PTE Read access is not set
[594728.834060] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834060] DMAR:[fault reason 06] PTE Read access is not set
[594728.834069] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834069] DMAR:[fault reason 06] PTE Read access is not set
[594728.834078] NMI: PCI system error (SERR) for reason a1 on CPU 0.
[594728.834080] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834080] DMAR:[fault reason 06] PTE Read access is not set
[594728.834086] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834086] DMAR:[fault reason 06] PTE Read access is not set
[594728.834090] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834090] DMAR:[fault reason 06] PTE Read access is not set
[594728.834094] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834094] DMAR:[fault reason 06] PTE Read access is not set
[594728.834099] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834099] DMAR:[fault reason 06] PTE Read access is not set
[594728.834104] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834104] DMAR:[fault reason 06] PTE Read access is not set
... repeats a bit more ...
[594728.834427] dmar: DMAR:[DMA Read] Request device [00:1e.0] fault addr e8000
[594728.834427] DMAR:[fault reason 06] PTE Read access is not set
[594728.834430] dmar: DRHD: handling fault status reg 200
[594728.834451] Dazed and confused, but trying to continue
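
When comparing machines as in the comments above, it can help to condense the repeating DMAR lines into a count per requesting device plus a decoded view of each "fault status reg" value. The following is a rough sketch that assumes the value is printed in hex and follows the VT-d fault status register layout (bit 1 = primary pending fault, bits 15:8 = fault record index). The helper names `decode_fault_status` and `summarize` are hypothetical, and `SAMPLE` is a short excerpt copied from the logs in this report.

```python
import re
from collections import Counter

# Patterns for the two dmar line types seen in the logs of this report.
DEVICE_RE = re.compile(r"Request device \[([0-9a-f:.]+)\] fault addr ([0-9a-f]+)")
STATUS_RE = re.compile(r"handling fault status reg ([0-9a-f]+)")


def decode_fault_status(value: int) -> dict:
    """Split a fault status value into fields (assumed VT-d layout)."""
    return {
        "raw": value,
        "ppf": bool(value & 0x2),    # primary pending fault bit
        "fri": (value >> 8) & 0xFF,  # fault record index
    }


def summarize(log_text: str):
    """Count faults per requesting device and decode each status value."""
    devices = Counter(m.group(1) for m in DEVICE_RE.finditer(log_text))
    statuses = [
        decode_fault_status(int(m.group(1), 16))
        for m in STATUS_RE.finditer(log_text)
    ]
    return devices, statuses


# Excerpt from the DL380 G7 log above.
SAMPLE = """\
[582466.063969] dmar: DRHD: handling fault status reg 2
[582466.063974] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
[582584.557186] dmar: DRHD: handling fault status reg 102
[582584.557199] dmar: DMAR:[DMA Read] Request device [02:00.2] fault addr e8000
"""

if __name__ == "__main__":
    devices, statuses = summarize(SAMPLE)
    for device, count in devices.items():
        print(f"device {device}: {count} fault(s)")
    for status in statuses:
        print(status)
```

Read this way, the sequence 2, 102, 202, ... 702 on the DL380 G7 would correspond to the fault record index advancing by one with each new fault. That is an interpretation under the stated layout assumption, not something confirmed in this thread.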