While running: ipmitool mc reset cold
-------------------------------------

We get this error when the iLO comes back up (dmesg):
------------------------------------------------------
[ 2587.523381] usb 2-1.3: USB disconnect, device number 5
[ 2628.170982] dmar: DRHD: handling fault status reg 302
[ 2628.171204] dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
[ 2628.171204] DMAR:[fault reason 06] PTE Read access is not set
[ 2628.171583] dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
[ 2628.171583] DMAR:[fault reason 06] PTE Read access is not set
[ 2628.171987] dmar: DRHD: handling fault status reg 500
[ 2628.173830] dmar: DRHD: handling fault status reg 502
[ 2628.174041] dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
[ 2628.174041] DMAR:[fault reason 06] PTE Read access is not set

After rebuilding the kernel with CONFIG_DMA_API_DEBUG we get this:
------------------------------------------------------------------
[ 145.037947] NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
[ 145.038248] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 3.12.8-nsas #3
[ 145.038250] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/20/2013
[ 145.038251] task: ffffffff81c13450 ti: ffffffff81c00000 task.ti: ffffffff81c00000
[ 145.038252] RIP: 0010:[<ffffffff810b157f>] [<ffffffff810b157f>] update_ts_time_stats+0x65/0x6a
[ 145.038255] RSP: 0018:ffff880fffa03f30 EFLAGS: 00000092
[ 145.038257] RAX: 0000000000000000 RBX: ffff880fffa0e700 RCX: 0000000000000000
[ 145.038258] RDX: 000000203901cc1d RSI: ffff880fffa0e700 RDI: 0000000000000000
[ 145.038259] RBP: 000000203901cc1d R08: 0000000000009610 R09: 0000000000000f75
[ 145.038260] R10: 000000000000b9e7 R11: 000000000000b9e7 R12: 0000000000000000
[ 145.038262] R13: 000000000022b47f R14: 000000000008c800 R15: 0000000000000000
[ 145.038263] FS: 0000000000000000(0000) GS:ffff880fffa00000(0000) knlGS:0000000000000000
[ 145.038264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 145.038266] CR2: 00007f1f048b4470 CR3: 0000000001c0c000 CR4: 00000000000407f0
[ 145.038267] Stack:
[ 145.038267] 0000000000000000 ffffffff81c01dc8 ffffffff810b15a4 ffff880fffa0e700
[ 145.038270] ffffffff810b1c28 0000000000000000 ffffffffffffffae 0000000000000005
[ 145.038273] ffffffff81074d5a 0000000000000000 ffffffff8103cd7e 0000001fe5d61a00
[ 145.038276] Call Trace:
[ 145.038277] <IRQ>
[ 145.038278] [<ffffffff810b15a4>] ? tick_nohz_stop_idle+0x20/0x2f
[ 145.038282] [<ffffffff810b1c28>] ? tick_check_idle+0x43/0x9c
[ 145.038285] [<ffffffff81074d5a>] ? irq_enter+0x43/0x5d
[ 145.038287] [<ffffffff8103cd7e>] ? do_IRQ+0x25/0xa7
[ 145.038290] [<ffffffff816af46d>] ? common_interrupt+0x6d/0x6d
[ 145.038291] <EOI>
[ 145.038292] [<ffffffff8156c8c0>] ? cpuidle_enter_state+0x46/0xb2
[ 145.038295] [<ffffffff8156c8b9>] ? cpuidle_enter_state+0x3f/0xb2
[ 145.038298] [<ffffffff8156c9fe>] ? cpuidle_idle_call+0xd2/0x10f
[ 145.038300] [<ffffffff81042e35>] ? arch_cpu_idle+0x6/0x1f
[ 145.038302] [<ffffffff810a4718>] ? cpu_startup_entry+0xfb/0x163
[ 145.038304] [<ffffffff81caad0f>] ? start_kernel+0x3ab/0x3b6
[ 145.038306] [<ffffffff81caa77d>] ? repair_env_string+0x54/0x54
[ 145.038308] [<ffffffff81caa120>] ? early_idt_handlers+0x120/0x120
[ 145.038310] [<ffffffff81caa120>] ? early_idt_handlers+0x120/0x120
[ 145.038312] [<ffffffff81caa59d>] ? x86_64_start_kernel+0xf0/0xfd
[ 145.038313] Code: 00 00 00 48 89 ab 80 00 00 00 4d 85 e4 74 16 48 89 ef e8 a0 2a fc ff 48 69 c0 40 42 0f 00 48 01 c2 49 89 14 24 48 83 c4 28 5b 5d <41> 5c 41 5d c3 48 63 c7 53 48 c7 c3 00 e7 00 00 48 03 1c c5 40
[ 147.026662] dmar: DRHD: handling fault status reg 2
[ 147.026947] dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
[ 147.026947] DMAR:[fault reason 06] PTE Read access is not set

The PCI address points to (lspci):
----------------------------------
01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Management Processor Support and Messaging (rev 05)
    Subsystem: Hewlett-Packard Company iLO4
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin B routed to IRQ 7
    Region 0: I/O ports at 3800 [size=256]
    Region 1: Memory at f6ff0000 (32-bit, non-prefetchable) [size=256]
    Region 2: Memory at f6e00000 (32-bit, non-prefetchable) [size=1M]
    Region 3: Memory at f6d80000 (32-bit, non-prefetchable) [size=512K]
    Region 4: Memory at f6d70000 (32-bit, non-prefetchable) [size=32K]
    Region 5: Memory at f6d60000 (32-bit, non-prefetchable) [size=32K]
    [virtual] Expansion ROM at f6d00000 [disabled] [size=64K]
    Capabilities: [78] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [b0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [c0] Express (v1) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <4us, L1 <4us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-

The kernel module involved is (lspci -k -s 01:00.2):
----------------------------------------------------
01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Slave Instrumentation & System Support (rev 05)
    Subsystem: Hewlett-Packard Company iLO4
    Kernel modules: hpilo

=> When "Processor Power and Utilization Monitoring" (BIOS: Service Options) is disabled, the error still occurs but the kernel no longer freezes.
=> When the IOMMU is disabled in the kernel, the bug disappears.
=> With iLO firmware 1.20, the error appears in dmesg, but the iLO health status remains green.
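
To check whether 01:00.2 is the only faulting device and how often it faults, the DMAR lines in dmesg can be tallied with a small script. This is a minimal sketch and not part of the original report: the function name summarize_dmar_faults and the regular expression are assumptions based only on the fault line format shown in the logs here, and other kernel versions may print the fault differently.

```python
import re
import sys
from collections import Counter

# Assumed line format, taken from the dmesg output in this report:
#   dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
FAULT_RE = re.compile(
    r"DMAR:\[DMA (?P<op>Read|Write)\] "
    r"Request device \[(?P<dev>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+)"
)


def summarize_dmar_faults(dmesg_text):
    """Count DMAR faults per (device, operation, fault address)."""
    counts = Counter()
    for match in FAULT_RE.finditer(dmesg_text):
        key = (match.group("dev"), match.group("op"), match.group("addr"))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Reads dmesg output from stdin and prints one summary line per key.
    faults = summarize_dmar_faults(sys.stdin.read())
    for (dev, op, addr), n in faults.most_common():
        print(f"{dev} DMA {op} fault addr 0x{addr}: {n}")
```

Saved under a hypothetical name such as dmar_faults.py, it could be fed with "dmesg | python3 dmar_faults.py".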
https://bugzilla.kernel.org/show_bug.cgi?id=73181