Bug 70371 - iommu: Kernel error while doing ipmitool mc reset warm on HP DL380 Gen8 with ILO Firmware > 1.20
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-11 12:42 UTC by Michel Pelzer
Modified: 2022-01-06 23:36 UTC
CC List: 4 users

See Also:
Kernel Version: 3.12.10
Subsystem:
Regression: No
Bisected commit-id:



Description Michel Pelzer 2014-02-11 12:42:58 UTC
While running: ipmitool mc reset cold
-------------------------------------

We get this error when the iLO comes back up: (dmesg)
-----------------------------------------------------
[ 2587.523381] usb 2-1.3: USB disconnect, device number 5
[ 2628.170982] dmar: DRHD: handling fault status reg 302
[ 2628.171204] dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
[ 2628.171204] DMAR:[fault reason 06] PTE Read access is not set
[ 2628.171583] dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
[ 2628.171583] DMAR:[fault reason 06] PTE Read access is not set
[ 2628.171987] dmar: DRHD: handling fault status reg 500
[ 2628.173830] dmar: DRHD: handling fault status reg 502
[ 2628.174041] dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
[ 2628.174041] DMAR:[fault reason 06] PTE Read access is not set
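
For anyone comparing logs from other machines, the fault lines above can be summarized with a short script. This is only a sketch: the regular expressions are fitted to the message format shown in this report, the function names are made up for illustration, and the status-register decoding assumes the usual VT-d layout (bit 0 = fault overflow, bit 1 = primary pending fault, bits 15:8 = fault record index) and that the value is printed in hex.

```python
import re

# Fitted to lines like:
#   dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
FAULT_RE = re.compile(
    r"DMAR:\[(?P<kind>DMA (?:Read|Write))\] Request device "
    r"\[(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)
# Fitted to lines like:
#   dmar: DRHD: handling fault status reg 302
STATUS_RE = re.compile(r"handling fault status reg (?P<value>[0-9a-f]+)")


def decode_fault_status(value):
    """Split a fault status value into the fields assumed above."""
    return {
        "overflow": bool(value & 0x1),        # assumed: bit 0
        "pending": bool(value & 0x2),         # assumed: bit 1
        "record_index": (value >> 8) & 0xFF,  # assumed: bits 15:8
    }


def parse_dmar_lines(lines):
    """Return (faults, statuses) found in an iterable of dmesg lines."""
    faults, statuses = [], []
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            faults.append(
                (match.group("kind"), match.group("device"),
                 int(match.group("addr"), 16))
            )
            continue
        match = STATUS_RE.search(line)
        if match:
            statuses.append(decode_fault_status(int(match.group("value"), 16)))
    return faults, statuses


if __name__ == "__main__":
    sample = [
        "[ 2628.170982] dmar: DRHD: handling fault status reg 302",
        "[ 2628.171204] dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000",
        "[ 2628.171987] dmar: DRHD: handling fault status reg 500",
    ]
    faults, statuses = parse_dmar_lines(sample)
    for kind, device, addr in faults:
        print(f"{kind} from {device} at {addr:#x}")
    for status in statuses:
        print(status)
```

It can be fed the saved output of dmesg line by line. In this report every fault comes from the same device (01:00.2) and the same address (e9000).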

After rebuilding the kernel with CONFIG_DMA_API_DEBUG we get this.
------------------------------------------------------------------
[  145.037947] NMI: IOCK error (debug interrupt?) for reason 71 on CPU 0.
[  145.038248] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W    3.12.8-nsas #3
[  145.038250] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 12/20/2013
[  145.038251] task: ffffffff81c13450 ti: ffffffff81c00000 task.ti: ffffffff81c00000
[  145.038252] RIP: 0010:[<ffffffff810b157f>]  [<ffffffff810b157f>] update_ts_time_stats+0x65/0x6a
[  145.038255] RSP: 0018:ffff880fffa03f30  EFLAGS: 00000092
[  145.038257] RAX: 0000000000000000 RBX: ffff880fffa0e700 RCX: 0000000000000000
[  145.038258] RDX: 000000203901cc1d RSI: ffff880fffa0e700 RDI: 0000000000000000
[  145.038259] RBP: 000000203901cc1d R08: 0000000000009610 R09: 0000000000000f75
[  145.038260] R10: 000000000000b9e7 R11: 000000000000b9e7 R12: 0000000000000000
[  145.038262] R13: 000000000022b47f R14: 000000000008c800 R15: 0000000000000000
[  145.038263] FS:  0000000000000000(0000) GS:ffff880fffa00000(0000) knlGS:0000000000000000
[  145.038264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  145.038266] CR2: 00007f1f048b4470 CR3: 0000000001c0c000 CR4: 00000000000407f0
[  145.038267] Stack:
[  145.038267]  0000000000000000 ffffffff81c01dc8 ffffffff810b15a4 ffff880fffa0e700
[  145.038270]  ffffffff810b1c28 0000000000000000 ffffffffffffffae 0000000000000005
[  145.038273]  ffffffff81074d5a 0000000000000000 ffffffff8103cd7e 0000001fe5d61a00
[  145.038276] Call Trace:
[  145.038277]  <IRQ>
[  145.038278]  [<ffffffff810b15a4>] ? tick_nohz_stop_idle+0x20/0x2f
[  145.038282]  [<ffffffff810b1c28>] ? tick_check_idle+0x43/0x9c
[  145.038285]  [<ffffffff81074d5a>] ? irq_enter+0x43/0x5d
[  145.038287]  [<ffffffff8103cd7e>] ? do_IRQ+0x25/0xa7
[  145.038290]  [<ffffffff816af46d>] ? common_interrupt+0x6d/0x6d
[  145.038291]  <EOI>
[  145.038292]  [<ffffffff8156c8c0>] ? cpuidle_enter_state+0x46/0xb2
[  145.038295]  [<ffffffff8156c8b9>] ? cpuidle_enter_state+0x3f/0xb2
[  145.038298]  [<ffffffff8156c9fe>] ? cpuidle_idle_call+0xd2/0x10f
[  145.038300]  [<ffffffff81042e35>] ? arch_cpu_idle+0x6/0x1f
[  145.038302]  [<ffffffff810a4718>] ? cpu_startup_entry+0xfb/0x163
[  145.038304]  [<ffffffff81caad0f>] ? start_kernel+0x3ab/0x3b6
[  145.038306]  [<ffffffff81caa77d>] ? repair_env_string+0x54/0x54
[  145.038308]  [<ffffffff81caa120>] ? early_idt_handlers+0x120/0x120
[  145.038310]  [<ffffffff81caa120>] ? early_idt_handlers+0x120/0x120
[  145.038312]  [<ffffffff81caa59d>] ? x86_64_start_kernel+0xf0/0xfd
[  145.038313] Code: 00 00 00 48 89 ab 80 00 00 00 4d 85 e4 74 16 48 89 ef e8 a0 2a fc ff 48 69 c0 40 42 0f 00 48 01 c2 49 89 14 24 48 83 c4 28 5b 5d <41> 5c 41 5d c3 48 63 c7 53 48 c7 c3 00 e7 00 00 48 03 1c c5 40
[  147.026662] dmar: DRHD: handling fault status reg 2
[  147.026947] dmar: DMAR:[DMA Read] Request device [01:00.2] fault addr e9000
[  147.026947] DMAR:[fault reason 06] PTE Read access is not set

PCI Address points to: (lspci)
------------------------------
01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Management Processor Support and Messaging (rev 05)
        Subsystem: Hewlett-Packard Company iLO4
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 7
        Region 0: I/O ports at 3800 [size=256]
        Region 1: Memory at f6ff0000 (32-bit, non-prefetchable) [size=256]
        Region 2: Memory at f6e00000 (32-bit, non-prefetchable) [size=1M]
        Region 3: Memory at f6d80000 (32-bit, non-prefetchable) [size=512K]
        Region 4: Memory at f6d70000 (32-bit, non-prefetchable) [size=32K]
        Region 5: Memory at f6d60000 (32-bit, non-prefetchable) [size=32K]
        [virtual] Expansion ROM at f6d00000 [disabled] [size=64K]
        Capabilities: [78] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [b0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [c0] Express (v1) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <4us, L1 <4us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-


The involved kernel module is: (lspci -k -s 01:00.2)
----------------------------------------------------
01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Slave Instrumentation & System Support (rev 05)
        Subsystem: Hewlett-Packard Company iLO4
        Kernel modules: hpilo


=> When "Processor Power and Utilization Monitoring" (BIOS: Service Options) is disabled, the error still occurs, but the kernel no longer freezes.


=> When the IOMMU is disabled in the kernel, the bug disappears.

=> With ILO FW 1.20, the error appears in dmesg, but the ILO health remains green.
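
When comparing results between machines, it helps to know whether a given boot actually had the IOMMU switched off. The report does not say how it was disabled here; the sketch below assumes it was done on the kernel command line with intel_iommu=off or iommu=off, and the function names are hypothetical.

```python
def iommu_disabled_on_cmdline(cmdline):
    """Return True if the command line asks for the IOMMU to be off.

    Assumption: only the boot parameters intel_iommu=off and iommu=off
    are considered; a kernel built without IOMMU support is not detected.
    """
    for token in cmdline.split():
        if token == "intel_iommu=off":
            return True
        # iommu= takes a comma-separated option list, e.g. iommu=off
        if token.startswith("iommu=") and "off" in token[len("iommu="):].split(","):
            return True
    return False


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel."""
    with open(path) as handle:
        return handle.read().strip()


if __name__ == "__main__":
    state = "disabled" if iommu_disabled_on_cmdline(read_cmdline()) else "not disabled"
    print(f"IOMMU {state} on the kernel command line")
```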

Comment 2 Roland Kletzing 2022-01-06 23:36:02 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=73181
