I found this bug in the IPMI driver, but after root-causing it I think it is an ACPI-related bug, so I am posting it here. Apologies in advance if this is not the proper place for it.

Distribution: Red Hat Enterprise Linux AS release 4
Hardware Environment: HP cx2600
Software Environment:
Problem Description:
Loading the ipmi_si driver fails with an oops. More details can be found in PS2.

After debugging the driver, I found that the oops is caused by an invalid address. In the IPMI driver, try_init_acpi calls acpi_get_firmware_table to get the SPMI table. That table contains a memory address, spmi->addr.address, and it is invalid. It seems ACPI maps the wrong memory and hands it to IPMI, even though the boot log shows that ACPI was enabled properly. I think in this case acpi_get_firmware_table should return an error code instead of AE_OK, given what we then see in the IPMI driver (see PS1).

Steps to reproduce:
modprobe ipmi_msghandler
modprobe ipmi_si kcs_addrs=0xff5b0ca2

PS1:
Unable to handle kernel paging request at virtual address 00000000ff5b0caa
insmod[3942]: Oops 8813272891392 [1]
Modules linked in: ipmi_si(U) ipmi_msghandler(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core vfat fat dm_mod button joydev ohci_hcd ehci_hcd e100 mii tg3 ext3 jbd mptscsih mptbase sd_mod scsi_mod
Pid: 3942, CPU 1, comm: insmod
psr : 0000101008126030 ifs : 8000000000000693 ip : [<a00000020048a110>] Not tainted
ip is at init_one_smi+0x19b0/0x2760 [ipmi_si]
unat: 0000000000000000 pfs : 0000000000000693 rsc : 0000000000000003
rnat: 0000000000000060 bsps: 0000000000000040 pr : 0000000005559a59
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a00000020048a0b0 b6 : a00000010022cba0 b7 : a000000100237780
f6 : 000000000000000000000 f7 : 0ffe1f000000000000000
f8 : 10005f000000000000000 f9 : 100028000000000000000
f10 : 10002effffffff1000000 f11 : 1003e000000000000000f
r1 : a000000200684000 r2 : e00000000487c384 r3 : e00000003fb20688
r8 : 0000000000000000 r9 : a0007ffffff28ad8 r10 : a0007ffffff28ab0
r11 : 0000000000055156 r12 : e000000032da7e00 r13 : e000000032da0000
r14 : e00000000487c5c0 r15 : 0000000000000078 r16 : 0000000000000230
r17 : 0000000000000047 r18 : e00000000487c390 r19 : a0007fffffc80000
r20 : 00000000000613d0 r21 : 000000000000c27a r22 : e000000032da0dd4
r23 : 00000000ff5b0caa r24 : 00000000ff5b0ca2 r25 : 0000000000000000
r26 : 00000000ff5b0ca2 r27 : 0000000000000000 r28 : e00000003fb344a4
r29 : e00000003fb344a8 r30 : e00000003fb34470 r31 : a000000200491cf0

Call Trace:
[<a000000100016a40>] show_stack+0x80/0xa0 sp=e000000032da79b0 bsp=e000000032da1098
[<a000000100017350>] show_regs+0x890/0x8c0 sp=e000000032da7b80 bsp=e000000032da1050
[<a00000010003c970>] die+0x150/0x240 sp=e000000032da7ba0 bsp=e000000032da1010
[<a00000010005d870>] ia64_do_page_fault+0x9f0/0xba0 sp=e000000032da7ba0 bsp=e000000032da0fa0
[<a00000010000f480>] ia64_leave_kernel+0x0/0x260 sp=e000000032da7c30 bsp=e000000032da0fa0
[<a00000020048a110>] init_one_smi+0x19b0/0x2760 [ipmi_si] sp=e000000032da7e00 bsp=e000000032da0f08
[<a000000200338290>] init_ipmi_si+0x230/0x560 [ipmi_si] sp=e000000032da7e30 bsp=e000000032da0ed0
[<a0000001000af320>] sys_init_module+0x420/0x620 sp=e000000032da7e30 bsp=e000000032da0e60
[<a00000010000f320>] ia64_ret_from_syscall+0x0/0x20 sp=e000000032da7e30 bsp=e000000032da0e60
[<a000000000010640>] 0xa000000000010640 sp=e000000032da8000 bsp=e000000032da0e60

PS2: (from dmesg)
Linux version 2.6.9-5.EL (bhcompile@bullwinkle.build.redhat.com) (gcc version 3.4.3 20041212 (Red Hat 3.4.3-9.EL4)) #1 SMP Wed Jan 5 19:23:24 EST 2005
EFI v1.10 by HP: SALsystab=0x3fb38000 ACPI 2.0=0x3fb2c000 SMBIOS=0x3fb3a000 HCDP=0x3fb2a000
booting generic kernel on platform hpzx1
ACPI: RSDP (v002 HP ) @ 0x000000003fb2c000
ACPI: XSDT (v001 HP cx2600 0x00000000 HP 0x00000000) @ 0x000000003fb2c02c
ACPI: FADT (v003 HP cx2600 0x00000000 HP 0x00000000) @ 0x000000003fb342b0
ACPI: SPCR (v001 HP cx2600 0x00000000 HP 0x00000000) @ 0x000000003fb343e8
ACPI: DBGP (v001 HP cx2600 0x00000000 HP 0x00000000) @ 0x000000003fb34438
ACPI: MADT (v001 HP cx2600 0x00000000 HP 0x00000000) @ 0x000000003fb34510
ACPI: SPMI (v004 HP cx2600 0x00000000 HP 0x00000000) @ 0x000000003fb34470
ACPI: CPEP (v001 HP cx2600 0x00000000 HP 0x00000000) @ 0x000000003fb344c0
ACPI: SSDT (v001 HP cx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb31940
ACPI: SSDT (v001 HP cx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb31ad0
ACPI: SSDT (v001 HP cx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb31dd0
ACPI: SSDT (v001 HP cx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb32640
ACPI: SSDT (v001 HP cx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb32eb0
ACPI: SSDT (v001 HP cx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb33720
ACPI: SSDT (v001 HP cx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb33f90
ACPI: SSDT (v001 HP cx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb34120
ACPI: SSDT (v001 HP cx2600 0x00000006 INTL 0x02012044) @ 0x000000003fb341e0
ACPI: DSDT (v001 HP cx2600 0x00000007 INTL 0x02012044) @ 0x0000000000000000
Warning: acpi_table_parse(ACPI_SRAT) returned 0!
Warning: acpi_table_parse(ACPI_SLIT) returned 0!

Thanx.
Did you give correct flags for acpi_get_firmware_table?
The oops address '00000000ff5b0caa' looks like the address passed in 'modprobe ipmi_si kcs_addrs=0xff5b0ca2'. What is kcs_addrs?
The ACPI function is called as follows:

status = acpi_get_firmware_table("SPMI", intf_num+1, ACPI_LOGICAL_ADDRESSING, (struct acpi_table_header **) &spmi);

As for the second question, kcs_addrs is the memory address the driver will use, but it is mapped by the ACPI function. If we don't pass the address to the ipmi_si driver, the driver finds the same address, 0xff5b0ca2, in the ACPI SPMI table.

Thanx.
Aaron
Created attachment 4736
debug patch

Hi, it seems the IPMI table is invalid: the length of the returned address is 0. After adding a check for this (see the driver), my test no longer oopses. But I'm not sure I have found the real cause (maybe ACPI gives a wrong table, but I checked the other fields and they are OK). Could you please tell me whether the system worked before?
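For readers without the attachment, a check of the kind described above might look roughly like this. It is only a sketch under stated assumptions: the struct and the function name spmi_addr_usable are hypothetical stand-ins invented for illustration, not the kernel's definitions and not the contents of the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the address block of an SPMI table.
 * Field names are illustrative, not the kernel's definitions. */
struct spmi_addr_sketch {
    uint8_t  register_bit_width; /* reported as 0 on the affected machine */
    uint64_t address;            /* e.g. 0xff5b0ca2 in this report */
};

/* Return true only if the reported register width is non-zero.
 * A caller would skip the interface instead of touching the address. */
static bool spmi_addr_usable(const struct spmi_addr_sketch *addr)
{
    assert(addr != NULL); /* a NULL pointer is a caller bug, not a firmware problem */
    return addr->register_bit_width != 0;
}
```

With a check like this, a table reporting width 0 is rejected before the driver dereferences the address, which would avoid the oops but also leaves the interface unused.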
You are right that the register width is 0, but the other fields seem right. It is a little strange! Zhengyu said that another HP machine used to work, but it is now down with a hardware problem. Two other HP machines hit the same bug now.

Thanx.
Aaron
Is this machine in production? Is it running the latest BIOS? Does Windows boot on it?
Yes, it is in production and you can find it on HP's website. As for the BIOS, I am not sure it is the latest one; I will check soon. Finally, we have not tried to install Windows on it, so we don't know whether it works.

Thanx.
Aaron
The patch is on the base kernel. If this isn't a BIOS bug (e.g., it is confirmed to work on an earlier kernel version, or the vendor says it is not a BIOS bug), please reopen it.
It is actually a bug in the IPMI driver. The root cause is that the IPMI driver treats a regsize value of 0 as invalid and does not allocate memory for it. After treating a regsize of 0 as the default regsize, 1, the problem disappears.

Thanx.
Aaron
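The fallback described above can be sketched as follows. This is a minimal illustration, not the actual driver change: the struct, DEFAULT_REGSIZE and spmi_regsize are hypothetical names, and it assumes the firmware field is a width in bits while regsize is counted in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the address block of an SPMI table.
 * Field names are illustrative, not the kernel's definitions. */
struct spmi_addr_sketch {
    uint8_t  register_bit_width; /* 0 on the affected firmware */
    uint64_t address;
};

/* Assumed default register size in bytes, matching the "1" named above. */
#define DEFAULT_REGSIZE 1u

/* Return the register size in bytes. A reported width of 0 is treated
 * as "use the default" instead of as an invalid value. */
static unsigned int spmi_regsize(const struct spmi_addr_sketch *addr)
{
    assert(addr != NULL); /* caller bug, not a firmware problem */
    if (addr->register_bit_width == 0)
        return DEFAULT_REGSIZE;
    /* Round up so a width below 8 bits still yields at least one byte. */
    return (addr->register_bit_width + 7u) / 8u;
}
```

Compared with rejecting the table outright, this keeps the interface usable on firmware that leaves the width at 0, at the cost of trusting the rest of the table.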