Distribution: RHL9.0
Hardware Environment: S7505VB2, 2GB RAM, SATA, 5 NICs
Software Environment: 2.4.22 kernel with ACPI enabled
Problem Description:

# init 0
provokes an NMI. The system sometimes powers off, and sometimes resets.

...
Turning off swap:
Turning off quotas:
Unmounting file systems:
Halting system...
flushing ide devices: hdc hde
Power down.
Uhhuh. NMI received. Dazed and

Reproducible: yes
Created attachment 860 [details] serial console messages
Reproduced on a 2nd system, this one running BIOS v1.01 and 1 physical processor. Power-off worked, but the NMI is still there.

Turning off swap:
Turning off quotas:
Unmounting file systems:
Halting system...
flushing ide devices: hda hdc
Power down.
Uhhuh. NMI received. Dazed and confused, but trying to continue
e100: config WOL failed
You probably have a hardware problem with your RAM chips
Uhhuh. NMI received for unknown reason 35.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
hwsleep-0257 [16] acpi_enter_sleep_state: Entering sleep state [S5]
No difference with the latest BIOS (1.06) and the kernel simplified to UP with noapic, though UP re-orders the messages and noapic causes the reason code to be 25 rather than 35. The power-down failure has only been seen on the 2 (physical) processor box at this point, but the NMI is seen in all configs.

Power down.
Uhhuh. NMI received. Dazed and confused, but trying to continue
You probably have a hardware problem with your RAM chips
Uhhuh. NMI received for unknown reason 25.
Dazed and confused, but trying to continue
Do you have a strange power saving mode enabled?
e100: config WOL failed
hwsleep-0257 [21] acpi_enter_sleep_state: Entering sleep state [S5]
> e100: config WOL failed

is from e100_do_wol(), called from e100_suspend(). In a box with four pro/100's, I get 4 of these. I configure WOL off by default; perhaps the e100 driver doesn't expect that?

> Uhhuh. NMI received. Dazed and confused, but trying to continue
> You probably have a hardware problem with your RAM chips

is from mem_parity_error(), called from do_nmi() when reason & 0x80.

> Uhhuh. NMI received for unknown reason 25.
> Dazed and confused, but trying to continue
> Do you have a strange power saving mode enabled?

is from unknown_nmi_error(), called from do_nmi() when !(reason & 0xc0). Decimal 25 and 35 are 0x19 and 0x23, neither of which has 0xc0 bits set.
Would you please give the workaround patch at bug 1141 a try? Thanks a lot!
2.6 powers down normally, w/ no NMI message.
I tested 2.6.5/FC2, 2.6.8.1 and 2.6.9.

2.4 powers down normally, but still gets the NMI.
I tested as recently as 2.4.28-rc2.

BIOS is the latest: 1.10, 9/7/04
Created attachment 4032 [details]
patch against 2.4.28

The NMI is provoked by pci_pm_suspend_bus(), called by pci_pm_suspend(), called by acpi_system_save_state(), called by acpi_power_off().

It is unclear why acpi_system_save_state() was added to acpi_power_off() in 2.4.25, but evidently it was not a good idea. This patch removes the offending call.
Shipped in 2.4.28-rc4. Closing.