Hello!

If the lockup detector / NMI watchdog is enabled in the kernel configuration (CONFIG_LOCKUP_DETECTOR=y and CONFIG_HARDLOCKUP_DETECTOR=y) and such a kernel is run on an SMP machine containing multiple Pentium 3 or Pentium 3 XEON CPUs (system examples: IBM Netfinity 5000 (dual P3), IBM Netfinity 7000 M10 (quad P3 XEON)), the system reproducibly locks up hard after an uptime of between 1 and 12 days. In my experience this occurred most often on the dual CPU system, but I have also seen it on the quad CPU machines far more than once. In rare cases the uptime reached about 20 days before the next lockup.

These lockups are extremely hard to diagnose, and their symptoms point misleadingly toward faulty hardware. There is no output at all, even when a serial terminal is used for logging: no "oops" message, no lockup detection message, just a sudden stop of all function.

Thanks to the CPU activity LEDs of the aforementioned IBM machines, it can be seen that, depending on the exact kernel version and also on chance, the lockup shows signs of either a livelock or a deadlock. In the livelock case, at least two activity LEDs show a constantly dimmed light without any modulation (modulation would occur if some regular processing took place). This was the usual behaviour up to and including kernel version 3.3. Later kernels (3.4.2 was my next attempt after several 3.3 versions) more often show "deadlock" behaviour, which can be recognised by a single CPU being active without any of the others interfering.

WORKAROUND: adding "nowatchdog" to the kernel boot parameters, which disables the whole mechanism, resolves the problem permanently and allows "unlimited" uptime again.
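To confirm that the workaround is really in effect on a running system, a small script along the following lines may help. It is only a sketch: the helper names are made up for illustration, and it assumes that "nowatchdog" and "nmi_watchdog=0" are the boot parameters of interest and that the kernel exposes the kernel.nmi_watchdog sysctl under /proc/sys (it may be absent on some kernels, in which case the script prints None).

```python
from pathlib import Path
from typing import List, Optional

# Boot parameters assumed here to switch the lockup detector off.
DISABLING_PARAMS = {"nowatchdog", "nmi_watchdog=0"}


def disabling_params(cmdline: str) -> List[str]:
    """Return the watchdog-disabling parameters found in a kernel command line."""
    return [tok for tok in cmdline.split() if tok in DISABLING_PARAMS]


def read_optional(path: str) -> Optional[str]:
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_optional("/proc/cmdline") or ""
    print("disabling parameters:", disabling_params(cmdline) or "none")
    # 0 means the NMI watchdog is off, 1 means it is on.
    print("kernel.nmi_watchdog:", read_optional("/proc/sys/kernel/nmi_watchdog"))
```

Run on an affected machine after rebooting with the workaround, the first line should list "nowatchdog".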
ADDITIONAL DESCRIPTION: As I am using Debian, I have already reported the bug on the Debian bug reporting platform: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=678443

DEBIAN-SPECIFIC: The problem might already exist in kernel versions prior to 2.6.38, but this was the version in which the Debian maintainers decided to enable the watchdog mechanism, and I used pre-packaged kernels up to version 3.3. Currently I am running kernel version 3.5.6, but after having discovered the workaround while using 3.4.2, I have not re-checked whether the bug is still present, as I need the affected systems to run reliably.

As a basic set of information I will attach a typical boot-up log and the interrupt configuration of a dual CPU system. It seems that only one attachment can be added as a file, so I will paste the interrupt configuration directly below my message. Boot-up log and kernel configuration are attached.

Please feel free to ask me for any further information which might be helpful - I will try to gather and publish it.

Thanks and best regards,
Hans-Juergen Mauser

/proc/interrupts:

            CPU0       CPU1
   0:         49          0   IO-APIC-edge      timer
   1:          2          1   IO-APIC-edge      i8042
   6:          2          1   IO-APIC-edge      floppy
   7:          1          0   IO-APIC-edge      parport0
   8:          0          0   IO-APIC-edge      rtc0
   9:          0          0   IO-APIC-fasteoi   acpi
  12:          2          2   IO-APIC-edge      i8042
  14:    2415400    2473360   IO-APIC-edge      ata_generic
  15:          0          0   IO-APIC-edge      ata_generic
  16:         46         51   IO-APIC-fasteoi   aic7xxx, aic7xxx
  17:   21960782   21940892   IO-APIC-fasteoi   eth0
  18:     775413     780536   IO-APIC-fasteoi   megaraid, ohci_hcd:usb2
  19:    8940622    8953436   IO-APIC-fasteoi   eth1
  22:   13807027   13831018   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
 NMI:          1          1   Non-maskable interrupts
 LOC:   72738388   80818104   Local timer interrupts
 SPU:          0          0   Spurious interrupts
 PMI:          0          0   Performance monitoring interrupts
 IWI:          0          0   IRQ work interrupts
 RTR:          2          0   APIC ICR read retries
 RES:    2154053    2107264   Rescheduling interrupts
 CAL:     382769     454405   Function call interrupts
 TLB:     207689     195211   TLB shootdowns
 TRM:          0          0   Thermal event interrupts
 THR:          0          0   Threshold APIC interrupts
 MCE:          0          0   Machine check exceptions
 MCP:       3148       3148   Machine check polls
 ERR:          0
 MIS:          0
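For anyone who wants to compare NMI counters between boots with and without the watchdog, the per-CPU values can be pulled out of /proc/interrupts with a few lines of Python. This is a rough sketch only; the function name per_cpu_counts is invented for illustration, and the assumption that the NMI row is the relevant indicator for the hard lockup detector is mine and has not been verified on these machines.

```python
from typing import List, Optional


def per_cpu_counts(interrupts_text: str, label: str) -> Optional[List[int]]:
    """Return the per-CPU counters of one /proc/interrupts row, or None if absent."""
    lines = interrupts_text.splitlines()
    if not lines:
        return None
    # The header row lists one column per CPU (CPU0, CPU1, ...).
    n_cpus = len(lines[0].split())
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        if name.strip() == label:
            # Rows such as ERR and MIS carry fewer counters than there are CPUs.
            return [int(field) for field in rest.split()[:n_cpus]]
    return None


if __name__ == "__main__":
    with open("/proc/interrupts") as handle:
        print("NMI per CPU:", per_cpu_counts(handle.read(), "NMI"))
```

Sampling the NMI row twice, some minutes apart, shows whether the counters are still increasing.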
Created attachment 90611: dmesg boot-up log from an IBM Netfinity 5000
Created attachment 90621: Current kernel configuration for 3.5.6, only mandatory things changed since 3.4.2