Bug 52431

Summary: NMI watchdog / lockup detector causes hard lockups on Dual/Quad Pentium 3 / P3 Xeon systems
Product: Platform Specific/Hardware
Reporter: Hans-Juergen Mauser (hjmauser)
Component: i386
Assignee: platform_i386
Status: NEW
Severity: normal
CC: hjmauser
Priority: P1
Hardware: All
OS: Linux
Kernel Version: all from 2.6.38 up to now (at least to 3.4.2)
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg boot-up log from an IBM Netfinity 5000
Current kernel configuration for 3.5.6, only mandatory things changed since 3.4.2

Description Hans-Juergen Mauser 2013-01-07 21:17:57 UTC
Hello!

If the lockup detector / NMI watchdog is enabled in the kernel configuration (CONFIG_LOCKUP_DETECTOR=y and CONFIG_HARDLOCKUP_DETECTOR=y) and such a kernel is run on an SMP machine containing multiple Pentium 3 or Pentium 3 XEON CPUs (system examples: IBM Netfinity 5000 (dual P3), IBM Netfinity 7000 M10 (quad P3 XEON)), the kernel will reproducibly cause hard lockups after a system uptime of between 1 and 12 days. In my experience it occurred most often on the dual CPU system, but it could also be seen on the quad CPU machines a lot more than just once. In rare cases the uptime could be extended to about 20 days before the next lockup was to be expected.
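
To check whether a given kernel was built with both options, the configuration file can be scanned for them. The following is only a minimal sketch: the function names and the sample text are made up for illustration, and the location of the config file (on Debian usually /boot/config-<version>) is an assumption about the installation.

```python
import re

def enabled_options(config_text):
    """Return the set of CONFIG_ options that are set to 'y' in a kernel config text."""
    enabled = set()
    for line in config_text.splitlines():
        m = re.match(r"^(CONFIG_[A-Za-z0-9_]+)=y$", line.strip())
        if m:
            enabled.add(m.group(1))
    return enabled

def watchdog_built_in(config_text):
    """True if both lockup detector options named in this report are enabled."""
    needed = {"CONFIG_LOCKUP_DETECTOR", "CONFIG_HARDLOCKUP_DETECTOR"}
    return needed <= enabled_options(config_text)

# Hypothetical excerpt, not taken from the attached configuration.
sample = """\
CONFIG_SMP=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_DETECT_HUNG_TASK is not set
"""

print(watchdog_built_in(sample))
```

On a real system the text would be read from the config file of the running kernel instead of the sample string.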

These lockups are extremely hard to diagnose, and their symptoms point misleadingly towards faulty hardware. There is not even any output if a serial terminal is used for logging: no "oops" message, no lockup detection message, just a sudden stop of all functions.

Thanks to the CPU activity LEDs of the aforementioned IBM machines, it can be seen that, depending on the exact kernel version and also on chance, the lockup shows signs of either a livelock or a deadlock. In case of the livelock, at least two activity LEDs show a constantly dimmed light without any modulation (which would occur if some regular processing took place). This was the regular behaviour up to and including kernel version 3.3 - later kernels (3.4.2 was my next attempt after several 3.3 versions) more often show "deadlock" behaviour, which can be seen as a single active CPU without any activity from the others.


WORKAROUND: adding "nowatchdog" to the kernel boot parameters, thus disabling the whole mechanism, resolves the problem permanently and allows "unlimited" uptime again.
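
To verify that the parameter actually reached the running kernel, the command line in /proc/cmdline can be inspected. This is just a sketch with made-up helper names and a hypothetical sample command line; how the parameter gets added (boot loader configuration) depends on the installation and is not shown here.

```python
def has_nowatchdog(cmdline):
    """True if 'nowatchdog' appears as a separate parameter on the kernel command line."""
    return "nowatchdog" in cmdline.split()

def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()

# Hypothetical command line, for illustration only.
sample_cmdline = "BOOT_IMAGE=/vmlinuz-3.5.6 root=/dev/sda1 ro nowatchdog"

print(has_nowatchdog(sample_cmdline))
```

On an affected machine, has_nowatchdog(read_cmdline()) would be the call to use after a reboot.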


ADDITIONAL DESCRIPTION: As I am using Debian, I have already published the bug on the Debian bug reporting platform:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=678443


DEBIAN-SPECIFIC: The problem might already exist in kernel versions prior to 2.6.38, but that was the version in which the Debian maintainers decided to enable the watchdog mechanism - and I used pre-packaged kernels up to version 3.3.

Currently I am running kernel version 3.5.6, but after having discovered the workaround while using 3.4.2, I have not re-checked the presence of the bug as I need the affected systems to run reliably.

As a basic set of information I will attach a typical boot-up log and the interrupt configuration of a dual CPU system.

It seems that only one attachment can be added as a file, so I will paste the interrupt configuration directly below my message. Boot-up log and 

Please feel free to ask me for any further information which might be helpful - I will try to gather and publish it.


Thanks and best regards,

Hans-Juergen Mauser




/proc/interrupts:

           CPU0       CPU1       
  0:         49          0   IO-APIC-edge      timer
  1:          2          1   IO-APIC-edge      i8042
  6:          2          1   IO-APIC-edge      floppy
  7:          1          0   IO-APIC-edge      parport0
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          2          2   IO-APIC-edge      i8042
 14:    2415400    2473360   IO-APIC-edge      ata_generic
 15:          0          0   IO-APIC-edge      ata_generic
 16:         46         51   IO-APIC-fasteoi   aic7xxx, aic7xxx
 17:   21960782   21940892   IO-APIC-fasteoi   eth0
 18:     775413     780536   IO-APIC-fasteoi   megaraid, ohci_hcd:usb2
 19:    8940622    8953436   IO-APIC-fasteoi   eth1
 22:   13807027   13831018   IO-APIC-fasteoi   ehci_hcd:usb1, ohci_hcd:usb3, ohci_hcd:usb4, eth2, eth3
NMI:          1          1   Non-maskable interrupts
LOC:   72738388   80818104   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RTR:          2          0   APIC ICR read retries
RES:    2154053    2107264   Rescheduling interrupts
CAL:     382769     454405   Function call interrupts
TLB:     207689     195211   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:       3148       3148   Machine check polls
ERR:          0
MIS:          0
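
To watch the NMI and PMI counters over time, the table above can be parsed per CPU. The following is only a rough sketch: the function name is made up, and the sample is a shortened copy of the listing above. It assumes the first line names the CPUs and that every other line starts with a label followed by a colon.

```python
def parse_interrupts(text):
    """Map each row label of /proc/interrupts to its list of per-CPU counts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())          # header line: CPU0 CPU1 ...
    counts = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        values = []
        for tok in rest.split()[:ncpu]:
            if not tok.isdigit():         # reached the description text
                break
            values.append(int(tok))
        counts[label.strip()] = values
    return counts

# Shortened copy of the listing above.
sample = """\
           CPU0       CPU1
NMI:          1          1   Non-maskable interrupts
LOC:   72738388   80818104   Local timer interrupts
PMI:          0          0   Performance monitoring interrupts
ERR:          0
"""

counts = parse_interrupts(sample)
print(counts["NMI"], counts["PMI"])
```

On a live system the text would come from reading /proc/interrupts, and two snapshots taken some time apart could be compared.
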
Comment 1 Hans-Juergen Mauser 2013-01-07 21:19:15 UTC
Created attachment 90611
dmesg boot-up log from an IBM Netfinity 5000
Comment 2 Hans-Juergen Mauser 2013-01-07 21:21:36 UTC
Created attachment 90621
Current kernel configuration for 3.5.6, only mandatory things changed since 3.4.2