Bug 40112
Summary: | Kernel hangs in timer.c | | |
---|---|---|---|
Product: | Timers | Reporter: | Rob de Wit (rdewit) |
Component: | Other | Assignee: | john stultz (john.stultz) |
Status: | CLOSED INVALID | | |
Severity: | normal | CC: | andi-bz, rdewit |
Priority: | P1 | | |
Hardware: | All | | |
OS: | Linux | | |
Kernel Version: | 2.6.39.2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Kernel config 2.6.39.2, Kernel config 2.6.37.6 | | |
Description
Rob de Wit
2011-07-26 08:40:46 UTC
Created attachment 66652
Kernel config 2.6.39.2
Created attachment 66662
Kernel config 2.6.37.6
*** Bug 38372 has been marked as a duplicate of this bug. ***

From the logs, you can see the tainted flags are set: "Tainted: G M"

This means the system experienced a machine check, which means a likely hardware issue.

You might check your /var/log/messages or dmesg output to see if there are any indications of what hardware has been having issues.

Hi John,

Thanks for looking into this.

Indeed, one of the systems has ATA errors prior to the crash in its log. Could that be the cause of kernel dump 2 above?

Furthermore I noticed that while booting the following messages are logged:

Booting Node 0, Processors #1 #2 #3 #4
[Hardware Error]: No human readable MCE decoding support on this CPU type.
[Hardware Error]: Run the message through 'mcelog --ascii' to decode.
Disabling lock debugging due to kernel taint
#5 #6 #7 Ok.
Brought up 8 CPUs
Total of 8 processors activated (36267.21 BogoMIPS).

What message should I run through mcelog?

Is this also a cause of the kernel being tainted, because I have not seen any other hardware issues on the other systems?

I would be happy to supply more info if needed.

(In reply to comment #6)
> Hi John,
>
> Thanks for looking into this.
>
> Indeed, one of the systems has ATA errors prior to the crash in its log.
> Could that be the cause of kernel dump 2 above?

Well, when hardware acts up it can manifest in strange ways. The fact that /sys/block/sdd/queue/scheduler is in all the dumps also aligns with the (S?)ATA errors. However, it could be something else as well.

> What message should I run through mcelog?

Honestly, I'm not sure. I'm not very familiar with the mce framework. CC'ing Andi to see if he has any thoughts.

> Is this also a cause of the kernel being tainted, because I have not seen any
> other hardware issues on the other systems?

So, looking back over the original report you're seeing this on 3 different systems? All dual-socket quad cores?
Since you're getting the [Hardware Error] message when initializing the cpus, right between sockets, I'm curious if your cpus are mis-matched? You are using identical processors in both sockets, right?

We experience these problems with four identical systems. All of them are dual quad-core Xeon E5520s with 6x8GB+4x4GB RAM.

The SATA errors are present on only one of the systems.

The cpus are identical, but the DIMMs might not be properly populated across the CPU channels. As these machines are in a remote datacenter I cannot check this easily. Is it possible that the machine check exception is raised when DIMMs are not balanced? e.g.:

P1_1A=4GB
P1_1B=8GB
P2_1A=8GB
P2_1B=4GB

Mysteriously, we also have the mcelog entry on a fifth system, identical to the other four but with only 8x2GB and no SSD drive, and that system has not shown any instability issues. It's under a different load but is still quite heavily used.

Machine checks are usually not software or kernel problems. You have to talk to whoever sold you the system. This bugzilla is likely the wrong place.

Per Andi's comment, I'm going to close this as invalid. Please re-open if there's any data pointing to a kernel issue instead of a hardware problem.
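
For anyone landing here with a similar "Tainted: G M" line: the same flags are exposed as a bitmask in /proc/sys/kernel/tainted, and the "M" flag discussed above corresponds to bit 4 (machine check). The sketch below decodes such a value in Python. The helper names (`decode_taint`, `has_machine_check`, `read_taint`) are made up for illustration, and the table lists only the lower bits; newer kernels define more, so treat it as a rough aid rather than a complete reference.

```python
# Minimal sketch: decode a kernel taint bitmask into its flag letters.
# Only bits 0-11 are listed here (an assumption about what is relevant
# for kernels of this era); later kernels define additional bits.
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    1: ("F", "module was force-loaded"),
    2: ("S", "SMP kernel on hardware not designed for it / out-of-spec system"),
    3: ("R", "module was force-unloaded"),
    4: ("M", "machine check exception occurred"),
    5: ("B", "bad page referenced"),
    6: ("U", "taint requested by userspace"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    8: ("A", "ACPI table overridden"),
    9: ("W", "kernel issued a warning"),
    10: ("C", "staging driver was loaded"),
    11: ("I", "workaround for a platform firmware bug applied"),
}


def decode_taint(value):
    """Return (letter, description) pairs for every known bit set in value."""
    return [
        flag
        for bit, flag in sorted(TAINT_FLAGS.items())
        if value & (1 << bit)
    ]


def has_machine_check(value):
    """True if the machine-check taint bit (bit 4, letter 'M') is set."""
    return bool(value & (1 << 4))


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint value from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

On an affected machine, `has_machine_check(read_taint())` returning `True` would match the "M" seen in the oops output; a value of 0 means the kernel is not tainted at all.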