Bug 43205 - processor cache hardware errors every 300s
Summary: processor cache hardware errors every 300s
Status: CLOSED INVALID
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-04 18:53 UTC by Charles Lindsay
Modified: 2014-07-16 12:27 UTC (History)
3 users (show)

See Also:
Kernel Version: 3.4.0, 3.2.0
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
post-boot dmesg output (34.34 KB, text/plain)
2012-05-04 18:53 UTC, Charles Lindsay
Details
lspci output (21.12 KB, text/plain)
2012-05-04 18:54 UTC, Charles Lindsay
Details
cat /proc/cpuinfo (3.47 KB, text/plain)
2012-05-04 18:54 UTC, Charles Lindsay
Details
cat /proc/interrupts (2.64 KB, text/plain)
2012-05-04 18:55 UTC, Charles Lindsay
Details

Description Charles Lindsay 2012-05-04 18:53:29 UTC
Created attachment 73180 [details]
post-boot dmesg output

I recently installed Ubuntu 12.04 on my machine that was previously running 10.04.  With the new install, I get hardware errors in dmesg every 300s on the dot:

[ 300.804131] [Hardware Error]: CPU:1 MC0_STATUS[-|CE|-|-|AddrV|CECC]: 0x9467400000000136
[ 300.804147] [Hardware Error]: MC0_ADDR: 0x00000003f2f7c5c0
[ 300.804152] [Hardware Error]: Data Cache Error: during L1 linefill from L2.
[ 300.804160] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
[ 300.804171] [Hardware Error]: CPU:1 MC1_STATUS[Over|CE|-|-|-]: 0xd000000000000171
[ 300.804178] [Hardware Error]: Instruction Cache Error: Copyback Parity/Victim error.
[ 300.804184] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: EV
[ 300.804194] [Hardware Error]: CPU:1 MC2_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040000000018a
[ 300.804202] [Hardware Error]: MC2_ADDR: 0x00000003e7d745c0
[ 300.804207] [Hardware Error]: Bus Unit Error: SNP error during data copyback.
[ 300.804213] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP

Same again at 600.804..., 900.804..., etc. for as long as my computer is on.

I'm running Ubuntu kernel linux-image-3.2.0-24-generic 3.2.0-24.37, and I've confirmed it still happens running from a deb of linux-image-3.4.0-030400rc5-generic_3.4.0-030400rc5.201205011817_amd64 as well.

There's a downstream launchpad bug report here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/993758

I'll attach some relevant info from that bug report.
Comment 1 Charles Lindsay 2012-05-04 18:54:04 UTC
Created attachment 73181 [details]
lspci output
Comment 2 Charles Lindsay 2012-05-04 18:54:27 UTC
Created attachment 73182 [details]
cat /proc/cpuinfo
Comment 3 Charles Lindsay 2012-05-04 18:55:15 UTC
Created attachment 73183 [details]
cat /proc/interrupts
Comment 4 Alan 2012-05-12 00:26:04 UTC
Note the words "Hardware Error"

These appear to be logged machine checks. It may be the older Ubuntu did't have the support for monitoring the CPU.

Check with your vendor/board manufacturer for diagnostic tools (and check the CPU fan)
Comment 5 Charles Lindsay 2012-05-12 22:18:33 UTC
I'm assuming it's an issue somewhere in the hardware driver stack or maybe with the software doing the reporting, for two reasons:

1) The major kernel/system software change.  For a while I was dual-booting Ubuntu 10.04 (kernel 2.6.something) and 12.04 (as seen above), and the error occurred only in 12.04.  This points to the software changing.  It didn't occur to me that the older kernel may not have had the right monitoring support, but if that's possible I'll take your word for it.

2) The occurrence every 5 minutes plus or minus a few milliseconds seems:
 a) Too far apart for an issue with the processor/motherboard alone.  I'd expect more frequent errors if it was indeed a processor/motherboard issue, but I'm not a hardware expert.
 b) Too regular to be a true hardware issue.  In my experience, if hardware is failing, it doesn't obey regular patterns.
 c) Too imprecise for a processor issue.  It's off by several milliseconds each time, which is years off by processor standards.

It seems far more likely to me some routine check in the kernel is turning up a spurious issue, or some daemon is triggering something bogus in the kernel and it's showing up as errors.

I have a limited knowledge of the hardware and kernel/system stack, though, so if those assumptions are invalid, I'll defer to your wisdom.
Comment 6 Alan 2012-05-12 22:35:43 UTC
Tthe hardware is reporting and logging it as a CPU problem. The machine check is triggered by the processor (not software) and the values being logged are consistent and sensible decodes of error logging registers in the processor.

The every five minutes is because mcelog polls for events every five minutes and reports those that have occurred in that period.

The log also consistently shows it being the same CPU and the same patterns. So I think you need to contact your vendor and get CPU 1 swapped out.
Comment 7 Charles Lindsay 2012-05-14 07:50:56 UTC
Ah, all right.  Part of my logic was also that I haven't noticed any oddities like freezes or crashes or kernel panics that I would expect from a faulty processor.

I'll take your word that it's not a bug, but in fact a faulty CPU core.  Anyway, thanks!
Comment 8 Roger Wolff 2012-05-24 08:02:38 UTC
Alan, 

I can't find a system-installed mcelog application. From the start of my dmesg, these exceptions are happening every 300 seconds. So with 99% certainty I can say more than one is happening every 5 minutes. I'd like to increase the logging frequency to see how often this happens. 

My CPU vendor will say: "we tested it and it works just fine". Which is more or less what I can say. No unexpected crashes etc etc. What IS wrong here? Can you explain in English what went wrong?

For reference I get: 

[145707.804050] [Hardware Error]: MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd400400020000136
[145707.805918] [Hardware Error]: Data Cache Error: during L1 linefill from L2.
[145707.807750] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
[145707.809615] [Hardware Error]: MC2_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040000000018a
[145707.811475] [Hardware Error]: Bus Unit Error: SNP error during data copyback.
[145707.813325] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP

[update] OK. I got mcelog installed, and it prints a bit more infrmation. L2 cache ECC error. bit46 corrected ecc. 

Just checked the MCELOG sources. "bit46" is not that it ecc-corrected bit46, but bit46 means "corrected ecc". I'm also getting "multiple errors" (bit62). 

So my L2 cache is correcting errors at once every two seconds. I'll try and get it exchanged.
Comment 9 Roger Wolff 2012-05-24 08:08:18 UTC
Just let me add that when a non-expert goes into a computer-shop with the story: 
"It works just fine, but Linux says that my hardware is broken, can I have a new CPU please? "

they will get laughed out of the shop. More detailed error messages increase the chances of acceptance. If the kernel had printed: "ECC event in L2 cache: recovered data", that'd been the end of it: I can intepret that, I can tell that to my shop.
Comment 10 Alan 2012-05-24 11:13:57 UTC
Bugzilla is for dealing with kernel problems not poor quality shops I'm afraid. If you have a problem with the shop I would contact your CPU suppliers support forums and ask there, then print out the response from the CPU vendor and take it into the shop.

Basically your processor is logging problems (currently ones it manged to fix up) in the cache.
Comment 11 Roger Wolff 2012-05-24 12:16:12 UTC
The BUG/ enhancement-request that comes out of this is: 

- Improve readability of the messages that the kernel prints. 
- This includes printing that the MCE occurred up to 300 seconds before the messages are printed. 

You know I'm not a newbie. It took me several hours to figure out that mcelog prints more readable messages. The usability of the kernel needs to be improved here. 

If you're afraid of the huge amount of messages that this will pull into the kernel, those could be included in a module. Either pull in the module when the CPU model has been determined, or pull it in when the first message needs to be printed. (Print the binary MCE registers, then attempt to pull in the module for the user readable message).
Comment 12 Alan 2012-05-24 12:33:28 UTC
The kernel MCE code is intentionally very simple because it is very hard to test and it is critical that the actual data gets out because in many cases you won't just recover and continue. It's not the kernel's fault thar your distro doesn't ship with MCE logging enabled. Thats a distro bug.

It's also btw quite complicated to interpret them on some processors because there a family and machine specifc bits involved.

As an aside there is a project trying to tidy up the MCE logging, EDAC logging etc for the kernel.  Talk to mchehab@redhat.com about that might be the best for the future.
Comment 13 caesar 2014-07-16 12:27:25 UTC
This became a real problem in Kernel 3.15 see this bug: https://bugzilla.kernel.org/show_bug.cgi?id=78781
My CPU is unusable with kernels > 3.14 So I guess tehre was some change in the MCE Code

Note You need to log in before you can comment on or make changes to this bug.