Bug 14204
| Summary: | MCE prevent booting on my computer(pentium iii @500Mhz) | | |
|---|---|---|---|
| Product: | Platform Specific/Hardware | Reporter: | GNUtoo (GNUtoo) |
| Component: | i386 | Assignee: | platform_i386 |
| Status: | CLOSED UNREPRODUCIBLE | | |
| Severity: | normal | CC: | andi-bz, florian, hpa, rjw |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.31 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 13615 | | |
| Attachments: | mce debug patch, dmesg | | |
Description
GNUtoo
2009-09-21 20:36:45 UTC
Disabling all MCE options makes it boot...

Need more information. What kind of hardware is that? Please post /proc/cpuinfo.

Did 2.6.30/32bit or an earlier kernel with MCE enabled work? Please attach the boot log from such a successful boot. Thanks.

Thanks for your report. Could you please try the latest -tip tree:

http://people.redhat.com/mingo/tip.git/README

with the MCE options enabled? It should not crash anymore due to this robustness fix:

11868a2: x86: mce: Use safer ways to access MCE registers

If the patch works as intended then you should be getting a warning message during bootup - please post that bootlog.

Thanks,
Ingo

Ingo, as I explained several times rdmsrl_safe() is very unlikely to be the correct fix. If the MSR code reads MSRs that are not there, something is going wrong in the bank or feature discovery and we need to root cause what it is, not hack around it at the wrong level. I think if you spend 5 minutes reading the respective chapters in the architecture manual you will get to the same conclusion.

The fault seems to happen on MSR 407, which is the MISC register in bank 2. Bank 2 should really be there, but it might have no MISC register, which is optional (iirc P3 really didn't have a MISC register). A valid MISC register is normally signalled by the MISCV bit being set, and that bit might be set incorrectly here.

BTW I think this can also only happen if there is a machine check logged at time of boot. If it happens on every boot (does it?) then most likely the BIOS leaves junk in the machine check registers, including one with the MISCV bit set, but MISC isn't there.

So in this case the correct fix would be to add that system to the "don't log boot mces" blacklist. To add it I would need the cpuinfo output. It would also be interesting to see what the status register says.

Created attachment 23153 [details]
mce debug patch
Here's a debug patch that prints all the registers on a MISC dump.
Please apply it to 2.6.31 (reenabling mces in the config) and add the boot
log. It should not crash. Thanks.
Also another workaround would be to boot with mce=nobootlog
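
The register arithmetic discussed above can be sketched as follows. This is a minimal illustration, not kernel code: it assumes the architectural x86 machine-check layout (four consecutive MSRs per bank starting at 0x400) and that the "407" above is hexadecimal. The helper names `bank_msr`, `decode_msr` and `should_read_misc` are made up for this example. Under zero-based numbering, 0x407 decodes to bank index 1, i.e. the second bank; older model-specific layouts may number some banks differently.

```python
# Sketch only: architectural x86 machine-check bank layout, where each
# bank i owns four consecutive MSRs starting at 0x400 + 4 * i.
MC0_CTL = 0x400
REGS = ("CTL", "STATUS", "ADDR", "MISC")

# Flag bits in a bank's STATUS register.
STATUS_VAL = 1 << 63    # the bank holds a valid logged error
STATUS_MISCV = 1 << 59  # the bank's MISC register holds valid data


def bank_msr(bank, reg):
    """Return the MSR number of register `reg` in zero-based bank `bank`."""
    return MC0_CTL + 4 * bank + REGS.index(reg)


def decode_msr(msr):
    """Map an MSR number back to (zero-based bank index, register name)."""
    offset = msr - MC0_CTL
    if offset < 0:
        raise ValueError("not a machine-check bank MSR")
    return offset // 4, REGS[offset % 4]


def should_read_misc(status):
    """Touch MISC only if an error is logged and MISCV says MISC is valid."""
    return bool(status & STATUS_VAL) and bool(status & STATUS_MISCV)
```

If the BIOS leaves a stale STATUS value with MISCV set on a CPU that has no MISC register, `should_read_misc` would still say yes, which is the failure mode described above; skipping banks logged before boot (`mce=nobootlog`) avoids looking at such values at all.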
> Ingo, as I explained several times rdmsrl_safe() is very unlikely to
> be the correct fix. [...]
Andi, as i told you before, this is a robustness improvement: a 'turn
nasty boot crash into a debuggable boot warning' patch.
It does not fix this bug, obviously, nor does it claim to.
It is a robustness fix that you failed to add and which even today you
fail to realize the significance of.
There's a world of a difference between a boot crash and a boot warning.
Boot warnings can be debugged and reported much more easily than boot
crashes. Who knows how many boxes crashed silently due to the bug which
people were unable to report in a meaningful way.
Ingo
Here is the requested information... sorry for not being able to respond at once.

    router ~ # cat /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 7
    model name : Pentium III (Katmai)
    stepping : 3
    cpu MHz : 501.167
    cache size : 512 KB
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 2
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
    bogomips : 1002.33
    clflush size : 32
    power management:

And here's my last kernel (vanilla 2.6.29):

    router ~ # uname -a
    Linux router 2.6.29_router #1 Wed Mar 25 15:47:10 CET 2009 i686 Pentium III (Katmai) GenuineIntel GNU/Linux

Re #6: Ingo, ok if you don't treat it as the final fix for this I have no objections. It just didn't sound like that from your first comment, but I might have misread. The important part is that it gets reported, but the WARN_ON will ensure that.

Re #7: Gnutoo, can you please apply the patch I attached to this bug to 2.6.31, reenable machine checks and attach the resulting boot log after boot? Thanks.

This time it booted with the patch:

    root@router ~ # zcat /proc/config.gz | grep MCE
    CONFIG_X86_MCE=y
    # CONFIG_X86_OLD_MCE is not set
    CONFIG_X86_NEW_MCE=y
    CONFIG_X86_MCE_INTEL=y
    CONFIG_X86_MCE_AMD=y
    # CONFIG_X86_ANCIENT_MCE is not set
    CONFIG_X86_MCE_THRESHOLD=y
    # CONFIG_X86_MCE_INJECT is not set

I'll attach the dmesg.

Created attachment 23169 [details]
dmesg
> Re #6: Ingo, ok if you don't treat it as the final fix for this I have
> no objections. It just didn't sound like that from your first comment,
> but I might have misread. The important part is that it gets reported,
> but the WARN_ON will ensure that.

Andi, you either have serious reading comprehension problems, or you pretend that you didn't get the point i made. To quote the very plain language of the commit:

------------>
commit 11868a2dc4f5e4f2f652bfd259e1360193fcee62
Author: Ingo Molnar <mingo@elte.hu>
Date: Wed Sep 23 17:49:55 2009 +0200

x86: mce: Use safer ways to access MCE registers

[...]

So WARN_ONCE() instead of crashing the box.
<------------

Ingo

GNUtoo: the latest dmesg you sent with the debug patch applied does not have an MCE event printed in it. That suggests that this MCE event is sensitive to the specific kernel layout - or to other random factors.

What is the output of /proc/interrupts after bootup? It should contain an 'MCE' line like this:

MCE: 0 Machine check exceptions

It will be non-zero if you got MCE exceptions.

Does this system have any history of thermal instability?

Also, could you check the latest -tip tree at:

http://people.redhat.com/mingo/tip.git/README

and attach the dmesg with that booted. Do you get an MCE exception with that, and does the new WARN() trigger?
Thanks,
Ingo

    router ~ # cat /proc/interrupts
    CPU0
    0: 58287 XT-PIC-XT timer
    1: 2 XT-PIC-XT i8042
    2: 0 XT-PIC-XT cascade
    3: 1 XT-PIC-XT uhci_hcd:usb1
    4: 510 XT-PIC-XT serial
    7: 2 XT-PIC-XT
    9: 0 XT-PIC-XT acpi
    10: 1 XT-PIC-XT eth0
    11: 810 XT-PIC-XT b43
    12: 67 XT-PIC-XT CS46XX, eth1
    14: 2930 XT-PIC-XT ata_piix
    15: 0 XT-PIC-XT ata_piix
    NMI: 0 Non-mas Local timer interrupts
    SPU: 0 Spurious interrupts
    CNT: 0 Performance counter interrupts
    PND: 0 Performance pending work
    TRM: 0 Thermal event interrupts
    THR: 0 Threshold APIC interrupts
    MCE:

Maybe it was because it had just compiled the kernel and was too hot.

Well, that could easily have triggered a machine check event that, if it was mishandled, would have crashed the machine...

I'm resolving this ancient issue as "unreproducible". Might have been too hot or might have been cosmic rays. Please shout if this is incorrect.

(I'm not sure if you have to shout if you can reproduce this issue, but only while putting your computer in the oven. In that case it might theoretically be more correct to resolve this issue as "invalid"... but I'm not sure we care that deeply.)
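
For anyone repeating the /proc/interrupts check requested earlier in this thread, the lookup of the 'MCE' line can be sketched like this. It is a rough illustration only: the function name `mce_count` is made up here, and the sketch assumes the usual layout of one count column per CPU followed by a description.

```python
def mce_count(interrupts_text):
    """Sum the per-CPU counts on the 'MCE' line of /proc/interrupts text.

    Returns None when there is no MCE line or the line carries no counts
    (for example, when the output was cut off).
    """
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != "MCE":
            continue
        counts = []
        for field in rest.split():
            if not field.isdigit():
                break  # reached the textual description
            counts.append(int(field))
        return sum(counts) if counts else None
    return None
```

On a running system the text would come from reading /proc/interrupts; a non-zero result means machine check exceptions were delivered since boot.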