Bug 201291 - Ryzen 2500U won't boot recent kernels without mce=off
Status: RESOLVED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
Reported: 2018-09-30 03:50 UTC by clemej
Modified: 2018-12-05 20:50 UTC
CC List: 5 users

Kernel Version: All kernels 4.10 and up.
Tree: Mainline
Regression: No


Attachments
ACPI dump (519.79 KB, text/plain)
2018-09-30 03:50 UTC, clemej
dmesg from booting from last good commit (94.71 KB, text/plain)
2018-09-30 04:01 UTC, clemej
dmesg from normal debian 4.9 boot (65.28 KB, text/plain)
2018-09-30 04:07 UTC, clemej

Description clemej 2018-09-30 03:50:49 UTC
Created attachment 278845
ACPI dump

New HP EliteBook 745 G5, BIOS version 1.03.01. Ryzen PRO 2500U.

Booting any modern kernel (4.10+) hangs at boot on this system with no kernel messages displayed unless you disable MCE support (via mce=off). 

Knowing Debian's 4.9 kernel boots fine, I bisected Linus's tree, and it appears this commit is the culprit:


    18807ddb7f88d4ac3797302bafb18143d573e66f is the first bad commit
    commit 18807ddb7f88d4ac3797302bafb18143d573e66f
    Author: Yazen Ghannam <Yazen.Ghannam@amd.com>
    Date:   Tue Nov 15 15:13:53 2016 -0600

    x86/mce/AMD: Reset Threshold Limit after logging error
    
    The error count field in MCA_MISC does not get reset by hardware when the
    threshold has been reached. Software is expected to reset it. Currently,
    the threshold limit only gets reset during init or when a user writes to
    sysfs.
    
    If the user is not monitoring threshold interrupts and resetting
    the limit then the user will only see 1 interrupt when the limit is first
    hit. So if, for example, the limit is set to 10 then only 1 interrupt will
    be recorded after 10 errors even if 100 errors have occurred. The user may
    then assume that only 10 errors have occurred.


...although the few commits preceding this one are also all related to MCE support on AMD systems, so the hang may be the cumulative result of several commits.
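
The counting behaviour described in the quoted commit message can be illustrated with a toy model. The Python sketch below is only a simulation under stated assumptions: the function name and its parameters are hypothetical, and it models neither the real MCA_MISC register layout nor the kernel's actual code. It only shows why a counter that software never resets yields a single interrupt.

```python
def count_threshold_interrupts(errors, limit, reset_after_interrupt):
    """Toy model of an error-threshold counter (hypothetical, not kernel code).

    errors: number of errors that occur
    limit: threshold at which one interrupt is raised
    reset_after_interrupt: whether software resets the counter after
        each interrupt (the behaviour the quoted commit describes adding)
    """
    count = 0
    interrupts = 0
    for _ in range(errors):
        if count >= limit:
            # Counter is saturated. Hardware does not reset it, so
            # further errors raise no additional interrupts.
            continue
        count += 1
        if count == limit:
            interrupts += 1
            if reset_after_interrupt:
                # Software resets the counter so the next batch of
                # errors can reach the limit again.
                count = 0
    return interrupts


if __name__ == "__main__":
    # Numbers from the commit message: limit of 10, 100 errors.
    print(count_threshold_interrupts(100, 10, reset_after_interrupt=False))
    print(count_threshold_interrupts(100, 10, reset_after_interrupt=True))
```

With the commit message's numbers (limit 10, 100 errors), the model reports one interrupt without a reset and ten interrupts when the counter is reset after each one.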
Comment 1 clemej 2018-09-30 04:01:15 UTC
Created attachment 278847
dmesg from booting from last good commit
Comment 2 clemej 2018-09-30 04:07:25 UTC
Created attachment 278849
dmesg from normal debian 4.9 boot
Comment 3 Cristian Aravena Romero 2018-10-08 15:00:42 UTC
Hello,

Original Report:
https://bugs.launchpad.net/bugs/1796443

Best regards,
--
Cristian Aravena Romero (caravena)
Comment 4 Kai-Heng Feng 2018-10-11 07:18:57 UTC
I think it would be better to email the patch author and Cc the x86 mailing list.
Comment 5 Amit Prakash Ambasta 2018-10-12 12:38:28 UTC
*** Bug 201213 has been marked as a duplicate of this bug. ***
Comment 6 Rafał Miłecki 2018-11-27 10:21:43 UTC
https://marc.info/?l=linux-edac&m=154331383121359&w=2

[PATCH] x86/mce/AMD: Make sure banks were initialized before accessing them
Comment 7 Rafał Miłecki 2018-11-28 11:43:35 UTC
A proper fix has been provided by Borislav:

https://marc.info/?t=154334682000003&r=1&w=2

[PATCH] x86/MCE/AMD: Fix the thresholding machinery initialization order
Comment 8 Rafał Miłecki 2018-11-30 23:10:16 UTC
Fixed in Linus's tree with commit 60c8144afc28 ("x86/MCE/AMD: Fix the thresholding machinery initialization order").
Comment 9 Rafał Miłecki 2018-12-05 20:50:01 UTC
Fix became part of the following releases:
1) 4.20-rc5 (commit 60c8144afc28)
2) 4.19.7 (commit 00f91adf52af)
3) 4.14.86 (commit 855eefd9124a)
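
As a rough aid for reading the release list above, the following Python sketch checks whether a kernel release string is at or after one of the three listed fix points. The function name, the lookup table, and the parsing rules are illustrative assumptions. Distribution kernels may carry the fix as a backport under other version numbers, and any series not listed above is simply reported as not containing it.

```python
import re

# Minimum patch level per stable series, taken from the list above.
STABLE_FIX = {(4, 14): 86, (4, 19): 7}


def contains_fix(release):
    """Return True if `release` is at or after a listed fix point.

    Hypothetical helper: accepts strings such as "4.19.7",
    "4.20.0-rc5" or "5.4.0-42-generic".
    """
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(?:-rc(\d+))?", release)
    if m is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    rc = int(m.group(4)) if m.group(4) else None

    if series > (4, 20):
        return True
    if series == (4, 20):
        # Mainline fix landed in 4.20-rc5; earlier release candidates lack it.
        return rc is None or rc >= 5
    # Series without an entry in the table never compare as fixed.
    return patch >= STABLE_FIX.get(series, float("inf"))
```

On a running system the check could be applied as `contains_fix(platform.release())` after `import platform`, keeping in mind that a False result only means the version is not in the list above.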
