Bug 197285 - Apply erratum #688 fix to broken Bobcat-based processors
Status: RESOLVED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 enhancement
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-10-16 20:14 UTC by mirh
Modified: 2018-09-10 15:54 UTC

See Also:
Kernel Version: 4.14
Subsystem:
Regression: No
Bisected commit-id:


Attachments
test patch (1.99 KB, patch)
2017-10-21 10:16 UTC, Borislav Petkov

Description mirh 2017-10-16 20:14:29 UTC
https://patchwork.kernel.org/patch/9390769/
Hopefully this continues last year's discussion.

Possible test cases for this could be:
https://bugs.freedesktop.org/show_bug.cgi?id=98738
https://bugzilla.mozilla.org/show_bug.cgi?id=1335925#c1 (sadly confirmed under Windows only)
Comment 1 Borislav Petkov 2017-10-16 21:10:09 UTC
Please write in short first what exactly the problem is.

Thx.
Comment 2 mirh 2017-10-16 21:38:41 UTC
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=63
Erratum 688: Processor May Cause Unpredictable Program Behavior Under 
Highly Specific Branch Conditions

This bug can cause very nasty behavior like that reported in my first links, i.e. sporadic crashes that still don't make any sense even when debugged down to assembly.

Some diligent OEMs released BIOSes with an updated AGESA for affected systems, but:
a) not everybody updates their firmware
b) not everybody even can

The reported workaround consists of setting MSRC001_1021[14]=1b and MSRC001_1021[3]=1b,
conditional on D18F4x164[2]=0b

I thought you already had half of a patch somewhere, so I didn't think many pleasantries were needed.
Sorry.
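
The condition described above can be sketched in shell arithmetic. This only illustrates the bit logic and is not the kernel patch: the function name `erratum688_target` is made up for this example, and both register values are passed in as hex strings instead of being read from hardware.

```shell
# Hypothetical helper: given D18F4x164 and MSRC001_1021 as hex strings
# (no 0x prefix), print the MSR value the workaround would aim for.
erratum688_target() {
    d18f4x164=$(( 0x$1 ))
    msr=$(( 0x$2 ))
    if [ $(( d18f4x164 & (1 << 2) )) -eq 0 ]; then
        # D18F4x164[2] = 0b: set MSRC001_1021[3] and MSRC001_1021[14].
        msr=$(( msr | (1 << 3) | (1 << 14) ))
    fi
    printf '%x\n' "$msr"
}

# Illustrative inputs only, not values read from a real machine.
erratum688_target 3 0    # bit 2 clear: prints 4008
erratum688_target 4 0    # bit 2 set, MSR left alone: prints 0
```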
Comment 3 Borislav Petkov 2017-10-16 22:03:05 UTC
And does the patch help you?

I vaguely remember some issues. Patchwork has the thread with sonofagun@openmailbox.org saying ubuntu still fails and what not...?
Comment 4 mirh 2017-10-16 23:09:38 UTC
I didn't personally try any of your patches, because luckily my ASUS laptop has a patched BIOS.

That BIOS, together with the still-faulty one, is what I used to 'make' the aforementioned Windows test case.

Even then though, I'm pissed to see my hardware bug reported and blacklisted everywhere for this stupid issue. 
If that's needed to move this forward, I could try to find something reproducible on Linux too.

Regards.
Comment 5 Borislav Petkov 2017-10-17 08:32:01 UTC
(In reply to mirh from comment #4)
> Even then though, I'm pissed to see my hardware bug reported and
> blacklisted everywhere for this stupid issue.

That sentence I don't understand. Please elaborate.
Comment 6 mirh 2017-10-17 12:09:10 UTC
(In reply to Borislav Petkov from comment #5)
> (In reply to mirh from comment #4)
> > Even then though, I'm pissed to see my hardware bug reported and
> > blacklisted everywhere for this stupid issue.
> 
> That sentence I don't understand. Please elaborate.

As I said, this is a very nasty bug. You can see in the links from the Mozilla bug report above that even competent debugging people gave up trying to understand wtf was going on (since normally the first rule is never to blame the hardware).

Therefore, if the problem manifests more frequently with, say, hardware-accelerated UI (or some other 'optional' feature), the best they can do is disable that feature on the systems that correlate most strongly with crashes.
In Firefox's case this oddly meant blacklisting the APU's graphics card.

TL;DR all of this to mean that even though my very PC has already a fix installed, the problem still find a way to affect me by these ways. 
Thus having kernel assuring that not to occur anymore would really help. 

Hope you can find some time to dedicate to it. 
Best.
Comment 7 Borislav Petkov 2017-10-17 16:34:21 UTC
(In reply to mirh from comment #6)
> TL;DR all of this to mean that even though my very PC has already a fix
> installed, the problem still find a way to affect me by these ways.

Well, if it is fixed on your machine, how does it affect you? Because of
other people or because other projects end up blacklisting the GPU in
the APU and this way you're affected too?

Or what exactly is the problem you're seeing on your machine?

> Thus having kernel assuring that not to occur anymore would really help. 
> 
> Hope you can find some time to dedicate to it.

Sure, I can but only if there is something to test the fix on.

IOW, we'd need an affected machine which has the issue and on which one
can test the fix and can confirm that it works.

Thx.
Comment 8 mirh 2017-10-17 16:51:34 UTC
(In reply to Borislav Petkov from comment #7)
> (In reply to mirh from comment #6)
> > TL;DR all of this to mean that even though my very PC has already a fix
> > installed, the problem still find a way to affect me by these ways.
> 
> Well, if it is fixed on your machine, how does it affect you? Because of
> other people or because other projects end up blacklisting the GPU in
> the APU and this way you're affected too?

Absolutely the latter. 

> Sure, I can but only if there is something to test the fix on.
> 
> IOW, we'd need an affected machine which has the issue and on which one
> can test the fix and can confirm that it works.

As I mentioned my PC can do it too at will. 
I have no confirmed reliable linux test case though atm. 
?Looking into that?
Comment 9 Borislav Petkov 2017-10-17 19:10:11 UTC
(In reply to mirh from comment #8)
> Absolutely the latter.

Well, the blacklisting should be more fine-grained to detect the case
where the fix is present in the CPU, i.e., blacklist the GPU only when
D18F4x164[2] = 0b. Or better yet, apply the fix and *not* blacklist the
GPU.

:-)

> As I mentioned my PC can do it too at will.

What is "it" exactly?

You said your CPU has the fix. How then can it then trigger that bug
still?
Comment 10 mirh 2017-10-17 19:54:04 UTC
> Well, the blacklisting should be more fine-grained to detect the case
> where the fix is present in the CPU, i.e., blacklist the GPU only when
> D18F4x164[2] = 0b. Or better yet, apply the fix and *not* blacklist the
> GPU.

Yes, of course. 
But first, it's not like an engineer working from remote bug reports could figure this out.

But aside of that, even when fully aware of the situation.. How in the heck could a normal user-side code access MSR information?
So that's why I don't blame them. 

> > As I mentioned my PC can do it too at will.
> 
> What is "it" exactly?

it == affected machine w/ issue that can test+confirm the fix

> You said your CPU has the fix. How then can it then trigger that bug
> still?

I said I have a fixed *bios*, not cpu (which is even only ON-B0 stepping then). 
ASUS is so cute to also allow downgrading, so that's it.
Comment 11 Borislav Petkov 2017-10-17 20:12:09 UTC
(In reply to mirh from comment #10)
> But aside of that, even when fully aware of the situation.. How in the heck
> could a normal user-side code access MSR information?

It can - rdmsr/wrmsr but that above is a PCI register so you can use
setpci. JFYI anyway :-)

> I said I have a fixed *bios*, not cpu (which is even only ON-B0 stepping
> then). 
> ASUS is so cute to also allow downgrading, so that's it.

Ha, ok. Sure, if you can trigger a failure with downgraded BIOS - make
sure to check D18F4x164 by doing

setpci -s 18.4 0x164.l

and then look at bit 2 - then I can refresh the patch and you can try it
to see if the failure goes away...

Thx.
Comment 12 mirh 2017-10-17 21:38:44 UTC
(In reply to Borislav Petkov from comment #11)
> It can - rdmsr/wrmsr but that above is a PCI register so you can use
> setpci. JFYI anyway :-)

But you need root credentials to access all of that (kind whatever I was excluding with "normal") - and a browser certainly is not supposed to do that. 
 
> Ha, ok. Sure, if you can trigger a failure with downgraded BIOS - make
> sure to check D18F4x164 by doing
> 
> setpci -s 18.4 0x164.l
> 
> and then look at bit 2 - then I can refresh the patch and you can try it
> to see if the failure goes away...

setpci: Invalid width "1"

Then, sadly, I'm still looking for something just deciding to finally crash. 
Both firefox and [latest segfault scandal of] ryzen's utilities didn't help unfortunately.
Comment 13 Borislav Petkov 2017-10-17 21:58:46 UTC
(In reply to mirh from comment #12)
> But you need root credentials to access all of that (kind whatever I was
> excluding with "normal") - and a browser certainly is not supposed to do
> that.

I'm saying it is possible - not that it makes sense for a browser to do
it.

> > setpci -s 18.4 0x164.l
> > 
> > and then look at bit 2 - then I can refresh the patch and you can try it
> > to see if the failure goes away...
> 
> setpci: Invalid width "1"

That's a small "el" - not 1. Copy-paste the whole command and run as
root:

# setpci -s 18.4 0x164.l
00000003

is what I see here.
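
For reference, bit 2 of that output can be picked out with shell arithmetic. A minimal sketch, assuming the value is the bare hex string that setpci prints; the helper name `reg_bit` is invented for this example.

```shell
# Hypothetical helper: print bit N (0-based) of a hex string such as
# the one setpci prints (no 0x prefix).
reg_bit() {
    printf '%d\n' $(( (0x$1 >> $2) & 1 ))
}

reg_bit 00000003 2    # prints 0: bit 2 is clear in 00000003
# On a real machine, as root, the live register could be fed in with:
#   reg_bit "$(setpci -s 18.4 0x164.l)" 2
```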


> Then, sadly, I'm still looking for something just deciding to finally crash.
> Both firefox and [latest segfault scandal of] ryzen's utilities didn't help
> unfortunately.

You need to be careful here. The Ryzen utilities - whatever they are -
almost certainly don't have anything to do with a Bobcat bug. If they
did, then that would be a very happy coincidence.

However, I can't think how you could reproduce something like that.
Perhaps ask one of the people on the forums if they have a reliable
reproducer...
Comment 14 mirh 2017-10-17 23:34:55 UTC
(In reply to Borislav Petkov from comment #13)
> (In reply to mirh from comment #12)
> 00000003
> 
> is what I see here.

Me too. 

> You need to be careful here. The Ryzen utilities - whatever they are -
> almost certainly don't have anything to do with a Bobcat bug. If they
> did, then that would be a very happy coincidence.

Sure thing. It's just that I don't really have a lot of ideas.

> However, I can't think how you could reproduce something like that.
> Perhaps ask one of the people on the forums if they have a reliable
> reproducer...

Mhh, which forums? _Meaning_ mozilla's bugzilla?
They were amazing at hunting this down 4 years ago, but I'm not sure this hardware is really all that fashionable nowadays..
Comment 15 mirh 2017-10-20 18:38:54 UTC
Ok so.. I tried every damn mem/cpu-tester, benchmark, fuzzer, scourge of god out there, with zero results. 

Ultimately I got the smartest idea: to.. wait for it, use the Windows test case nonetheless!
I tried both in a VMware virtual machine running W7, and "natively" under wine-staging 2.18 with XP prefix (closed gpu drivers are required)
100% can reproduce.
Comment 16 Borislav Petkov 2017-10-20 18:41:55 UTC
(In reply to mirh from comment #15)
> I tried both in a VMware virtual machine running W7, and "natively" under
> wine-staging 2.18 with XP prefix (closed gpu drivers are required)
> 100% can reproduce.

What are the error messages you see when it happens? Or does the machine die?
Comment 17 mirh 2017-10-20 19:03:43 UTC
(In reply to Borislav Petkov from comment #16)
> What are the error messages you see when it happens? 

It says nothing. 
And I guess it can only make sense. 

I mean, in the VM case it's quite obvious why. 
Under Wine, on the other hand, leaving aside that I believe the compatibility layer already catches exceptions itself, the point is that Firefox has its own crash reporter, which additionally catches hangs and the like.

In fact, even under Windows the OS doesn't trigger any specific notification to warn the user.

> Or does the machine die?

Nope. It never affected _system_ stability, and indeed not even AMD reported having observed that.
Comment 18 Borislav Petkov 2017-10-20 21:34:35 UTC
So what does it do? What is the manifestation of the bug?

And then, how can you be sure it is caused by this erratum? It goes away
if you upgrade your BIOS?

But don't upgrade your BIOS. Try setting the MSR bits only, instead. As root:

# rdmsr -a 0xc0011021

Take value read and OR in bits 3 and 14:

# wrmsr -a 0xc0011021 $(( 0x<value> | (1 << 3) | (1 << 14) ))

And then rerun the tests.
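
As a worked example of the OR step: the sketch below assumes `rdmsr -a` prints one bare hex value per CPU (hence the 0x prefix in the command above) and computes the value to write for each line. It only does the arithmetic; `or_in_bits` is a made-up name and the inputs are placeholders, not values from an affected machine.

```shell
# Hypothetical helper: read bare hex values (one per line, as assumed for
# `rdmsr -a` output) and print each with bits 3 and 14 ORed in.
or_in_bits() {
    while read -r value; do
        printf '0x%x\n' $(( 0x$value | (1 << 3) | (1 << 14) ))
    done
}

# Placeholder input standing in for a two-CPU readout.
printf '%s\n' 1000 1000 | or_in_bits    # prints 0x5008 twice
```

Since `wrmsr -a` writes the single value it is given to every CPU, it is worth checking that all lines come out identical before using one of them.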
Comment 19 mirh 2017-10-21 09:46:45 UTC
Well, what can I say?
Crashes freaking disappeared finally.
Comment 20 Borislav Petkov 2017-10-21 09:54:14 UTC
Lemme make sure I understand this correctly:

you ran the vm, it crashed, you set the bits in the MSR, reran the vm and no more crashes.

Correct?
Comment 21 mirh 2017-10-21 10:00:02 UTC
*the program inside the VM crashed
*reran the program

Yes that's it. Also the same with the "wine way".
Comment 22 Borislav Petkov 2017-10-21 10:15:54 UTC
Ok, then try the attached patch. It is against 4.14-rc5+
Comment 23 Borislav Petkov 2017-10-21 10:16:28 UTC
Created attachment 260309 [details]
test patch
Comment 24 mirh 2017-10-21 18:13:44 UTC
_took 6 hours to compile kernel on this potato_

It worked!! No crashes in both tests. 
(and yes, reading the register now gives 1000c008)
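
That readback can be decoded to confirm both bits are present. A small sketch (the helper name `set_bits` is invented here) that lists which bit positions are set in a hex value:

```shell
# Hypothetical helper: list the positions of the set bits in a hex value.
# Assumes the value fits in 63 bits so the right shift terminates.
set_bits() {
    v=$(( 0x$1 ))
    i=0
    out=""
    while [ "$v" -ne 0 ]; do
        if [ $(( v & 1 )) -eq 1 ]; then
            out="$out $i"
        fi
        v=$(( v >> 1 ))
        i=$(( i + 1 ))
    done
    echo "${out# }"
}

set_bits 1000c008    # prints: 3 14 15 28
```

Bits 3 and 14 both appear in the list, which matches the two bits the workaround sets.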
Comment 25 Borislav Petkov 2017-10-21 20:19:00 UTC
Ok, thanks for testing.

Would you like me to add your Tested-by: tag to the final patch?
Comment 26 mirh 2017-10-21 20:22:29 UTC
Ehrm.. Uh, sure, ok. 
Thank you for your time and help.
Comment 27 Borislav Petkov 2017-10-21 20:26:27 UTC
Tested-by: <mirh@protonmail.ch>

or would you like me to add a name too?
Comment 28 mirh 2017-10-22 10:54:20 UTC
Nah, thank you. 
I have seen the entry on the ML. 

By the way, I just gave a quick read to the errata document I linked in comment 2.
I have no test cases for them, but what do you think of errata 435, 530, 551, 560, 596 and 629?
Comment 29 mirh 2017-11-21 20:06:05 UTC
Fixed in bfc1168de949cd3e9ca18c3480b5085deff1ea7c
Comment 30 mirh 2017-11-25 11:33:32 UTC
Sorry to bother again Borislav. 
I just noticed that D18F4x164 value in affected CPUs could even be 00000001h, as well as 00000003h. 

Is your patch equally good and working for both?
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=9
Comment 31 Borislav Petkov 2017-11-25 11:56:10 UTC
(In reply to mirh from comment #30)
> Sorry to bother again Borislav. 
> I just noticed that D18F4x164 value in affected CPUs could even be
> 00000001h, as well as 00000003h. 
> 
> Is your patch equally good and working for both?

I'd leave that as an exercise to the reader. ;^)
Comment 32 mirh 2017-11-25 12:17:02 UTC
I guess the exercise consists of understanding what that BIT function is supposed to do? 🙃
If it returns true when second bit is set, I guess it's fine. 

Thanks again.
Comment 33 mirh 2018-09-09 21:59:32 UTC
Ok I'm still here, doh. 
Could you check the backports?

It's super odd that 3.16 has it, but then nothing else til 4.13
https://www.spinics.net/lists/stable/msg195036.html
https://www.spinics.net/lists/stable/msg207448.html
Comment 34 Borislav Petkov 2018-09-10 05:44:37 UTC
(In reply to mirh from comment #33)
> It's super odd that 3.16 has it, but then nothing else til 4.13
> https://www.spinics.net/lists/stable/msg195036.html

Did you read the beginning of that mail? It explains why.

> https://www.spinics.net/lists/stable/msg207448.html

Ben backported that one for Debian.
Comment 35 mirh 2018-09-10 15:54:48 UTC
Yes, I meant odd as in counterintuitive, not unexplained.
