https://patchwork.kernel.org/patch/9390769/
Hopefully this continues last year's discussion.
Possible test cases could be:
https://bugs.freedesktop.org/show_bug.cgi?id=98738
https://bugzilla.mozilla.org/show_bug.cgi?id=1335925#c1
(sadly confirmed under Windows only)
Please write in short first what exactly the problem is. Thx.
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=63
Erratum 688: Processor May Cause Unpredictable Program Behavior Under Highly Specific Branch Conditions
This bug can cause very nasty behavior like that reported in my first links, i.e. sporadic crashes that still don't make any sense even when debugged down to assembly.
Some diligent OEMs released BIOSes with an updated AGESA for affected systems, but:
a) not everybody updates their firmware
b) not everybody even can
The reported solution consists of setting MSRC001_1021[14]=1b and MSRC001_1021[3]=1b, conditional on D18F4x164[2]=0b.
I thought you already had half of a patch somewhere, so I thought I didn't need many pleasantries. Sorry.
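The condition described above can be sketched as plain bit arithmetic. This is only an illustration, not the kernel patch: the function name `apply_erratum_688_workaround` and the constant names are hypothetical, and the code works on integer values that were already read from D18F4x164 and MSR C001_1021 rather than touching any hardware.

```python
# Hypothetical helper illustrating the workaround condition quoted from the
# erratum; it only does arithmetic on values that were read elsewhere.

D18F4X164_BIT2 = 1 << 2                      # D18F4x164[2]
MSR_WORKAROUND_BITS = (1 << 3) | (1 << 14)   # MSRC001_1021[3] and [14]


def apply_erratum_688_workaround(d18f4x164: int, msr_c0011021: int) -> int:
    """Return the MSR C001_1021 value that should be written back."""
    if d18f4x164 & D18F4X164_BIT2:
        # Bit 2 is set: the condition for the workaround is not met,
        # so the MSR value is returned unchanged.
        return msr_c0011021
    # Bit 2 is clear: set bits 3 and 14 and preserve every other bit.
    return msr_c0011021 | MSR_WORKAROUND_BITS
```

Because the update is a bitwise OR, applying it to a value that already has both bits set changes nothing, so running it twice is harmless.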
And does the patch help you? I vaguely remember some issues. Patchwork has the thread with sonofagun@openmailbox.org saying ubuntu still fails and what not...?
I didn't personally try any of your patches, because luckily my ASUS laptop has a patched BIOS, which (together with the still-faulty one) I used to 'make' the aforementioned Windows test case.
Even then though, I'm pissed to see my hardware bug reported and blacklisted everywhere for this stupid issue.
If it's needed to move this forward, I could try to find something reproducible on Linux too.
Regards.
(In reply to mirh from comment #4)
> Even then though, I'm pissed to see my hardware bug reported and
> blacklisted everywhere for this stupid issue.
That sentence I don't understand. Please elaborate.
(In reply to Borislav Petkov from comment #5)
> (In reply to mirh from comment #4)
> > Even then though, I'm pissed to see my hardware bug reported and
> > blacklisted everywhere for this stupid issue.
>
> That sentence I don't understand. Please elaborate.
As I said, this is a very nasty bug. You can see in the links from the above Mozilla bug report that even competent debugging people gave up trying to understand wtf was going on (since normally the first rule is never to blame the hardware).
Therefore, if this problem manifests more frequently when, say, hardware-accelerating the UI (or using whatever other 'optional' feature), the best they can do is disable it on the systems that correlate most strongly with the crashes. In Firefox's case this oddly meant blacklisting the APU's graphics card.
TL;DR all of this to mean that even though my very PC has already a fix installed, the problem still find a way to affect me by these ways.
Thus having kernel assuring that not to occur anymore would really help.
Hope you can find some time to dedicate to it.
Best.
(In reply to mirh from comment #6)
> TL;DR all of this to mean that even though my very PC has already a fix
> installed, the problem still find a way to affect me by these ways.
Well, if it is fixed on your machine, how does it affect you? Because of other people or because other projects end up blacklisting the GPU in the APU and this way you're affected too?
Or what exactly is the problem you're seeing on your machine?
> Thus having kernel assuring that not to occur anymore would really help.
>
> Hope you can find some time to dedicate to it.
Sure, I can but only if there is something to test the fix on.
IOW, we'd need an affected machine which has the issue and on which one can test the fix and can confirm that it works.
Thx.
(In reply to Borislav Petkov from comment #7)
> (In reply to mirh from comment #6)
> > TL;DR all of this to mean that even though my very PC has already a fix
> > installed, the problem still find a way to affect me by these ways.
>
> Well, if it is fixed on your machine, how does it affect you? Because of
> other people or because other projects end up blacklisting the GPU in
> the APU and this way you're affected too?
Absolutely the latter.
> Sure, I can but only if there is something to test the fix on.
>
> IOW, we'd need an affected machine which has the issue and on which one
> can test the fix and can confirm that it works.
As I mentioned my PC can do it too at will.
I have no confirmed reliable linux test case though atm.
?Looking into that?
(In reply to mirh from comment #8)
> Absolutely the latter.
Well, the blacklisting should be more fine-grained to detect the case where the fix is present in the CPU, i.e., blacklist the GPU only when D18F4x164[2] = 0b. Or better yet, apply the fix and *not* blacklist the GPU. :-)
> As I mentioned my PC can do it too at will.
What is "it" exactly?
You said your CPU has the fix. How then can it trigger that bug still?
> Well, the blacklisting should be more fine-grained to detect the case
> where the fix is present in the CPU, i.e., blacklist the GPU only when
> D18F4x164[2] = 0b. Or better yet, apply the fix and *not* blacklist the
> GPU.
Yes, of course. But first, it's not like this could be figured out by an engineer working from remote bug reports.
But aside of that, even when fully aware of the situation.. How in the heck could a normal user-side code access MSR information?
So that's why I don't blame them.
> > As I mentioned my PC can do it too at will.
>
> What is "it" exactly?
it == affected machine w/ issue that can test+confirm the fix
> You said your CPU has the fix. How then can it trigger that bug
> still?
I said I have a fixed *bios*, not cpu (which is even only ON-B0 stepping then).
ASUS is so cute to also allow downgrading, so that's it.
(In reply to mirh from comment #10)
> But aside of that, even when fully aware of the situation.. How in the heck
> could a normal user-side code access MSR information?
It can - rdmsr/wrmsr but that above is a PCI register so you can use setpci. JFYI anyway :-)
> I said I have a fixed *bios*, not cpu (which is even only ON-B0 stepping
> then).
> ASUS is so cute to also allow downgrading, so that's it.
Ha, ok. Sure, if you can trigger a failure with downgraded BIOS - make sure to check D18F4x164 by doing
setpci -s 18.4 0x164.l
and then look at bit 2 - then I can refresh the patch and you can try it to see if the failure goes away...
Thx.
(In reply to Borislav Petkov from comment #11)
> It can - rdmsr/wrmsr but that above is a PCI register so you can use
> setpci. JFYI anyway :-)
But you need root credentials to access all of that (kind whatever I was excluding with "normal") - and a browser certainly is not supposed to do that.
> Ha, ok. Sure, if you can trigger a failure with downgraded BIOS - make
> sure to check D18F4x164 by doing
>
> setpci -s 18.4 0x164.l
>
> and then look at bit 2 - then I can refresh the patch and you can try it
> to see if the failure goes away...
setpci: Invalid width "1"
Then, sadly, I'm still looking for something just deciding to finally crash.
Both firefox and [latest segfault scandal of] ryzen's utilities didn't help unfortunately.
(In reply to mirh from comment #12)
> But you need root credentials to access all of that (kind whatever I was
> excluding with "normal") - and a browser certainly is not supposed to do
> that.
I'm saying it is possible - not that it makes sense for a browser to do it.
> > setpci -s 18.4 0x164.l
> >
> > and then look at bit 2 - then I can refresh the patch and you can try it
> > to see if the failure goes away...
>
> setpci: Invalid width "1"
That's a small "el" - not 1. Copy-paste the whole command and run as root:
# setpci -s 18.4 0x164.l
00000003
is what I see here.
> Then, sadly, I'm still looking for something just deciding to finally crash.
> Both firefox and [latest segfault scandal of] ryzen's utilities didn't help
> unfortunately.
You need to be careful here. The Ryzen utilities - whatever they are - almost certainly don't have anything to do with a Bobcat bug. If they did, then that would be a very happy coincidence.
However, I can't think how you could reproduce something like that. Perhaps ask one of the people on the forums if they have a reliable reproducer...
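To "look at bit 2" of the value printed by setpci, the hex string can be parsed and masked. The sketch below is only an illustration; the helper name `bit_is_set` is hypothetical, and it assumes the output is a bare hexadecimal string like the `00000003` shown above.

```python
def bit_is_set(hex_string: str, bit: int) -> bool:
    """Parse a setpci-style hex string and test a single bit (bit 0 = LSB)."""
    value = int(hex_string.strip(), 16)
    return bool(value & (1 << bit))


# 0x00000003 is binary ...0011: bits 0 and 1 are set, bit 2 is clear.
print(bit_is_set("00000003", 2))
```

For `00000003` this reports that bit 2 is clear, which is the D18F4x164[2]=0b case in which the erratum's workaround is supposed to be applied.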
(In reply to Borislav Petkov from comment #13)
> (In reply to mirh from comment #12)
> 00000003
>
> is what I see here.
Me too.
> You need to be careful here. The Ryzen utilities - whatever they are -
> almost certainly don't have anything to do with a Bobcat bug. If they
> did, then that would be a very happy coincidence.
Sure thing. It's just that I don't really have a lot of ideas.
> However, I can't think how you could reproduce something like that.
> Perhaps ask one of the people on the forums if they have a reliable
> reproducer...
Mhh, which forums? _Meaning_ mozilla's bugzilla?
They were amazing at hunting this down 4 years ago, but I'm not sure this hardware is really all that fashionable nowadays..
Ok so.. I tried every damn mem/cpu-tester, benchmark, fuzzer, scourge of god out there, with zero results. Ultimately I got the smartest idea: to.. wait for it, use Windows test case nonetheless! I tried both in a VMware virtual machine running W7, and "natively" under wine-staging 2.18 with XP prefix (closed gpu drivers are required) 100% can reproduce.
(In reply to mirh from comment #15)
> I tried both in a VMware virtual machine running W7, and "natively" under
> wine-staging 2.18 with XP prefix (closed gpu drivers are required)
> 100% can reproduce.
What are the error messages you see when it happens? Or does the machine die?
(In reply to Borislav Petkov from comment #16)
> What are the error messages you see when it happens?
It says nothing, and I guess that only makes sense.
In the VM case it's quite obvious why. Under Wine, leaving aside that I believe the compatibility layer already takes care of catching exceptions itself, the point is that Firefox has its own crash reporter which additionally catches hangs or whatever.
In fact, even under Windows the OS doesn't show any specific notification to warn the user.
> Or does the machine die?
Nope. It never affected _system_ stability, and indeed not even AMD reports having observed that.
So what does it do? What is the manifestation of the bug?
And then, how can you be sure it is caused by this erratum? It goes away if you upgrade your BIOS?
But don't upgrade your BIOS. Try setting the MSR bits only, instead. As root:
# rdmsr -a 0xc0011021
Take value read and OR in bits 3 and 14:
# wrmsr -a 0xc0011021 $(( 0x<value> | (1 << 3) | (1 << 14) ))
And then rerun the tests.
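The "OR in bits 3 and 14" step can be checked offline before writing anything to the MSR. The following is a sketch under assumptions: the function name `msr_with_workaround` is hypothetical, the value read is assumed to be a hex string as implied by the `0x<value>` placeholder, and the input `10008000` is a made-up example rather than a value reported in this thread.

```python
def msr_with_workaround(rdmsr_output: str) -> int:
    """Return the value read from MSR 0xc0011021 with bits 3 and 14 set."""
    value = int(rdmsr_output.strip(), 16)   # hex string, with or without 0x
    return value | (1 << 3) | (1 << 14)     # same OR as the shell arithmetic


new_value = msr_with_workaround("10008000")   # made-up input value
print(f"{new_value:#x}")
```

The shell arithmetic in the wrmsr command computes the same number; it just hands it over in decimal. Since `rdmsr -a` reads the register on every CPU, the value should be checked for each of them.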
Well, what can I say? Crashes freaking disappeared finally.
Lemme make sure I understand this correctly: you ran the vm, it crashed, you set the bits in the MSR, reran the vm and no more crashes. Correct?
*the program inside the VM crashed
*reran the program
Yes that's it. Also the same with the "wine way".
Ok, then try the attached patch. It is against 4.14-rc5+
Created attachment 260309 [details] test patch
_took 6 hours to compile the kernel on this potato_
It worked!! No crashes in either test.
(and yes, reading the register now gives 1000c008)
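As a quick sanity check of the reported value, the bits set in 1000c008 can be listed. This is a small sketch; the helper name `set_bits` is hypothetical.

```python
def set_bits(value: int) -> list[int]:
    """Return the positions of all set bits, least significant first."""
    return [n for n in range(value.bit_length()) if value & (1 << n)]


print(set_bits(0x1000c008))  # [3, 14, 15, 28]
```

Bits 3 and 14 are both present in the list, which matches the two MSRC001_1021 bits the workaround is meant to set.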
Ok, thanks for testing. Would you like me to add your Tested-by: tag to the final patch?
Ehrm.. Uh, sure, ok. Thank you for your time and help.
Tested-by: <mirh@protonmail.ch> or would you like me to add a name too?
Nah, thank you. I have seen the entry on the ML. By the way, I just gave a quick read to the errata document I linked in comment 2. I have no test-cases and all then, but what do you think of numbers 435, 530, 551, 560, 596 and 629?
Fixed in bfc1168de949cd3e9ca18c3480b5085deff1ea7c
Sorry to bother again Borislav. I just noticed that D18F4x164 value in affected CPUs could even be 00000001h, as well as 00000003h. Is your patch equally good and working for both? http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf#page=9
(In reply to mirh from comment #30)
> Sorry to bother again Borislav.
> I just noticed that D18F4x164 value in affected CPUs could even be
> 00000001h, as well as 00000003h.
>
> Is your patch equally good and working for both?
I'd leave that as an exercise to the reader. ;^)
The exercise would consist in understanding what that BIT function would be supposed to do I guess? 🙃 If it returns true when second bit is set, I guess it's fine. Thanks again.
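For readers attempting the exercise: in the Linux kernel, `BIT(n)` is a macro that expands to a mask with only bit n set, counting from zero, so `BIT(2)` is 0x4 (the third least-significant bit). The sketch below assumes the patch tests D18F4x164 with a check of the form `!(val & BIT(2))`; that form is an assumption, since the patch itself is not reproduced in this thread.

```python
def BIT(n: int) -> int:
    """Python stand-in for the kernel's BIT(n) macro: only bit n is set."""
    return 1 << n


for val in (0x00000001, 0x00000003):
    # Assumed form of the check: workaround needed when bit 2 is clear.
    needs_workaround = not (val & BIT(2))
    print(f"{val:#010x}: needs workaround = {needs_workaround}")
```

Both 00000001h and 00000003h only have bits below bit 2 set, so under that assumption the check treats them identically and the workaround is applied in both cases.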
Ok I'm still here, doh. Could you check the backports? It's super odd that 3.16 has it, but then nothing else til 4.13 https://www.spinics.net/lists/stable/msg195036.html https://www.spinics.net/lists/stable/msg207448.html
(In reply to mirh from comment #33)
> It's super odd that 3.16 has it, but then nothing else til 4.13
> https://www.spinics.net/lists/stable/msg195036.html
Did you read the beginning of that mail? It explains why.
> https://www.spinics.net/lists/stable/msg207448.html
Ben backported that one for Debian.
Yes, I meant odd as in counterintuitive, not unexplained.