Bug 85911 - rdrand instruction fails after resume on AMD family 22 CPU
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
Reported: 2014-10-08 20:36 UTC by Hin-Tak Leung
Modified: 2019-05-09 17:33 UTC
CC List: 7 users

Kernel Version: 3.16.3-200
Tree: Fedora
Regression: No


Description Hin-Tak Leung 2014-10-08 20:36:12 UTC
After suspend/resume on a recent AMD CPU, the rdrand instruction fails.
Symptoms are that openssl fails to generate keys (while trying to do kernel
module signing), and ssh to any host does not work.

The problem was eventually diagnosed by disabling OpenSSL's use of that instruction when running ssh - i.e. "OPENSSL_ia32cap=~0x4000000000000000 ssh ..." works.
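
For reference, the mask in the workaround can be reproduced with a little arithmetic. This is only a sketch: it assumes the OPENSSL_ia32cap layout described in the OpenSSL documentation, where bits 32 and up of the 64-bit vector mirror ECX from CPUID leaf 1 (RDRAND being ECX bit 30) and a leading "~" clears the listed bits. The helper names cap_bit, rdrand_cap_mask and clear_caps are made up for illustration and are not OpenSSL functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: OPENSSL_ia32cap bit 32+n mirrors bit n of ECX from CPUID leaf 1. */
#define CPUID1_ECX_RDRAND_BIT 30u

/* Single-bit mask for capability bit n (hypothetical helper). */
uint64_t cap_bit(unsigned n)
{
    assert(n < 64);
    return UINT64_C(1) << n;
}

/* Mask with only the RDRAND capability bit set: bit 32 + 30 = 62. */
uint64_t rdrand_cap_mask(void)
{
    return cap_bit(32u + CPUID1_ECX_RDRAND_BIT);
}

/* Effect of a "~mask" setting: clear the masked bits in a capability vector. */
uint64_t clear_caps(uint64_t caps, uint64_t mask)
{
    return caps & ~mask;
}
```

Under that assumption, rdrand_cap_mask() is the 0x4000000000000000 used above, and clear_caps() shows why every other capability bit is left alone.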

/proc/cpuinfo
https://bugzilla.redhat.com/attachment.cgi?id=944940

dmesg
https://bugzilla.redhat.com/attachment.cgi?id=944986

and more details are described in:
https://bugzilla.redhat.com/show_bug.cgi?id=1150286
Comment 1 Hin-Tak Leung 2014-10-10 22:42:06 UTC
The error message from generating the module signing key during a kernel build is:

Generating a 4096 bit RSA private key
.Error Generating Key
48012007979840:error:0307A071:bignum routines:BN_rand_range:too many iterations:bn_rand.c:269:
48012007979840:error:04081003:rsa routines:RSA_BUILTIN_KEYGEN:BN lib:rsa_gen.c:515:
make[1]: *** [signing_key.x509] Error 1
make: *** [kernel] Error 2
Comment 2 Hin-Tak Leung 2014-11-23 19:40:04 UTC
Still a problem with 3.17.3-200.fc20.x86_64.

It is also affecting "wget https://..." now, e.g.:

$ wget -m https://gmplib.org/~tege/x86-timing.pdf
--2014-11-22 23:25:00--  https://gmplib.org/~tege/x86-timing.pdf
Resolving gmplib.org (gmplib.org)... 37.252.124.96
Connecting to gmplib.org (gmplib.org)|37.252.124.96|:443... connected.
OpenSSL: error:0307A071:bignum routines:BN_rand_range:too many iterations
OpenSSL: error:1409802B:SSL routines:SSL3_SEND_CLIENT_KEY_EXCHANGE:reason(43)
Unable to establish SSL connection.

The workaround "OPENSSL_ia32cap=~0x4000000000000000 ..." continues to work; I am inclined to set it system-wide until a fix can be found... Is there anything I can help with? I am not against looking at the kernel source myself.
Comment 3 Hin-Tak Leung 2014-12-15 18:49:18 UTC
I don't know what changed, but I upgraded from Fedora 20 to Fedora 21 yesterday, thereby switching from 3.17.6-200.fc20.x86_64 to 3.17.6-300.fc21.x86_64, and both ssh and generating the module signing key during a kernel build now work. As it happens, I had started to script/alias wget to always set OPENSSL_ia32cap=~0x4000000000000000 only half a day before upgrading, so it definitely did not work a mere 10 hours before I booted f21 for the first time.

This can be closed, I think, though I would like to know whether it is userland or the kernel which affects it; I'll boot into the older f20 kernel with f21 userland soon, just to see which way it is.
Comment 4 H. Peter Anvin 2014-12-15 21:06:38 UTC
Given that it kicks in after suspend/resume it is probably a BIOS bug which calls for a kernel workaround.

In particular I am *guessing* that there is some AMD-specific feature control register (presumably something equivalent to MSR_IA32_MISC_ENABLE) which needed to be saved and restored?

Booting the fc20 kernel with the fc21 userland would be useful, as would examining the third-party patches that go into the -300.fc21 kernel.
Comment 5 Hin-Tak Leung 2014-12-16 00:20:14 UTC
The fc20 kernel actually behaves correctly under an fc21 userland, strangely enough; and I am certain that it did not with the fc20 userland, as I had been on the same kernel for a week and had to do my usual OPENSSL_ia32cap=~0x4000000000000000 for wget/ssh until as recently as 10 hours before the upgrade, I think.

The URL for the Fedora kernel patches is http://pkgs.fedoraproject.org/cgit/kernel.git/ .
Comment 6 H. Peter Anvin 2014-12-16 00:28:23 UTC
That would imply that openssl contains either a fix or a workaround.  It is hard to see what kind of userspace change could fix a suspend/resume problem, though...
Comment 7 H. Peter Anvin 2014-12-16 00:30:28 UTC
Oh no, openssl just disabled the use of RDRAND:

https://software.intel.com/en-us/blogs/2014/10/03/changes-to-rdrand-integration-in-openssl

So the problem still remains.  It would be good if someone with access to the relevant AMD hardware and documentation could look at this.
Comment 8 Hin-Tak Leung 2014-12-16 01:11:21 UTC
That makes sense - it had crossed my mind to set OPENSSL_ia32cap=~0x4000000000000000 or equivalent system-wide, either early in /etc/sysconfig or by replacing my /usr/lib64/libssl.so*, since I have been setting it manually for over two months. They have just beaten me to it... I suspect Intel won't mention or comment that their competitor doesn't implement that instruction correctly and/or needs a special save+reset, so I'd rather hear directly from the openssl people why the change was made.

FWIW, I was on openssl-1.0.1e-40.fc20.x86_64 and now openssl-1.0.1j-1.fc21.x86_64 , so indeed I have moved from pre 1.0.1f to post 1.0.1f.

I'd be happy to test things if there is a simple way of testing it independently of openssl...
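
One possible openssl-independent check is to execute RDRAND directly and look at both the carry flag and the returned values. The following is a rough sketch only, untested on the affected hardware. It assumes GCC or Clang on x86-64 (for <cpuid.h> and the inline assembly), and it assumes the fault would show up either as the carry flag being clear or as the same value coming back repeatedly; nothing in this report confirms which of the two happens. All names (rdrand_report, cpu_has_rdrand, rdrand64, probe_rdrand) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__)
#include <cpuid.h> /* GCC/Clang wrapper around the CPUID instruction */
#endif

struct rdrand_report {
    int supported;   /* 1 if CPUID advertises RDRAND, else 0 */
    int cf_failures; /* attempts where the carry flag signalled failure */
    int repeats;     /* successful attempts equal to the previous value */
};

/* Returns 1 if CPUID leaf 1 advertises RDRAND (ECX bit 30), else 0. */
int cpu_has_rdrand(void)
{
#if defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> 30) & 1u);
#else
    return 0; /* not x86-64: treat as unsupported */
#endif
}

/* Executes RDRAND once; returns 1 if the carry flag reported success. */
int rdrand64(uint64_t *out)
{
#if defined(__x86_64__)
    unsigned char ok;
    __asm__ volatile("rdrand %0; setc %1" : "=r"(*out), "=qm"(ok) : : "cc");
    return ok;
#else
    (void)out;
    return 0;
#endif
}

/* Draws `samples` values and tallies carry-flag failures and repeated values. */
struct rdrand_report probe_rdrand(int samples)
{
    struct rdrand_report r = {0, 0, 0};
    uint64_t prev = 0, cur = 0;
    int have_prev = 0;

    assert(samples >= 0);
    r.supported = cpu_has_rdrand();
    if (!r.supported)
        return r;
    for (int i = 0; i < samples; i++) {
        if (!rdrand64(&cur)) {
            r.cf_failures++;
            continue;
        }
        if (have_prev && cur == prev)
            r.repeats++;
        prev = cur;
        have_prev = 1;
    }
    return r;
}
```

Calling probe_rdrand() from a small main() once after a fresh boot and again after a suspend/resume cycle, then comparing the two reports, would show whether the instruction itself misbehaves without involving openssl at all.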
Comment 9 Marc Bejarano 2019-05-08 09:00:05 UTC
Someone with the proper permission bits may want to update this bug title to refer to AMD family 22 and up the priority of this bug.

See https://github.com/systemd/systemd/issues/11810#issuecomment-490275562
Comment 10 Marc Bejarano 2019-05-08 09:05:35 UTC
Also, there's this patch from https://paste.fedoraproject.org/paste/Qhao0f9NszPj8K9EgCSbnw referenced in https://github.com/systemd/systemd/issues/11810#issuecomment-490284361

diff --git a/src/basic/random-util.c b/src/basic/random-util.c
index ca25fd2420..b7cfc1bc2d 100644
--- a/src/basic/random-util.c
+++ b/src/basic/random-util.c
@@ -58,6 +58,8 @@ int rdrand(unsigned long *ret) {
         msan_unpoison(&err, sizeof(err));
         if (!err)
                 return -EAGAIN;
+        if (*ret == 0 || *ret == ULONG_MAX) /* filter out obvious crap, in case of AMD */
+                return -EAGAIN;
 
         return 0;
 #else
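
The check added by the patch above can also be exercised in isolation. The sketch below is not systemd's code: it wraps an arbitrary value source in a bounded retry loop and rejects the same two values the patch treats as suspicious (0 and ULONG_MAX). The callback type random_source_fn, the function filtered_random and the two stub sources are hypothetical names introduced only so the logic can be tried without the affected CPU.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical source callback: returns 0 and stores a value, or a negative errno. */
typedef int (*random_source_fn)(unsigned long *ret);

/* Tries `source` up to max_tries times, skipping failures and the two
 * suspicious values filtered by the patch. Returns 0 or -EAGAIN. */
int filtered_random(random_source_fn source, unsigned max_tries, unsigned long *ret)
{
    assert(source != NULL && ret != NULL);
    for (unsigned i = 0; i < max_tries; i++) {
        unsigned long v;
        if (source(&v) < 0)
            continue; /* source reported failure */
        if (v == 0 || v == ULONG_MAX)
            continue; /* same filter as the patch */
        *ret = v;
        return 0;
    }
    return -EAGAIN;
}

/* Stub imitating a generator stuck at all-ones (illustration only). */
int stuck_source(unsigned long *ret)
{
    *ret = ULONG_MAX;
    return 0;
}

/* Stub imitating a healthy generator with a fixed value (illustration only). */
int fixed_source(unsigned long *ret)
{
    *ret = 42;
    return 0;
}
```

With stuck_source the wrapper gives up with -EAGAIN, so a caller can fall back to another entropy source; with fixed_source it succeeds on the first try. The trade-off is that a genuine 0 or ULONG_MAX from a healthy generator is rejected as well, which excludes two of the possible output values.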
Comment 11 Hin-Tak Leung 2019-05-09 17:33:05 UTC
Changed the title (me being the original reporter...).

Not sure if it is appropriate to change priority though - on the whole, I think the priority should be set by the people who are going to spend time attempting to fix the issue, rather than by the reporter(s)/users experiencing the issue.
