Bug 216502

Summary: slow crng initialization on Rockchip 3399 (Friendyarm NanoPi M4)
Product: Drivers Reporter: Mikhail Rudenko (mike.rudenko)
Component: OtherAssignee: drivers_other
Status: NEW ---    
Severity: normal CC: Jason, mike.rudenko, regressions
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 6.0.0-rc5 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg of the bad case
dmesg of the good case
kernel .config

Description Mikhail Rudenko 2022-09-17 22:59:40 UTC
Created attachment 301826 [details]
dmesg of the bad case

Crng initialization time has significantly increased on Rockchip 3399-based Friendyarm NanoPi M4 in v6.0.0-rc5 compared to v5.18.0 (from 2.36s since rootfs mounted to 8.55s since rootfs mounted). Bisection points at 78c768e619fb ("random: vary jitter iterations based on cycle counter speed"). Manually reverting that commit on top of v6.0.0-rc5 fixes the issue. Attached dmesg logs correspond to vanilla v6.0.0-rc5 ("bad") and v6.0.0-rc5 with 78c768e619fb reverted ("good").
Comment 1 Mikhail Rudenko 2022-09-17 23:00:20 UTC
Created attachment 301827 [details]
dmesg of the good case
Comment 2 Mikhail Rudenko 2022-09-17 23:01:06 UTC
Created attachment 301828 [details]
kernel .config
Comment 3 Jason A. Donenfeld 2022-09-19 09:19:34 UTC
Hmm. I'm not totally sure this is a regression. Before that commit, the behavior would be:

- r1 = read_counter()
- r2 = read_counter()
- if (r1 == r2) do not try to use jitter entropy.

With the commit you identified, the new behavior is that it jitters harder when there are repeat counters after reading a few thousand of them.

So what would happen before this change is that sometimes, the system would entirely fail to initialize using jitter, because r1==r2. Not it always succeeds, but makes sure to spend more time doing to so to make sure it's done well.
Comment 4 Jason A. Donenfeld 2022-09-19 09:20:00 UTC
s/Not it/Now it/
Comment 5 Mikhail Rudenko 2022-09-19 09:28:03 UTC
Well, this results in a big startup time increase: "multi-user.target reached after 4.477s in userspace" before the commit in question and 10.861s after. Does this qualify as a regression?
Comment 6 Jason A. Donenfeld 2022-09-19 09:44:03 UTC
I'm pretty sure that if you revert that commit, then some of the time, it will reach that target after many, many, many seconds, if ever. That seems worse, no?

Maybe in the future we can improve the jitter algo even more with some more advanced statistics and tests.
Comment 7 Mikhail Rudenko 2022-09-19 10:01:36 UTC
I just can say, I have never hit the bad case you speak about ("if ever") in practice on this hardware. I understand that the commit in question fixes an issue important for some configurations, but in my case of an embedded device boot time is critical. Could we add a choice (maybe a kconfig option) between quick'n'dirty and reliable'n' slow algorithms? Or I miss something and there is an existing workaround?
Comment 8 Jason A. Donenfeld 2022-09-19 10:07:02 UTC
No, not going to add an option. But I will look into whether your case can be improved. I think I have a RK3399 around here somewhere. Curious to learn: what's your userland like? Which version of systemd?
Comment 9 Mikhail Rudenko 2022-09-19 10:19:55 UTC
Thanks for looking into this! I did my tests with yocto 3.1 (dunfell) and systemd 244.5. If you suspect userspace issues, I can rerun the tests with some more recent and conventional distro (like Ubuntu lts) in a few days.
Comment 10 Jason A. Donenfeld 2022-09-19 10:29:19 UTC
More recent systemds will likely ameliorate the issue you're seeing. But that doesn't stop my being interested by the kernel issue.

As a preliminary test, think you could try out this patch I just cooked up?
https://git.zx2c4.com/linux-dev/patch/?id=rockchip

In theory, that should wire up the rk3399's hardware RNG up to Linux.
Comment 11 Jason A. Donenfeld 2022-09-19 10:31:21 UTC
And of course don't forget to set CONFIG_HW_RANDOM_ROCKCHIP=y after applying that.
Comment 12 Mikhail Rudenko 2022-09-19 10:39:51 UTC
Great, will test your patch this evening.
Comment 13 Jason A. Donenfeld 2022-09-19 10:41:36 UTC
Looking forward to that. If it does work, if you include in your reply the string `Tested-by: First Last <email@email.email>`, I'll stick that into the commit when I post it to LKML.
Comment 14 Mikhail Rudenko 2022-09-19 20:48:10 UTC
Looks like it worked out:

before 78c768e619fb: multi-user.target reached after 4.831s in userspace
after 78c768e619fb: multi-user.target reached after 11.360s in userspace
with your patch: multi-user.target reached after 3.333s in userspace

Thanks for the quick fix!

Tested-by: Mikhail Rudenko <mike.rudenko@gmail.com>
Comment 15 Jason A. Donenfeld 2022-09-19 20:56:39 UTC
Hey hey! Better than where we started. Now we're talking.

I'll post that to LKML shortly.
Comment 17 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-11-15 10:24:19 UTC
(In reply to Jason A. Donenfeld from comment #16)
> https://lore.kernel.org/linux-crypto/20220919210025.2376254-1-Jason@zx2c4.
> com/

I still have this issue on list of tracked regressions, as there was no patch or commit that linked to this ticket with a link tag.

Unless I missed something, Janson's driver never made it anywhere. And the other one linked in that thread might slowly making its way upstream afaics: https://lore.kernel.org/all/Y1tiN9GSMrudUG6d@gondor.apana.org.au/ 

Does that satisfy everyone for now?
Comment 18 Jason A. Donenfeld 2022-11-15 11:03:41 UTC
Am I "Janson" in this case?

If so, I understand that there's a better driver that supercedes the thing I whipped together real quick.
Comment 19 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-11-15 11:13:57 UTC
(In reply to Jason A. Donenfeld from comment #18)
> Am I "Janson" in this case?

Thx for the reply and apologies, I know your real name, but seems by fingers had something else in mind. :-/ 
 
> If so, I understand that there's a better driver that supercedes the thing I
> whipped together real quick.

Well, it's still a regression that ideally should have been fixed weeks ago -- but well, OTOH it's quite special case, so maybe waiting for that driver is the lesser evil if that's okay for Mikhail.
Comment 20 Mikhail Rudenko 2022-11-15 11:20:04 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #19)
> (In reply to Jason A. Donenfeld from comment #18)
> > Am I "Janson" in this case?
> 
> Thx for the reply and apologies, I know your real name, but seems by fingers
> had something else in mind. :-/ 
>  
> > If so, I understand that there's a better driver that supercedes the thing
> I
> > whipped together real quick.
> 
> Well, it's still a regression that ideally should have been fixed weeks ago
> -- but well, OTOH it's quite special case, so maybe waiting for that driver
> is the lesser evil if that's okay for Mikhail.

Well, I've applied the quick fix locally and it works for me. But I think we should wait until the proper driver is merged.
Comment 21 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-11-15 14:51:19 UTC
(In reply to Mikhail Rudenko from comment #20)

> Well, I've applied the quick fix locally and it works for me. But I think we
> should wait until the proper driver is merged.

Okay, but FWIW, to avoid surprises in that case it might be wise if you'd actually check rather sooner than later if the proposed driver really fixes the problem (
https://lore.kernel.org/lkml/20220927075511.3147847-1-clabbe@baylibre.com/ )
Comment 22 Mikhail Rudenko 2022-11-16 17:25:31 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #21)
> Okay, but FWIW, to avoid surprises in that case it might be wise if you'd
> actually check rather sooner than later if the proposed driver really fixes
> the problem (
> https://lore.kernel.org/lkml/20220927075511.3147847-1-clabbe@baylibre.com/ )

I've just repeated my tests with vanilla 6.1-rc5. The issue persists, unless Jason's
patch is applied. Moreover, the series you mentioned does NOT fix the problem. The
crypto devices initialize succesfully (as seen from dmesg), but boot times are as high
as they were on vanilla 6.1-rc5. Quick look shows that rk3288_crypto driver does not
register hwrng device unlike Jason's driver. Probably the rng part from Jason's driver
should be applied on top of Corentin's series, but I don't have enough time budget for doing
this now :(
Comment 23 Jason A. Donenfeld 2022-11-16 17:34:53 UTC
Stop talking in this Bugzilla. Nobody cares about or reads Bugzilla except for a few strange folks like me and Thorsten.

Instead, chime into that upstream patch thread and make sure the people working on this code know the results of your testing. I probably won't monitor this issue anymore, so if you want the rockchip fixed, you'll need to be proactive on the mailing list with other devs.