Bug 211317 - [5.8.9 -> 5.11(5.9?) regression] Constant hard freezes with "BUG: Bad page state in process swapper/8", works fine with previous kernel
Status: NEW
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-23 14:00 UTC by Ellie
Modified: 2021-02-15 23:38 UTC (History)
0 users

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
journalctl -k from the 5.11 test that froze (115.25 KB, text/plain)
2021-01-23 14:00 UTC, Ellie

Description Ellie 2021-01-23 14:00:43 UTC
Created attachment 294815
journalctl -k from the 5.11 test that froze

After Fedora's downstream kernel was upgraded from 5.8 to 5.9, a new bug appeared: there are now constant hard freezes with "BUG: Bad page state in process swapper/8" as the last log entry. I have now reproduced this with a vanilla 5.11 kernel build provided by a Fedora community member: 5.11.0-0.rc4.129.vanilla.1.fc33.x86_64. The logs are attached.

Downstream bug (for Fedora kernel) is here: https://bugzilla.redhat.com/show_bug.cgi?id=1899805

This bug makes affected machines close to unusable, or at least it does mine, since it usually hits after around 5 to 15 minutes. I've been running the old Fedora 5.8 kernel ever since. It seems to be hardware-dependent: as you can see on the Fedora ticket, it appears to be Ryzen-related. I personally have a first-generation Ryzen from an early batch; I'm not sure about the exact CPUs of the others who commented.

Since this is a work machine, I would prefer to exhaust other options first, such as retrying with enhanced logging for whatever component might be responsible, before testing a long series of versions to narrow it down. I've had bad experiences with file systems that, after too many hard resets in a row, eventually develop errors that corrupt data. (And while I have backups, knowing what even needs to be restored from backup in such cases can be difficult.)
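
For anyone triaging a saved copy of the kernel log, a minimal sketch of a filter for the message quoted above may help locate every occurrence, not only the last one. The function name find_bad_page_lines is hypothetical, and the only assumption about the log is that the message text appears on a single line exactly as quoted in this report.

```python
import re

# Pattern taken from the message quoted in this report. The token after
# "in process" (e.g. "swapper/8") is captured so hits can be grouped.
BAD_PAGE = re.compile(r"BUG: Bad page state in process (\S+)")


def find_bad_page_lines(log_text):
    """Return a list of (line_number, process_name, line) for each match.

    log_text is the full text of a saved kernel log, for example the
    output of `journalctl -k` redirected to a file.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        match = BAD_PAGE.search(line)
        if match:
            hits.append((number, match.group(1), line.strip()))
    return hits
```

Reading the saved log into a string and passing it to find_bad_page_lines gives the line numbers to inspect for the surrounding page dump and call trace.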
Comment 1 Ellie 2021-02-15 23:38:47 UTC
Is there any debug data I could possibly provide?

Also, since this makes affected machines effectively unusable and is a regression, maybe it should get some sort of higher priority. I have no idea how many machines or users might be affected, though; the original Red Hat ticket doesn't have many comments, so it might be something really specific. Nevertheless, right now this bug means Linux newer than 5.8 is effectively dead to me on an otherwise perfectly working Ryzen machine that was 100% compatible before, so that does seem like a potential issue to me. It certainly makes me think about just moving to Debian Stable or FreeBSD or something until this is looked at, since I have to be able to work somehow.

So is there something I could do to help, short of hanging my computer another 20+ times for a bisect? Any "super debug" option, or modules worth trying to turn off? (I've never done a kernel bisect anyway, so I'm not even sure how to compile the kernel for that.)
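
As a rough way to size the bisect effort: a bisect halves the range of candidate commits with each tested build, so the number of test boots grows with the base-2 logarithm of the commit count. The sketch below computes that estimate. The function name bisect_steps is hypothetical, and the commit count is an input to look up, not a figure taken from this report.

```python
def bisect_steps(commit_count):
    """Estimate how many builds a bisect needs for commit_count candidates.

    Each tested build halves the remaining range, so the estimate is
    ceil(log2(commit_count)). Integer arithmetic avoids float rounding.
    This is an approximation; a real bisect can take a step more or less.
    """
    if commit_count < 1:
        raise ValueError("commit_count must be at least 1")
    return (commit_count - 1).bit_length()
```

With the real number of commits between the last good and first bad kernel, this gives an approximate count of freezes a full bisect would require.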
