Bug 210333 - system freezes stutters /gpu/drm/nouveau/dispnv50/disp.c:211 nv50_dmac_wait+0x22f/0x280 [nouveau]
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(Other)
Hardware: Intel Linux
Importance: P1 high
Assignee: drivers_video-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-23 20:22 UTC by roman
Modified: 2021-04-01 14:18 UTC
3 users

See Also:
Kernel Version: 5.9.10
Subsystem:
Regression: No
Bisected commit-id:


Attachments
error of journalctl (3.77 KB, text/plain)
2020-11-23 20:22 UTC, roman

Description roman 2020-11-23 20:22:13 UTC
Created attachment 293793
error of journalctl

When I use my notebook for usual tasks, at some random time the whole system freezes and stutters very badly.

This must have been introduced in nouveau with the 5.9 kernel (like many other bugs).

After about 15-20 minutes, when I try accessing the terminal, I get my system back.
Comment 1 Gnatty 2021-01-11 01:24:27 UTC
I have a similar issue, which seems widely reported:
https://gitlab.gnome.org/GNOME/gnome-shell/-/issues/3369
https://bugzilla.redhat.com/show_bug.cgi?id=1894257
https://bbs.archlinux.org/viewtopic.php?id=259770
https://forum.artixlinux.org/index.php/topic,2083.0.html

It seems related to the introduction of the 5.9 kernel, although on the Arch Linux forum report the details were:
5.8.14-arch1-1 nvidia 455.28-4
so in that case if it was the same thing, then it appeared with the nvidia driver in the last 5.8 revision.
So a similar bug might have been introduced in different places at slightly different times. Fixing mine may not fix everyone's, but it could help identify what went wrong.

I'm using nouveau on an nv50-family card. I'm trying to identify the commit; so far v5.9-rc1 = good, v5.9-rc4 = bad. There are only 4 nouveau commits in that range, which appeared in rc3, and I'm now testing the commit before them. The bug usually appears within 24 hours of my normal use. Because its occurrence is random, it's difficult ever to be sure of a good result, but after 48 hours without a freeze I am calling a build good, and the results so far are consistent with this.
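
The range check and bisect described above could be sketched as follows. This is a minimal sketch, not necessarily the procedure actually used here: the helper names `list_nouveau_commits` and `start_nouveau_bisect` and the example checkout path are hypothetical, and it assumes a git checkout of the kernel that contains the v5.9-rc1 and v5.9-rc4 tags.

```shell
# Hypothetical helper: list the commits touching nouveau between two revisions.
# $1 = path to a Linux git checkout, $2 = known-good revision, $3 = known-bad revision
list_nouveau_commits() {
    git -C "$1" log --oneline "$2..$3" -- drivers/gpu/drm/nouveau
}

# Hypothetical helper: start a bisect restricted to the nouveau directory.
# git bisect start takes the bad revision first, then the good one.
start_nouveau_bisect() {
    git -C "$1" bisect start "$3" "$2" -- drivers/gpu/drm/nouveau
}

# Example, run manually (the path is an assumption):
#   list_nouveau_commits ~/src/linux v5.9-rc1 v5.9-rc4
#   start_nouveau_bisect ~/src/linux v5.9-rc1 v5.9-rc4
#   ...build and boot the checked-out revision, use it for about 48 hours, then:
#   git bisect good     # or: git bisect bad
```

Limiting the bisect to one directory keeps the number of kernels to build small, which matters when each verdict takes a day or two of normal use.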
Comment 2 Gnatty 2021-01-12 19:44:42 UTC
> ca386aa7155a drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW bug
> <--- now building here to try next          
> a9cfcfcad50c drm/nouveau/kms/nv50-gp1xx: disable notifies again after core update
> 35dde8d40636 drm/nouveau/kms/nv50-: add some whitespace before debug message  
> a255e9c8694d drm/nouveau/kms/gv100-: Include correct push header in crcc37d.c 
> fc8c70526bd3 drm/radeon: Prefer lower feedback dividers  <--- this seems good
Comment 3 Gnatty 2021-01-13 23:23:38 UTC
ca386aa7155a - bad
Comment 4 Gnatty 2021-01-17 03:28:52 UTC
a9cfcfcad50c seemed good, so ca386aa7155a looks like the suspect commit (probably).
A recent patch looked like it might fix this, but building at f4e087c666f54559cb4e530af1fbfc9967e14a15 Fri Jan 15 10:55:33 2021 -0800 still has the bug present.
This is on an NVIDIA G96GLM [Quadro FX 770M] in a Dell Precision M4400.
Comment 5 roman 2021-01-17 09:26:14 UTC
First try to apply the patch mentioned here:

https://gitlab.freedesktop.org/drm/nouveau/-/issues/14#note_768256
Comment 6 Gnatty 2021-01-17 22:32:20 UTC
It looks interesting, but the patch describes a file whose initial content differs from the Torvalds Linux tree on GitHub: 0x114 wasn't there in the first place, looking at the changes between v5.8 and v5.9-rc1, which is when the conversion to the new macros took place. There are also some other similar files (507, 807, 907) where the patch looks like it could be applied too. These probably relate to various types of graphics card, and I haven't yet found which file would apply to my card, as it isn't the same model (it might be 827, or not).
Anyway, while looking at that I was using a9cfcfcad50c as the last known good, but it has just recently shown itself to be bad after all, so according to my testing results so far a9cfcfcad50c now looks like the bad commit.
Comment 7 Gnatty 2021-01-18 01:55:11 UTC
While building a new kernel with the patch, I looked at some old logs of this bug that I had saved in the past, and found that the function does appear during the freezes:
 base827c_image_set+0x2f/0x1d0 [nouveau]
 base507c_update+0x30/0x70 [nouveau]
 base507c_ntfy_set+0x2f/0x90 [nouveau]
Those were being called repeatedly in that order, producing a stack trace.

https://pastebin.com/dnXzN7Cb
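
To check whether the same functions recur in a saved log, the nouveau symbols in the call traces can be counted. This is a minimal sketch, assuming trace lines in the `symbol+0xOFFSET/0xSIZE [nouveau]` form shown above; the helper name `nouveau_symbols` is hypothetical.

```shell
# Hypothetical helper: read a kernel log on stdin and count how often each
# nouveau symbol appears in call-trace lines.
nouveau_symbols() {
    # Keep only "symbol+0xOFFSET/0xSIZE [nouveau]" fragments,
    # strip everything from the offset onward, then tally by symbol.
    grep -oE '[A-Za-z0-9_.]+\+0x[0-9a-f]+/0x[0-9a-f]+ \[nouveau\]' |
        sed -E 's/\+0x.*//' |
        sort | uniq -c | sort -rn
}

# Example, run manually against the kernel messages of the current boot:
#   journalctl -k -b | nouveau_symbols
```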
Comment 8 Gnatty 2021-01-19 18:37:21 UTC
While testing that patch, after less than a day of use, the fan suddenly sped up and the system did a hard power-off on its own within about a second. Now, on pressing the power button, the POST screen shows but the bar does not progress at all and it shuts down again. I suspect the GPU has failed.
I think the patch may have been intended for the skeggsb modified Linux kernel and did not work well with the Torvalds Linux tree.
I can't recommend it for general use yet; I think it may need more work. Although it cured the desktop freezes, it was not quite how I had envisaged!! :)
Perhaps if I get things repaired I'll return to this later.
Comment 9 Bastian Beranek 2021-01-21 11:18:29 UTC
Hello Gnatty, I am the author of the patch referenced by roman in #5. Thanks for pointing out that there are other places in the code which are similarly affected; I have updated the patch (https://gitlab.freedesktop.org/drm/nouveau/-/issues/14#note_773217). It should apply on top of the torvalds master branch. Does it not do so for you?
Comment 10 Gnatty 2021-01-21 21:46:34 UTC
I can't view that page at present; it won't display, as I only have an old Pentium 3 laptop working for now. But I have another motherboard on order for my usual laptop, as the GPU is integrated into the mobo.
The patch applied fine; it just caused a GPU failure on my system. I have a theory about that, though. If you look at the patch and the commit which I thought to be the problem, both have a part that isn't preceded by an NV* code. On that thread I think you were testing with an 880 GPU, while I was on a 770, and your change was in the 827 file. I think the other codes in there were all preceded by NV827? So what if that is saying to the firmware: do this if you are >827 (or some equivalent)? For you it works; for me it was an unsupported command, hence it fried my GPU instead.
Comment 11 Gnatty 2021-01-30 02:17:32 UTC
$ sudo trace-cmd record -p function -l base827c_image_set
$ sudo trace-cmd report
cpus=2
    kworker/u8:1-11515 [001]  2488.838179: function:             base827c_image_set <-- nv50_wndw_flush_set
    kworker/u8:1-11515 [001]  2488.927266: function:             base827c_image_set <-- nv50_wndw_flush_set
    kworker/u8:1-11515 [000]  2489.206874: function:             base827c_image_set <-- nv50_wndw_flush_set

$ sudo trace-cmd record -p function -l core507d_update
$ sudo trace-cmd report
cpus=2
            Xorg-2520  [001]  4613.901274: function:             core507d_update <-- nv50_disp_atomic_commit_tail
    kworker/u8:3-3624  [001]  4613.910188: function:             core507d_update <-- nv50_disp_atomic_commit_core.isra.0
            Xorg-2520  [000]  4614.131704: function:             core507d_update <-- nv50_disp_atomic_commit_tail


This is an interesting thing: those "problem" functions get called a lot without causing any problems. I'm not sure yet which paths within them are taken; that would take more investigation. But what if (as seems likely) the additional code is fine? My results so far seem to suggest it could be the amount of code. A smaller buffer makes the problem more likely to occur. Adding Lyude's patch makes it appear. Adding Bastian's as well caused mobo death. So what if, under some rare condition (perhaps a race condition where 2 things write at once, or the buffer is not cleared), too much is written to the buffer, or its contents get mangled? With shorter instructions there was room for both, or the mangling was not so disruptive, and the problem was not seen. This also fits with the random appearance.
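
How often each caller reaches a traced function can be tallied from the report output. This is a minimal sketch, assuming the `trace-cmd report` line format shown above (task, CPU, timestamp, `function:`, callee, `<--`, caller); the helper name `tally_callers` is hypothetical.

```shell
# Hypothetical helper: read `trace-cmd report` output on stdin and count
# each "callee <-- caller" pair, most frequent first.
tally_callers() {
    # Function-tracer lines have "function:" as the 4th whitespace-separated
    # field; the callee is field 5 and the caller is field 7.
    awk '$4 == "function:" { count[$5 " <-- " $7]++ }
         END { for (k in count) print count[k], k }' |
        sort -rn
}

# Example, run manually after a recording such as the ones above:
#   sudo trace-cmd report | tally_callers
```

Run against the three core507d_update lines above, this prints a count of 2 for nv50_disp_atomic_commit_tail and 1 for nv50_disp_atomic_commit_core.isra.0.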
Comment 12 Gnatty 2021-03-13 21:28:36 UTC
What the hell have you done now...
So I have been using the LTS kernel on Artix linux, I noticed it has gone to 5.10 - then today the same as before, the fans sped up, the laptop died and won't boot anymore.
I can only presume that patch has just killed another mobo. What moron ported that into a release kernel?
Sure it works on some hardware, but is DEATH to others.
AFAICT you are just playing with the contents of a buffer overflow, not fixing the problem.
Hasn't anyone noticed the patch author, while undoubtedly talented and enthusiastic, is actually a university student who was just trying some stuff to see if it works, with no idea what he is doing? I mean, he even says so himself in the linked thread.
Comment 13 Karol Herbst 2021-03-15 09:16:54 UTC
(In reply to Gnatty from comment #12)
> What the hell have you done now...
> So I have been using the LTS kernel on Artix linux, I noticed it has gone to
> 5.10 - then today the same as before, the fans sped up, the laptop died and
> won't boot anymore.
> I can only presume that patch has just killed another mobo. What moron
> ported that into a release kernel?
> Sure it works on some hardware, but is DEATH to others.
> AFAICT you are just playing with the contents of a buffer overflow, not
> fixing the problem.
> Hasn't anyone noticed the patch author, while undoubtedly talented and
> enthusiastic, is actually a university student who was just trying some
> stuff to see if it works, with no idea what he is doing? I mean, he even
> says so himself in the linked thread.

Nouveau shouldn't be able to kill your mobo, especially not a patch like this... Sorry to hear that, but I think you are just being unlucky getting faulty motherboards over and over again :/
Comment 14 Gnatty 2021-04-01 14:18:19 UTC
Very bad luck, then. Within a short time of trying the patch, the board failed, yet it had been working fine for the previous 2 years of ownership; when I got the laptop and stripped it to service the fan and cooling fins and renew the heatsink compound, it appeared never to have been serviced before. The replacement board worked until a few days after the patched LTS 5.10 kernel was used, and unlike before I was not trying to encourage the freeze bug to appear, i.e. by using Terminator. Bad luck is certainly a possibility, although I would rate it a slim one after 2 failures in association with this.
Writing malformed commands (if that's what's happening) could well cause failures:
https://security.stackexchange.com/questions/65153/is-there-any-virus-that-can-cause-physical-damage/65160
A Quadro FX770M in an M4400 probably has a special VBIOS optimised for business use, e.g. CAD programs. It's typical for laptops to be modified like this, and they may not respond the same as others.
I also "fixed" the freeze bug - by reverting those 2 patches I noted. The other patch reportedly fixed it on some machines. The common link was changing the contents of the same push buffer, and the 2 fixes affected the 3 functions in the freeze bug error messages. Also the patch which made the push buffer smaller seemed to affect the frequency of the freeze bug occurring. I am not suggesting any of these patches in themselves represent faulty code, but they may be interacting in an unexpected way with some underlying issue. My (unconfirmed) theory is that neither addresses the real cause, especially as logging those problem functions shows they are called all the time without incident. But I do not dare add further logging code on this computer without risking upsetting the delicate balance, perhaps I will get some other machine that can display the bug and is documented to not be affected by that patch sometime, who knows.
Early udisks versions caused floppy drive failure in some cases by polling the drive every few seconds, and that was only a desktop library. Most TVs and so on have hidden service menus accessed by special key combinations on the remote; these are detailed only in service manuals, and some settings could render the TV inoperable. Even my simple Casio digital watch has a secret option, although not a harmful one: press all 3 buttons together to do a display segment test, then press a button to exit.
