Bug 206299

Summary: [nouveau/xen] RTX 20XX instant reboot
Product: Drivers
Reporter: Frédéric Pierret (frederic.epitre)
Component: Video (DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: blocking
CC: imirkin
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.4.X
Regression: No
Attachments: Kernel log, kernel log (dmesg)

Description Frédéric Pierret 2020-01-24 23:24:23 UTC
Created attachment 286963 [details]
Kernel log

Hi,
On several kernels (4.19.X, 5.3.X, and the latest 5.4), I'm having an issue with an NVIDIA RTX 2080 Ti (also reported by another user with an RTX 2070: https://groups.google.com/forum/#!msg/qubes-devel/ozOQrOHsUBQ/XtIQsGm3DgAJ) causing a lot of instant reboots of the machine. Specifically, the distribution is Qubes OS, so Xen is under the hood. On a standard Fedora 31 live CD I did not manage to reproduce the crash, which is easily reproducible in Qubes (e.g., with massive and intensive window resizing).

Thanks to the help of Marek Marczykowski-Górecki, I obtained the following attached kernel log using netconsole.

Any help would be much appreciated.

Frédéric Pierret
Comment 1 Ilia Mirkin 2020-01-25 00:45:22 UTC
Comment on attachment 286963 [details]
Kernel log

badf5040 = bad mmio read.

Could there be some PCI situation? Can you include a full boot log?
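
For context: values of the form 0xbadfXXXX are what the GPU's priv ring hands back when a register read faults, so "data badf5040" here most likely means the error handler read back garbage rather than real method data. A minimal sketch of how such a read-back could be recognized; the helper name and mask are illustrative, not taken from the nouveau tree:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: treat a 0xbadfXXXX read-back as a faulted
 * register access rather than meaningful method data. */
static bool looks_like_faulted_read(uint32_t value)
{
	return (value & 0xffff0000u) == 0xbadf0000u;
}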
Comment 2 Frédéric Pierret 2020-01-25 09:51:25 UTC
Created attachment 286967 [details]
kernel log (dmesg)
Comment 3 Frédéric Pierret 2020-01-25 09:51:38 UTC
Hi Ilia,
Thank you for your answer.

(In reply to Ilia Mirkin from comment #1)
> Comment on attachment 286963 [details]
> Kernel log
> 
> badf5040 = bad mmio read.
> 
> Could there be some PCI situation? Can you include a full boot log?

You'll find dmesg.log attached. By "PCI situation" do you mean a hardware issue? If so, the card works normally under Windows. For your information, the GPU remains attached to dom0; it is not PCI-passthroughed to a domU.
Comment 4 Frédéric Pierret 2020-01-26 15:02:09 UTC
Hi,
While debugging it, I found that the exception comes from gv100_disp_intr_exc_other in gv100.c, because stat = 0x00001800.

I'm trying to figure out what got messed up in the 'disp' structure, but I'm doing it step by step, starting by searching for NULL pointers. Any advice on how to proceed?
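
A minimal sketch of that kind of step-by-step check in the disp channel code; the chan->mthd and chan->chid.user field names are assumptions about the surrounding structures, not verified against every kernel version:

/* Sketch only (hypothetical helper): report which link in the chain is
 * missing instead of dereferencing it; an oops in dom0 takes the whole
 * machine down. */
static bool chan_mthd_table_present(struct nv50_disp_chan *chan,
				    struct nvkm_subdev *subdev)
{
	const struct nv50_disp_chan_mthd *mthd = chan->mthd;

	if (!mthd) {
		nvkm_warn(subdev, "chid %d has no method table\n",
			  chan->chid.user);
		return false;
	}
	return true;
}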

Thank you.
Comment 5 Ilia Mirkin 2020-01-26 15:07:03 UTC
Your kernel log doesn't have anything too weird in it (which is good). However I did see a similar type of error with someone using coreboot (admittedly, with an MCP77 IGP). Are you using a non-original booting mechanism? Given that there's signed firmware situations going on, we can't just re-POST the GPU easily, unlike in the MCP77 case.

The mmio read failures may be a red herring -- basically we try to figure out why the error happened, and get bad mmio reads in the process. Could just be that the error handler hasn't been properly adjusted for Turing, and reads from bad places.

I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have something clever to say.
Comment 6 Frédéric Pierret 2020-01-26 15:55:32 UTC
(In reply to Ilia Mirkin from comment #5)
> Your kernel log doesn't have anything too weird in it (which is good).
> However I did see a similar type of error with someone using coreboot
> (admittedly, with an MCP77 IGP). Are you using a non-original booting
> mechanism? Given that there's signed firmware situations going on, we can't
> just re-POST the GPU easily, unlike in the MCP77 case.

I'm using the standard default BIOS (legacy mode).

> The mmio read failures may be a red herring -- basically we try to figure
> out why the error happened, and get bad mmio reads in the process. Could
> just be that the error handler hasn't been properly adjusted for Turing, and
> reads from bad places.
> 
> I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have
> something clever to say.

Hope so, and thank you again for your feedback.
Comment 7 Frédéric Pierret 2020-01-26 20:20:05 UTC
With Marek, we think we have found the problem. In the nv50_disp_chan_mthd function, the exact NULL pointer dereference is on mthd->data[0].mthd. More precisely, mthd->data is not NULL, but mthd->data[0] seems to be.

Trying to access mthd->data[0], we get:
  BUG: kernel NULL pointer dereference, address: 0000000000000010
while trying to access mthd->data[0].mthd, we get:
  BUG: kernel NULL pointer dereference, address: 0000000000000020

So this is exactly the issue. Any idea why mthd->data would be fine but not mthd->data[0]?
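
Worth noting: 0x10 and 0x20 are exactly the small offsets you would expect if the base pointer itself were NULL and only members at those offsets were being dereferenced. A standalone illustration with a struct laid out the way nv50_disp_chan_mthd appears to be in channv50.h (the exact layout is an assumption):

#include <stdio.h>
#include <stddef.h>

struct mthd_list;                      /* opaque for this sketch */

struct chan_mthd {                     /* modeled on nv50_disp_chan_mthd */
	const char *name;
	unsigned int addr;
	int prev;
	struct {
		const char *name;
		int nr;
		const struct mthd_list *mthd;
	} data[];
};

int main(void)
{
	/* With this layout, data[] begins 0x10 bytes into the struct and
	 * data[0].mthd another 0x10 bytes further, so reading them through
	 * a NULL base pointer faults at 0x10 and 0x20 respectively. */
	printf("data[0]      at 0x%zx\n", offsetof(struct chan_mthd, data[0]));
	printf("data[0].mthd at 0x%zx\n", offsetof(struct chan_mthd, data[0].mthd));
	return 0;
}

In other words, both fault addresses are also consistent with mthd itself being NULL, which is what the next test confirms.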
Comment 8 Frédéric Pierret 2020-01-26 21:45:26 UTC
We found more information!

The previous tests were done with these added lines:

--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -75,13 +75,25 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)
        if (debug > subdev->debug)
                return;
 
+       nvkm_warn(subdev, "mthd: %p", mthd);
+       nvkm_warn(subdev, "mthd->data: %p", mthd->data);
+       nvkm_warn(subdev, "&mthd->data[0]: %p", &mthd->data[0]);
+       nvkm_warn(subdev, "mthd->data[0].mthd: %p", mthd->data[0].mthd);
        for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {

which gave the following crash log:

[   45.513617] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[   45.513633] nouveau 0000:26:00.0: disp: mthd: 00000000dfa55708
[   45.513638] nouveau 0000:26:00.0: disp: mthd->data: 00000000858af80f
[   45.513641] nouveau 0000:26:00.0: disp: &mthd->data[0]: 00000000858af80f

But replacing "%p" with "%lx" revealed that mthd is actually NULL:

[   74.753207] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[   74.753223] nouveau 0000:26:00.0: disp: mthd: 0
[   74.753226] nouveau 0000:26:00.0: disp: mthd->data: 10
[   74.753231] nouveau 0000:26:00.0: disp: &mthd->data[0]: 10
[   74.753241] BUG: kernel NULL pointer dereference, address: 0000000000000020
[   74.753244] #PF: supervisor read access in kernel mode

That gives some hints!
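
The hint is this: plain %p hashes pointer values before printing them (which is why a NULL mthd came out as 00000000dfa55708 in the first attempt), while %lx with a cast, or the %px specifier, prints the raw value. The same debug lines using %px would have shown the NULL directly; illustrative variant only, not the patch that was actually tested:

+       nvkm_warn(subdev, "mthd: %px", mthd);                    /* raw value: 0 */
+       nvkm_warn(subdev, "&mthd->data[0]: %px", &mthd->data[0]); /* raw value: 0x10 */

The 0x10 here is just the offset of data[] within the struct, matching the fault addresses seen earlier.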
Comment 9 Frédéric Pierret 2020-01-26 22:02:18 UTC
A rather simple and temporary fix we found is to add:

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
index bcf32d92ee5a..50e3539f33d2 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -74,6 +74,8 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)
 
        if (debug > subdev->debug)
                return;
+       if (!mthd)
+               return;
 
        for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {
                u32 base = chan->head * mthd->addr;

With that, the machine remains stable.
Comment 10 Frédéric Pierret 2020-01-28 08:36:01 UTC
One last piece of information: each time I try to reproduce the freeze, thanks to the fix I now see a second message in the kernel log:

[  814.207723] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[  814.207749] nouveau 0000:26:00.0: bus: MMIO read of 00000000 FAULT at 611390 [ IBUS ]

And it is always repeated as these two lines.
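
That pairing fits the earlier analysis: the PUSHBUFFER_ERR exception fires the disp error handler, whose attempt to read back details then faults, and the bus subdev reports that as the "MMIO read ... FAULT at 611390 [ IBUS ]" line. 0x611390 sits inside the display engine's register window, which points back at the disp error path as the source of the bad read. A tiny illustrative check; the range bounds are an assumption, not taken from the nouveau source:

#include <stdbool.h>
#include <stdint.h>

/* Illustration only: NV50-and-later display registers live roughly in
 * 0x610000..0x63ffff, so 0x611390 is a disp register. */
static bool in_disp_reg_range(uint32_t addr)
{
	return addr >= 0x610000u && addr < 0x640000u;
}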