Created attachment 286963 [details]
Kernel log

Hi,

On several kernels (4.19.x, 5.3.x, and the latest 5.4), I'm having an issue with an NVIDIA RTX 2080 Ti (also reported by another user with an RTX 2070: https://groups.google.com/forum/#!msg/qubes-devel/ozOQrOHsUBQ/XtIQsGm3DgAJ) causing many instant reboots of the machine. Specifically, the distribution is Qubes OS, so Xen is under the hood. On a standard Fedora 31 live CD I did not manage to reproduce the crash, which is easily reproducible in Qubes (e.g. with massive and intensive resizing of windows).

Thanks to the help of Marek Marczykowski-Górecki, I obtained the attached kernel log using netconsole.

Any help would be very appreciated.

Frédéric Pierret
Comment on attachment 286963 [details]
Kernel log

badf5040 = bad mmio read.

Could there be some PCI situation? Can you include a full boot log?
Created attachment 286967 [details] kernel log (dmesg)
Hi Ilia,

Thank you for your answer.

(In reply to Ilia Mirkin from comment #1)
> Comment on attachment 286963 [details]
> Kernel log
>
> badf5040 = bad mmio read.
>
> Could there be some PCI situation? Can you include a full boot log?

You'll find dmesg.log attached. By "PCI situation", do you mean a hardware issue? If so, the card works normally under Windows. For your information, the GPU remains attached to dom0; it is not passed through to a domU.
Hi,

While debugging I found that the exception comes from gv100_disp_intr_exc_other in gv100.c, because stat = 0x00001800. I'm trying to figure out what is messed up in the 'disp' structure, but I'm doing it step by step, starting by searching for NULL pointers. Any advice on how to proceed?

Thank you.
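As a side note, stat = 0x00001800 simply means that bits 11 and 12 are set in the exception status register; which channels those bits map to is defined in gv100_disp_intr_exc_other itself, so the little standalone sketch below (plain userspace C, nothing taken from the nouveau sources) only shows the bit decoding, not the channel mapping:

#include <stdio.h>

int main(void)
{
	unsigned int stat = 0x00001800;	/* value observed in the crash */
	int bit;

	/* print every set bit; in the handler, groups of bits select
	 * which display channels had an exception */
	for (bit = 0; bit < 32; bit++) {
		if (stat & (1u << bit))
			printf("bit %d is set\n", bit);	/* prints 11 and 12 */
	}
	return 0;
}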
Your kernel log doesn't have anything too weird in it (which is good). However, I did see a similar type of error with someone using coreboot (admittedly, with an MCP77 IGP). Are you using a non-original booting mechanism? Given that there are signed-firmware constraints involved, we can't just re-POST the GPU easily, unlike in the MCP77 case.

The mmio read failures may be a red herring -- basically we try to figure out why the error happened, and get bad mmio reads in the process. It could just be that the error handler hasn't been properly adjusted for Turing, and reads from bad places.

I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have something clever to say.
(In reply to Ilia Mirkin from comment #5)
> Your kernel log doesn't have anything too weird in it (which is good).
> However I did see a similar type of error with someone using coreboot
> (admittedly, with an MCP77 IGP). Are you using a non-original booting
> mechanism? Given that there's signed firmware situations going on, we can't
> just re-POST the GPU easily, unlike in the MCP77 case.

I'm using the standard default BIOS (legacy mode).

> The mmio read failures may be a red herring -- basically we try to figure
> out why the error happened, and get bad mmio reads in the process. Could
> just be that the error handler hasn't been properly adjusted for Turing, and
> reads from bad places.
>
> I'm afraid this is out of my knowledge base, sorry. Perhaps Ben will have
> something clever to say.

I hope so, and thank you again for your feedback.
With Marek, we think we have found the problem. In the nv50_disp_chan_mthd function, the exact NULL pointer dereference is mthd->data[0].mthd. Precisely, mthd->data is not NULL, but mthd->data[0] seems to be.

Trying to access mthd->data[0], we get:

BUG: kernel NULL pointer dereference, address: 0000000000000010

while trying to access mthd->data[0].mthd, we get:

BUG: kernel NULL pointer dereference, address: 0000000000000020

So this is exactly the issue. Any idea why mthd->data would be valid but mthd->data[0] NULL?
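One way to read those fault addresses: when the base pointer in a chain like mthd->data[0].mthd is NULL, the address reported by the page-fault handler is simply the byte offset of the field relative to that NULL base, so very small addresses such as 0x10 and 0x20 usually mean a struct pointer near the start of the chain is NULL rather than a pointer stored inside it. A minimal standalone illustration of the principle follows; the struct layout below is hypothetical, not the real nv50_disp_chan_mthd definition from channv50.h, so the printed offsets will not necessarily match 0x10 and 0x20:

#include <stdio.h>
#include <stddef.h>

/* hypothetical stand-in for one entry of a method table */
struct fake_entry {
	const char *name;
	const void *mthd;
};

/* hypothetical stand-in for the containing method table */
struct fake_table {
	const char *name;
	unsigned int addr;
	int prev;
	struct fake_entry data[];	/* flexible array, like mthd->data[i] */
};

int main(void)
{
	/* with a NULL base, &table->data[0] is simply offsetof(data),
	 * and table->data[0].mthd faults at offsetof(data) + offsetof(mthd) */
	printf("&table->data[0]     -> 0x%zx\n",
	       offsetof(struct fake_table, data));
	printf("table->data[0].mthd -> 0x%zx\n",
	       offsetof(struct fake_table, data) +
	       offsetof(struct fake_entry, mthd));
	return 0;
}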
We found more information! The previous tests were done with these added lines:

--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -75,13 +75,25 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)
 	if (debug > subdev->debug)
 		return;
+	nvkm_warn(subdev, "mthd: %p", mthd);
+	nvkm_warn(subdev, "mthd->data: %p", mthd->data);
+	nvkm_warn(subdev, "&mthd->data[0]: %p", &mthd->data[0]);
+	nvkm_warn(subdev, "mthd->data[0].mthd: %p", mthd->data[0].mthd);
 	for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {

which gave the following crash log:

[ 45.513617] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[ 45.513633] nouveau 0000:26:00.0: disp: mthd: 00000000dfa55708
[ 45.513638] nouveau 0000:26:00.0: disp: mthd->data: 00000000858af80f
[ 45.513641] nouveau 0000:26:00.0: disp: &mthd->data[0]: 00000000858af80f

But after replacing "%p" with "%lx", it turned out that mthd is actually NULL:

[ 74.753207] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[ 74.753223] nouveau 0000:26:00.0: disp: mthd: 0
[ 74.753226] nouveau 0000:26:00.0: disp: mthd->data: 10
[ 74.753231] nouveau 0000:26:00.0: disp: &mthd->data[0]: 10
[ 74.753241] BUG: kernel NULL pointer dereference, address: 0000000000000020
[ 74.753244] #PF: supervisor read access in kernel mode

That gives some hints!
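For anyone hitting the same confusion: recent kernels hash pointer values printed with "%p" (to avoid leaking kernel addresses), and at least on the kernel used here NULL gets hashed like any other value, which is why mthd looked like a plausible address in the first log. Printing the raw value, either by casting to unsigned long with "%lx" as above or with the unhashed "%px" specifier, exposes the real value. A small sketch of the difference, assuming the same subdev variable that is already in scope in nv50_disp_chan_mthd:

	/* "%p" prints a hashed value, so NULL can look like a real address */
	nvkm_warn(subdev, "mthd (hashed): %p\n", mthd);

	/* "%px" (or a cast plus "%lx") prints the raw, unhashed value */
	nvkm_warn(subdev, "mthd (raw): %px\n", mthd);
	nvkm_warn(subdev, "mthd (raw): %lx\n", (unsigned long)mthd);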
A rather simple and temporary fix we found is to add:

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
index bcf32d92ee5a..50e3539f33d2 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/disp/channv50.c
@@ -74,6 +74,8 @@ nv50_disp_chan_mthd(struct nv50_disp_chan *chan, int debug)
 	if (debug > subdev->debug)
 		return;
+	if (!mthd)
+		return;
 	for (i = 0; (list = mthd->data[i].mthd) != NULL; i++) {
 		u32 base = chan->head * mthd->addr;

With that, it remains stable.
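If it is useful while the root cause is still unknown, a slightly noisier variant of the same guard would keep a one-time trace in dmesg instead of returning silently. Sketch only, same spot in nv50_disp_chan_mthd; WARN_ON_ONCE is the generic kernel helper, nothing nouveau-specific:

	/* no method table registered for this channel: bail out, but
	 * warn once so the unexpected case stays visible in the logs */
	if (WARN_ON_ONCE(!mthd))
		return;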
One last piece of information: each time I try to reproduce the freeze, thanks to the fix I now see a second message in the kernel log:

[ 814.207723] nouveau 0000:26:00.0: disp: chid 73 stat 00001080 reason 1 [PUSHBUFFER_ERR] mthd 0200 data badf5040 code badf5040
[ 814.207749] nouveau 0000:26:00.0: bus: MMIO read of 00000000 FAULT at 611390 [ IBUS ]

It is always these same two lines, repeated together.