Bug 206225

Summary: nouveau: Screen distortion and lockup on resume
Product: Drivers
Reporter: Christoph Marz (derchiller-foren)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: high
CC: imirkin
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 5.4.12
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
5.3.9 nouveau resume bug
5.3.9 nouveau resume ok
5.4.12: Syslog excerpt: Resume after hibernation
5.4.12: Syslog excerpt: Resume after suspend
5.4.11: Syslog excerpt: Resume after hibernation: No error

Description Christoph Marz 2020-01-16 16:03:47 UTC
When starting suspend or hibernate, it takes approx. 2 mins until the system actually begins to write RAM contents to disk (when hibernating), although the screen is switched off immediately.

When resuming, video is completely distorted. Sometimes I am able to restart Gnome via Alt+F2 or to switch to a VT, but sometimes the system doesn't react at all.

Syslog contains nouveau errors:

kernel: [10576.555245] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 0 MP 0: 00000010 [INVALID_OPCODE] at 07fe80 warp 0, opcode f6bfffbf ffffffff
kernel: [10576.555266] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 0 MP 1: 00000010 [INVALID_OPCODE] at 07fec0 warp 1, opcode fffffffe ffffffff
kernel: [10576.555293] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 1 MP 0: 00000010 [INVALID_OPCODE] at 07f540 warp 0, opcode ffffffff ffffdfff
kernel: [10576.555310] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 1 MP 1: 00000010 [INVALID_OPCODE] at 07f540 warp 0, opcode ffffffff ffffdfff
kernel: [10576.555315] nouveau 0000:01:00.0: gr: 00200000 [] ch 3 [003f8a4000 Xorg[717]] subc 3 class 8297 mthd 15e0 data 00000000
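
The trap lines above share a fixed layout, so they can be tallied mechanically when comparing logs from different kernels. The following is a minimal Python sketch, not part of the original report. The regular expression is inferred only from the lines quoted here and may not cover other nouveau messages; the names TRAP_RE and parse_trap are hypothetical.

```python
import re

# Layout inferred from the TRAP_MP_EXEC lines quoted in this report
# (an assumption, not a documented log format).
TRAP_RE = re.compile(
    r"gr: TRAP_MP_EXEC - TP (?P<tp>\d+) MP (?P<mp>\d+): "
    r"(?P<status>[0-9a-f]{8}) \[(?P<reason>[A-Z_]*)\] "
    r"at (?P<addr>[0-9a-f]+) warp (?P<warp>\d+), "
    r"opcode (?P<word0>[0-9a-f]{8}) (?P<word1>[0-9a-f]{8})"
)


def parse_trap(line):
    """Return the fields of one TRAP_MP_EXEC line, or None if it does not match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    return {
        "tp": int(m.group("tp")),
        "mp": int(m.group("mp")),
        "status": m.group("status"),
        "reason": m.group("reason"),
        "addr": int(m.group("addr"), 16),
        "warp": int(m.group("warp")),
        # The two opcode words are kept in the order they are printed.
        "opcode": (m.group("word0"), m.group("word1")),
    }
```

Applied to the first line above, parse_trap reports TP 0, MP 0, reason INVALID_OPCODE and address 0x07fe80. The final "gr: 00200000 []" line does not match and yields None.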

On last resume from hibernate, it additionally contained a call trace associated with nouveau:

kernel: [ 9985.949290] Trying to vfree() bad address (00000000f5be47e6)
kernel: [ 9985.949282] ------------[ cut here ]------------
kernel: [ 9985.949313] WARNING: CPU: 0 PID: 824 at mm/vmalloc.c:2234 __vunmap+0x1e6/0x210
kernel: [ 9985.949314] Modules linked in: nls_ascii(E) nls_cp437(E) vfat(E) fat(E) uas(E) usb_storage(E) ctr(E) ccm(E) rfcomm(E) cmac(E) bnep(E) iTCO_wdt(E) iTCO_vendor_support(E) watchdog(E) fuse(E) btusb(E) btrtl(E) btbcm(E) iwlmvm(E) acer_wmi(E) sparse_keymap(E) btintel(E) mac80211(E) libarc4(E) iwlwifi(E) wmi_bmof(E) mxm_wmi(E) hid_multitouch(E) i2c_i801(E) uvcvideo(E) videobuf2_vmalloc(E) videobuf2_memops(E) videobuf2_v4l2(E) bluetooth(E) snd_hda_codec_hdmi(E) sr_mod(E) cdrom(E) videodev(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_intel(E) videobuf2_common(E) snd_intel_nhlt(E) snd_hda_codec(E) lpc_ich(E) mfd_core(E) drbg(E) snd_hwdep(E) snd_hda_core(E) ansi_cprng(E) snd_pcm(E) ecdh_generic(E) ecc(E) jmb38x_ms(E) xhci_pci(E) sdhci_pci(E) snd_timer(E) cqhci(E) cfg80211(E) memstick(E) sdhci(E) rfkill(E) snd(E) ehci_pci(E) xhci_hcd(E) soundcore(E) mmc_core(E) iosf_mbi(E) acpi_cpufreq(E) binfmt_misc(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc32c_generic(E)
kernel: [ 9985.949363]  crc16(E) mbcache(E) jbd2(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) libata(E) serio_raw(E) scsi_mod(E) nouveau(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) ttm(E) wmi(E) evdev(E)
kernel: [ 9985.949381] CPU: 0 PID: 824 Comm: gnome-shell Tainted: G        W   E     5.4.12 #1
kernel: [ 9985.949383] Hardware name: Acer, inc. Aspire 7730G     /Mammoth          , BIOS v0.3636 03/10/2009
kernel: [ 9985.949386] RIP: 0010:__vunmap+0x1e6/0x210
kernel: [ 9985.949389] Code: 41 5d 41 5e e9 9b 58 02 00 31 d2 31 f6 48 c7 c7 ff ff ff ff e8 eb fc ff ff eb b5 48 89 fe 48 c7 c7 88 50 97 ab e8 c8 39 e7 ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e c3 4c 89 e6 48 c7 c7 b0 50 97 ab e8
kernel: [ 9985.949391] RSP: 0018:ffffb528033ebc08 EFLAGS: 00010286
kernel: [ 9985.949394] RAX: 0000000000000000 RBX: ffff9f3771eb2180 RCX: 0000000000000006
kernel: [ 9985.949396] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9f377ba16540
kernel: [ 9985.949398] RBP: 0000000000000720 R08: ffffb528033ebabd R09: 00000000000004f1
kernel: [ 9985.949400] R10: 0000000000000008 R11: ffffb528033ebabd R12: ffff9f3771f71720
kernel: [ 9985.949401] R13: 0000091508ee4d8d R14: 0000000000000000 R15: 00000000000000ff
kernel: [ 9985.949404] FS:  0000000000000000(0000) GS:ffff9f377ba00000(0000) knlGS:0000000000000000
kernel: [ 9985.949406] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 9985.949408] CR2: 000055f8c838e020 CR3: 000000010c1c6000 CR4: 00000000000406f0
kernel: [ 9985.949410] Call Trace:
kernel: [ 9985.949489]  nvkm_umem_unmap+0x49/0x60 [nouveau]
kernel: [ 9985.949521]  nvkm_object_dtor+0x99/0x100 [nouveau]
kernel: [ 9985.949550]  nvkm_object_del+0x20/0xa0 [nouveau]
kernel: [ 9985.949578]  nvkm_ioctl_del+0x37/0x50 [nouveau]
kernel: [ 9985.949606]  nvkm_ioctl+0xdf/0x180 [nouveau]
kernel: [ 9985.949635]  nvif_object_fini+0x59/0x80 [nouveau]
kernel: [ 9985.949669]  nouveau_mem_fini+0x53/0x70 [nouveau]
kernel: [ 9985.949705]  nouveau_mem_del+0x11/0x30 [nouveau]
kernel: [ 9985.949711]  ttm_bo_put+0x26e/0x2d0 [ttm]
kernel: [ 9985.949746]  nouveau_gem_object_del+0x51/0x80 [nouveau]
kernel: [ 9985.949750]  drm_gem_object_release_handle+0x70/0x90
kernel: [ 9985.949753]  ? drm_gem_object_handle_put_unlocked+0xa0/0xa0
kernel: [ 9985.949757]  idr_for_each+0x5e/0xd0
kernel: [ 9985.949761]  drm_gem_release+0x1c/0x30
kernel: [ 9985.949763]  drm_file_free.part.0+0x230/0x280
kernel: [ 9985.949766]  drm_release+0xa7/0xe0
kernel: [ 9985.949769]  __fput+0xb9/0x250
kernel: [ 9985.949774]  task_work_run+0x89/0xa0
kernel: [ 9985.949777]  exit_to_usermode_loop+0xb6/0xc0
kernel: [ 9985.949780]  do_syscall_64+0x13f/0x150
kernel: [ 9985.949784]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: [ 9985.949787] RIP: 0033:0x7fd77d831090
kernel: [ 9985.949794] Code: Bad RIP value.
kernel: [ 9985.949796] RSP: 002b:00007ffff3ade1d0 EFLAGS: 00000200 ORIG_RAX: 000000000000003b
kernel: [ 9985.949798] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 9985.949800] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
kernel: [ 9985.949802] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
kernel: [ 9985.949803] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
kernel: [ 9985.949805] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 9985.949808] ---[ end trace 542952d6d128998b ]---
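
To see at a glance which modules the call trace above passes through, the frame lines can be split into symbol, offset and module. This is a minimal Python sketch under the assumption that frames look like the ones quoted here ("symbol+0xoff/0xsize [module]", optionally prefixed with "? " for entries the unwinder is unsure about). FRAME_RE, parse_frame and frames_per_module are hypothetical names.

```python
import re
from collections import Counter

# Frame layout inferred from the trace quoted in this report (an assumption).
FRAME_RE = re.compile(
    r"\]\s+(?P<unreliable>\? )?(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def parse_frame(line):
    """Return one stack frame as a dict, or None for non-frame lines."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),  # None when the line has no [module] tag
        "unreliable": m.group("unreliable") is not None,
    }


def frames_per_module(lines):
    """Count frames per module; untagged frames are grouped together."""
    counts = Counter()
    for line in lines:
        frame = parse_frame(line)
        if frame is not None:
            counts[frame["module"] or "(no module tag)"] += 1
    return counts
```

Register dumps and the "RIP:" lines do not match the pattern and are skipped, so the whole excerpt can be passed to frames_per_module line by line.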

I already encountered that issue on 5.3.9 and it vanished after installing Debian package 'firmware-misc-nonfree', but now with 5.4.12, it is back again, while 5.4.11 was ok.

I should mention that even with 5.4.12, this only happens after having worked with the system for a while, not when suspending/hibernating immediately after startup.
Comment 1 Christoph Marz 2020-01-16 16:13:04 UTC
Created attachment 286845 [details]
5.3.9 nouveau resume bug

dmesg output after resume from hibernation before installing 'firmware-misc-nonfree'
Comment 2 Christoph Marz 2020-01-16 16:15:41 UTC
Created attachment 286847 [details]
5.3.9 nouveau resume ok

dmesg output after resume from hibernation right after installing 'firmware-misc-nonfree'
Comment 3 Christoph Marz 2020-01-16 16:20:35 UTC
Created attachment 286849 [details]
5.4.12: Syslog excerpt: Resume after hibernation

nouveau error messages and call trace; I was able to switch to a VT
Comment 4 Christoph Marz 2020-01-16 16:22:52 UTC
Created attachment 286851 [details]
5.4.12: Syslog excerpt: Resume after suspend

nouveau error messages, no call trace; I was NOT able to switch to a VT
Comment 5 Christoph Marz 2020-01-18 08:26:04 UTC
Created attachment 286875 [details]
5.4.11: Syslog excerpt: Resume after hibernation: No error
Comment 6 Ilia Mirkin 2020-01-18 17:30:12 UTC
I see you have nouveau.config=PCRYPT=0 in your kernel config. Why did you add this -- was there some kind of issue with the engine? Did someone in #nouveau tell you to do it to help some issue? It's normally used for copy acceleration on G96 (which would, in turn, be used to copy off any vram data to ram on suspend).

The reason I ask is that starting with kernel 4.3, that will no longer have the effect of disabling PCRYPT. The new config to achieve that would be nouveau.config=cipher=0.

Note that for G96, I don't think anything in firmware-misc-nonfree would affect it either way.
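
A quick way to check which of the two spellings a system is actually booted with is to look at the kernel command line. The following is a minimal Python sketch, assuming the option appears as a single nouveau.config=... token and that multiple settings inside it are comma-separated; both function names are hypothetical.

```python
def nouveau_config_options(cmdline):
    """Collect key=value settings from nouveau.config=... tokens.

    Assumes settings inside one token are separated by commas.
    """
    prefix = "nouveau.config="
    opts = {}
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        for item in token[len(prefix):].split(","):
            if "=" in item:
                key, value = item.split("=", 1)
                opts[key] = value
    return opts


def stale_pcrypt_option(opts):
    """True if PCRYPT is set but cipher is not.

    Per the comment above, PCRYPT no longer has an effect from kernel 4.3 on.
    """
    return "PCRYPT" in opts and "cipher" not in opts
```

Passing the contents of /proc/cmdline to nouveau_config_options and the result to stale_pcrypt_option flags a setup that still relies on the old PCRYPT=0 spelling.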
Comment 7 Christoph Marz 2020-01-18 19:20:28 UTC
(In reply to Ilia Mirkin from comment #6)
> I see you have nouveau.config=PCRYPT=0 in your kernel config. Why did you
> add this -- was there some kind of issue with the engine? Did someone in
> #nouveau tell you to do it to help some issue?

Hello Ilia,

I had found a bug report (https://bugs.freedesktop.org/show_bug.cgi?id=58378) dealing with a similar issue. There you suggested trying that option (https://bugs.freedesktop.org/show_bug.cgi?id=58378#c46), and it seemingly solved the issue for that reporter, so I gave it a try, but I removed it after I noticed that it had no effect at all.

> It's normally used for copy
> acceleration on G96 (which would, in turn, be used to copy off any vram data
> to ram on suspend).
> 
> The reason I ask is that starting with kernel 4.3, that will no longer have
> the effect of disabling PCRYPT. The new config to achieve that would be
> nouveau.config=cipher=0.

Ok, thanks for the clarification. Copy acceleration sounds good. Is there any downside?
 
> Note that for G96, I don't think anything in firmware-misc-nonfree would
> affect it either way.

I will uninstall that package and report back.

BTW: No problems with 5.4.13 so far.
Comment 8 Ilia Mirkin 2020-01-18 19:36:42 UTC
Well, the problem the other users were having is that their GPUs were actually missing the crypt engine entirely, and we were not properly reading the capabilities bits that indicated this. Trying to use the crypt engine when it's not actually there has some obvious downsides :) But I don't see an indication that this would be the case on your setup. (First of all, we now respect the capability bit, and secondly, you don't have any mmio read/write errors in that range.)
Comment 9 Christoph Marz 2020-01-19 14:50:11 UTC
Well, then this might sound strange:

I purged firmware-misc-nonfree, rebooted, sent the system to sleep and resumed, and the distortion was back.

Instead of reinstalling it, I set nouveau.config=cipher=0 and tested again, and everything is fine. Furthermore, now I can use the firmware for Video Acceleration. Before, I always had distortion after resume with that firmware installed.

So everything seems fine now, but is there any downside to disabling the crypt engine?
Comment 10 Ilia Mirkin 2020-01-19 19:37:08 UTC
Sounds like there are things going on that we don't quite understand then... maybe Ben can weigh in.

If the cipher method is disabled (aka CRYPT), it will fall back to M2MF for copy acceleration. In experiments, this is slightly slower but still accelerated.
Comment 11 Christoph Marz 2020-01-19 20:42:23 UTC
Ok. However, thank you for telling me the right option for disabling the crypt engine on current kernels.

If you need any logs or want me to do certain tests, let me know.
Comment 12 Christoph Marz 2020-02-06 17:55:46 UTC
Follow-up:

After a dist-upgrade, the error returned. I deleted the video acceleration firmware and it was ok again.

When I installed 5.4.14, there were warnings about possibly missing firmware (the nvidia files from firmware-misc-nonfree), so I reinstalled that package and updated the initramfs (I think I missed that step after purging the package). Furthermore, I removed nouveau.config=cipher=0 since that doesn't seem to be related to the error.

To conclude: when it works, a dist-upgrade one day brings the error back, and another dist-upgrade a few days later makes it work again. The same holds for kernel upgrades.