Summary: nouveau: Screen distortion and lockup on resume
Product: Drivers
Reporter: Christoph Marz (derchiller-foren)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
5.3.9 nouveau resume bug
5.3.9 nouveau resume ok
5.4.12: Syslog excerpt: Resume after hibernation
5.4.12: Syslog excerpt: Resume after suspend
5.4.11: Syslog excerpt: Resume after hibernation: No error
Description Christoph Marz 2020-01-16 16:03:47 UTC
When starting suspend or hibernate, it takes approx. 2 mins until the system actually begins to write RAM contents to disk (when hibernating), although the screen is switched off immediately. When resuming, video is completely distorted. Sometimes I am able to restart Gnome via Alt+F2 or to switch to a VT, but sometimes the system doesn't react at all.

Syslog contains nouveau errors:

kernel: [10576.555245] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 0 MP 0: 00000010 [INVALID_OPCODE] at 07fe80 warp 0, opcode f6bfffbf ffffffff
kernel: [10576.555266] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 0 MP 1: 00000010 [INVALID_OPCODE] at 07fec0 warp 1, opcode fffffffe ffffffff
kernel: [10576.555293] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 1 MP 0: 00000010 [INVALID_OPCODE] at 07f540 warp 0, opcode ffffffff ffffdfff
kernel: [10576.555310] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 1 MP 1: 00000010 [INVALID_OPCODE] at 07f540 warp 0, opcode ffffffff ffffdfff
kernel: [10576.555315] nouveau 0000:01:00.0: gr: 00200000  ch 3 [003f8a4000 Xorg] subc 3 class 8297 mthd 15e0 data 00000000

On last resume from hibernate, it additionally contained a call trace associated with nouveau:

kernel: [ 9985.949290] Trying to vfree() bad address (00000000f5be47e6)
kernel: [ 9985.949282] ------------[ cut here ]------------
kernel: [ 9985.949313] WARNING: CPU: 0 PID: 824 at mm/vmalloc.c:2234 __vunmap+0x1e6/0x210
kernel: [ 9985.949314] Modules linked in: nls_ascii(E) nls_cp437(E) vfat(E) fat(E) uas(E) usb_storage(E) ctr(E) ccm(E) rfcomm(E) cmac(E) bnep(E) iTCO_wdt(E) iTCO_vendor_support(E) watchdog(E) fuse(E) btusb(E) btrtl(E) btbcm(E) iwlmvm(E) acer_wmi(E) sparse_keymap(E) btintel(E) mac80211(E) libarc4(E) iwlwifi(E) wmi_bmof(E) mxm_wmi(E) hid_multitouch(E) i2c_i801(E) uvcvideo(E) videobuf2_vmalloc(E) videobuf2_memops(E) videobuf2_v4l2(E) bluetooth(E) snd_hda_codec_hdmi(E) sr_mod(E) cdrom(E) videodev(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_intel(E) videobuf2_common(E) snd_intel_nhlt(E) snd_hda_codec(E) lpc_ich(E) mfd_core(E) drbg(E) snd_hwdep(E) snd_hda_core(E) ansi_cprng(E) snd_pcm(E) ecdh_generic(E) ecc(E) jmb38x_ms(E) xhci_pci(E) sdhci_pci(E) snd_timer(E) cqhci(E) cfg80211(E) memstick(E) sdhci(E) rfkill(E) snd(E) ehci_pci(E) xhci_hcd(E) soundcore(E) mmc_core(E) iosf_mbi(E) acpi_cpufreq(E) binfmt_misc(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc32c_generic(E)
kernel: [ 9985.949363] crc16(E) mbcache(E) jbd2(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) libata(E) serio_raw(E) scsi_mod(E) nouveau(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) ttm(E) wmi(E) evdev(E)
kernel: [ 9985.949381] CPU: 0 PID: 824 Comm: gnome-shell Tainted: G W E 5.4.12 #1
kernel: [ 9985.949383] Hardware name: Acer, inc. Aspire 7730G /Mammoth , BIOS v0.3636 03/10/2009
kernel: [ 9985.949386] RIP: 0010:__vunmap+0x1e6/0x210
kernel: [ 9985.949389] Code: 41 5d 41 5e e9 9b 58 02 00 31 d2 31 f6 48 c7 c7 ff ff ff ff e8 eb fc ff ff eb b5 48 89 fe 48 c7 c7 88 50 97 ab e8 c8 39 e7 ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e c3 4c 89 e6 48 c7 c7 b0 50 97 ab e8
kernel: [ 9985.949391] RSP: 0018:ffffb528033ebc08 EFLAGS: 00010286
kernel: [ 9985.949394] RAX: 0000000000000000 RBX: ffff9f3771eb2180 RCX: 0000000000000006
kernel: [ 9985.949396] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9f377ba16540
kernel: [ 9985.949398] RBP: 0000000000000720 R08: ffffb528033ebabd R09: 00000000000004f1
kernel: [ 9985.949400] R10: 0000000000000008 R11: ffffb528033ebabd R12: ffff9f3771f71720
kernel: [ 9985.949401] R13: 0000091508ee4d8d R14: 0000000000000000 R15: 00000000000000ff
kernel: [ 9985.949404] FS: 0000000000000000(0000) GS:ffff9f377ba00000(0000) knlGS:0000000000000000
kernel: [ 9985.949406] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 9985.949408] CR2: 000055f8c838e020 CR3: 000000010c1c6000 CR4: 00000000000406f0
kernel: [ 9985.949410] Call Trace:
kernel: [ 9985.949489] nvkm_umem_unmap+0x49/0x60 [nouveau]
kernel: [ 9985.949521] nvkm_object_dtor+0x99/0x100 [nouveau]
kernel: [ 9985.949550] nvkm_object_del+0x20/0xa0 [nouveau]
kernel: [ 9985.949578] nvkm_ioctl_del+0x37/0x50 [nouveau]
kernel: [ 9985.949606] nvkm_ioctl+0xdf/0x180 [nouveau]
kernel: [ 9985.949635] nvif_object_fini+0x59/0x80 [nouveau]
kernel: [ 9985.949669] nouveau_mem_fini+0x53/0x70 [nouveau]
kernel: [ 9985.949705] nouveau_mem_del+0x11/0x30 [nouveau]
kernel: [ 9985.949711] ttm_bo_put+0x26e/0x2d0 [ttm]
kernel: [ 9985.949746] nouveau_gem_object_del+0x51/0x80 [nouveau]
kernel: [ 9985.949750] drm_gem_object_release_handle+0x70/0x90
kernel: [ 9985.949753] ? drm_gem_object_handle_put_unlocked+0xa0/0xa0
kernel: [ 9985.949757] idr_for_each+0x5e/0xd0
kernel: [ 9985.949761] drm_gem_release+0x1c/0x30
kernel: [ 9985.949763] drm_file_free.part.0+0x230/0x280
kernel: [ 9985.949766] drm_release+0xa7/0xe0
kernel: [ 9985.949769] __fput+0xb9/0x250
kernel: [ 9985.949774] task_work_run+0x89/0xa0
kernel: [ 9985.949777] exit_to_usermode_loop+0xb6/0xc0
kernel: [ 9985.949780] do_syscall_64+0x13f/0x150
kernel: [ 9985.949784] entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: [ 9985.949787] RIP: 0033:0x7fd77d831090
kernel: [ 9985.949794] Code: Bad RIP value.
kernel: [ 9985.949796] RSP: 002b:00007ffff3ade1d0 EFLAGS: 00000200 ORIG_RAX: 000000000000003b
kernel: [ 9985.949798] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 9985.949800] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
kernel: [ 9985.949802] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
kernel: [ 9985.949803] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
kernel: [ 9985.949805] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 9985.949808] ---[ end trace 542952d6d128998b ]---

I already encountered that issue on 5.3.9 and it vanished after installing Debian package 'firmware-misc-nonfree', but now with 5.4.12, it is back again, while 5.4.11 was ok.
I should mention that even with 5.4.12, this only happens after having worked with the system for a while, not when suspending/hibernating immediately after startup.
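For anyone comparing the attached syslog excerpts across kernel versions, the TRAP_MP_EXEC lines can be pulled out mechanically. The following is only a rough Python sketch: the regular expression is inferred from the log lines quoted in this report, not from the nouveau source, and the names `TRAP_RE` and `parse_traps` are made up for illustration.

```python
import re

# Pattern inferred from the TRAP_MP_EXEC lines quoted in this report
# (an assumption; other nouveau versions may format these lines differently).
TRAP_RE = re.compile(
    r"gr: TRAP_MP_EXEC - TP (?P<tp>\d+) MP (?P<mp>\d+): "
    r"(?P<status>[0-9a-f]+) \[(?P<reason>\w+)\] at (?P<addr>[0-9a-f]+) "
    r"warp (?P<warp>\d+), opcode (?P<opcode>[0-9a-f]+ [0-9a-f]+)"
)


def parse_traps(log_text):
    """Return one dict per TRAP_MP_EXEC line found in log_text (hypothetical helper)."""
    return [match.groupdict() for match in TRAP_RE.finditer(log_text)]


# Two lines copied from the description above.
SAMPLE = (
    "kernel: [10576.555245] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 0 MP 0: "
    "00000010 [INVALID_OPCODE] at 07fe80 warp 0, opcode f6bfffbf ffffffff\n"
    "kernel: [10576.555266] nouveau 0000:01:00.0: gr: TRAP_MP_EXEC - TP 0 MP 1: "
    "00000010 [INVALID_OPCODE] at 07fec0 warp 1, opcode fffffffe ffffffff\n"
)

traps = parse_traps(SAMPLE)
```

Running the same function over the 5.4.11 and 5.4.12 excerpts would make it easy to confirm that the traps appear only in the failing logs.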
Comment 1 Christoph Marz 2020-01-16 16:13:04 UTC
Created attachment 286845: 5.3.9 nouveau resume bug
dmesg output after resume from hibernation before installing 'firmware-misc-nonfree'
Comment 2 Christoph Marz 2020-01-16 16:15:41 UTC
Created attachment 286847: 5.3.9 nouveau resume ok
dmesg output after resume from hibernation right after installing 'firmware-misc-nonfree'
Comment 3 Christoph Marz 2020-01-16 16:20:35 UTC
Created attachment 286849: 5.4.12: Syslog excerpt: Resume after hibernation
nouveau error messages and call trace; I was able to switch to a VT
Comment 4 Christoph Marz 2020-01-16 16:22:52 UTC
Created attachment 286851: 5.4.12: Syslog excerpt: Resume after suspend
nouveau error messages, no call trace; I was NOT able to switch to a VT
Comment 5 Christoph Marz 2020-01-18 08:26:04 UTC
Created attachment 286875: 5.4.11: Syslog excerpt: Resume after hibernation: No error
Comment 6 Ilia Mirkin 2020-01-18 17:30:12 UTC
I see you have nouveau.config=PCRYPT=0 in your kernel config. Why did you add this -- was there some kind of issue with the engine? Did someone in #nouveau tell you to do it to help some issue? It's normally used for copy acceleration on G96 (which would, in turn, be used to copy off any vram data to ram on suspend). The reason I ask is that starting with kernel 4.3, that will no longer have the effect of disabling PCRYPT. The new config to achieve that would be nouveau.config=cipher=0. Note that for G96, I don't think anything in firmware-misc-nonfree would affect it either way.
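To see which nouveau.config options are actually present on a kernel command line, something like the following Python sketch could be used. It assumes that multiple options are comma-separated inside a single nouveau.config= token, and it treats PCRYPT as ineffective purely on the basis of the statement above about kernels from 4.3 onward. The function names and the `NO_LONGER_EFFECTIVE` set are hypothetical.

```python
PREFIX = "nouveau.config="

# Keys that, per the comment above, no longer have an effect on kernels >= 4.3.
# This set is an illustration based on this thread only, not an authoritative list.
NO_LONGER_EFFECTIVE = {"PCRYPT"}


def nouveau_config_options(cmdline):
    """Collect key/value pairs from every nouveau.config= token on a command line."""
    options = {}
    for token in cmdline.split():
        if not token.startswith(PREFIX):
            continue
        # Assumption: several options may be given as key=value,key=value.
        for pair in token[len(PREFIX):].split(","):
            key, _, value = pair.partition("=")
            if key:
                options[key] = value
    return options


def ineffective_keys(options):
    """Return the configured keys that this thread says are now ignored."""
    return sorted(key for key in options if key in NO_LONGER_EFFECTIVE)


example = "BOOT_IMAGE=/vmlinuz-5.4.12 root=/dev/sda1 ro quiet nouveau.config=PCRYPT=0"
found = nouveau_config_options(example)
stale = ineffective_keys(found)
```

On a running Linux system the input string would typically come from reading /proc/cmdline; the example string here is invented for the sketch.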
Comment 7 Christoph Marz 2020-01-18 19:20:28 UTC
(In reply to Ilia Mirkin from comment #6)
> I see you have nouveau.config=PCRYPT=0 in your kernel config. Why did you
> add this -- was there some kind of issue with the engine? Did someone in
> #nouveau tell you to do it to help some issue?

Hello Ilia,

I had found a bug report (https://bugs.freedesktop.org/show_bug.cgi?id=58378) dealing with a similar issue, and there you suggested trying that option (https://bugs.freedesktop.org/show_bug.cgi?id=58378#c46), and it seemingly solved the issue, so I gave it a try, but removed it after I noticed that it had no effect at all.

> It's normally used for copy
> acceleration on G96 (which would, in turn, be used to copy off any vram data
> to ram on suspend).
>
> The reason I ask is that starting with kernel 4.3, that will no longer have
> the effect of disabling PCRYPT. The new config to achieve that would be
> nouveau.config=cipher=0.

Ok, thanks for the clarification. Copy acceleration sounds good; is there any downside?

> Note that for G96, I don't think anything in firmware-misc-nonfree would
> affect it either way.

I will uninstall that package and report back.

BTW: No problems with 5.4.13 so far.
Comment 8 Ilia Mirkin 2020-01-18 19:36:42 UTC
Well, the problem the other users were having is that their GPUs were actually missing the crypt engine entirely, and we were not properly reading the capabilities bits that indicated this. Trying to use the crypt engine when it's not actually there has some obvious downsides :) But I don't see an indication that this would be the case on your setup. (First of all, we now respect the capability bit, and secondly, you don't have any mmio read/write errors in that range.)
Comment 9 Christoph Marz 2020-01-19 14:50:11 UTC
Well, then this might sound strange: I purged firmware-misc-nonfree, rebooted, sent the system to sleep and resumed, and the distortion was back. Instead of reinstalling the package, I set nouveau.config=cipher=0 and tested again, and everything is fine. Furthermore, I can now use the firmware for video acceleration. Before, I always had distortion after resume with that firmware installed. So everything seems fine now, but is there any downside to disabling the crypt engine?
Comment 10 Ilia Mirkin 2020-01-19 19:37:08 UTC
Sounds like there are things going on that we don't quite understand then... maybe Ben can weigh in. If the cipher method is disabled (aka CRYPT), it will fall back to M2MF for copy acceleration. In experiments, this is slightly slower but still accelerated.
Comment 11 Christoph Marz 2020-01-19 20:42:23 UTC
Ok. However, thank you for telling me the right option for disabling the crypt engine on current kernels. If you need any logs or want me to do certain tests, let me know.
Comment 12 Christoph Marz 2020-02-06 17:55:46 UTC
Follow-up: After a dist-upgrade, the error returned. I deleted the video acceleration firmware and it was ok again. When I installed 5.4.14, there were warnings about possibly missing firmware (the nvidia files from firmware-misc-nonfree), so I reinstalled that package and updated the initramfs (I think I missed that step after purging the package). Furthermore, I removed nouveau.config=cipher=0 since that doesn't seem to be related to the error.

To conclude: When it works, I do a dist-upgrade one day and the error returns. Doing another dist-upgrade a few days later makes it work again. The same holds for kernel upgrades.