Created attachment 261283 [details]
Full dmesg

Hi, my X server just stopped responding, and ssh from another host revealed that the nouveau driver had triggered a kernel oops in queue_work_on. The system has been running this kernel for less than a week; no similar crash has ever happened on it before.

[605148.177397] BUG: unable to handle kernel paging request at 00000000000101f8
[605148.180006] IP: queue_work_on+0x12/0x40
[605148.186405] PGD 0 P4D 0
[605148.186405] Oops: 0002 [#1] PREEMPT SMP
[605148.186405] Modules linked in: netlink_diag vmnet(OE) parport_pc vmw_vsock_vmci_transport vmw_vmci vmmon(OE) loop tcp_diag inet_diag nfsv3 ppdev parport xt_REDIRECT nf_nat_redirect arc4 md4 nls_iso8859_2 deflate cifs twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic ccm rpcsec_gss_krb5 nfsv4 fuse nfs fscache vsock camellia_aesni_avx_x86_64 camellia_x86_64 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic lrw blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic cast_common ablk_helper des_generic cmac xcbc rmd160 af_key xfrm_algo binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel coretemp hwmon
[605148.234373]  snd_hda_codec x86_pkg_temp_thermal crct10dif_pclmul snd_hwdep crc32_pclmul crc32c_intel uas ghash_clmulni_intel snd_hda_core dcdbas pcbc snd_pcm_oss e1000e usb_storage aesni_intel snd_mixer_oss sr_mod snd_pcm aes_x86_64 crypto_simd glue_helper input_leds cryptd snd_timer cdrom mei_me i2c_i801 ptp tpm_tis sg snd mei pps_core tpm_tis_core tpm iTCO_wdt iTCO_vendor_support lpc_ich mfd_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables ipv6 crc_ccitt autofs4 [last unloaded: vmnet]
[605148.234373] CPU: 0 PID: 8021 Comm: Xorg Tainted: G OE 4.15.0-rc3-64-00039-g59217b88e0e1 #7
[605148.234373] Hardware name: Dell Inc. Precision T3610/09M8Y8, BIOS A12 09/09/2016
[605148.234373] RIP: 0010:queue_work_on+0x12/0x40
[605148.234373] RSP: 0018:ffffc90001f53cb0 EFLAGS: 00010002
[605148.234373] RAX: 0000000000000202 RBX: 0000000000000202 RCX: 0000000000000000
[605148.234373] RDX: 00000000000101f8 RSI: ffff88106e40c000 RDI: 0000000000000020
[605148.234373] RBP: ffff88012ba31140 R08: 0000000000000000 R09: 0000000000000000
[605148.234373] R10: ffff880be3fe3f30 R11: ffff88102716ab20 R12: ffff8810264f00a0
[605148.234373] R13: ffff880df5af3600 R14: ffff881020c95e38 R15: ffffffff81a9b978
[605148.234373] FS:  00007f701e290a40(0000) GS:ffff88106e800000(0000) knlGS:0000000000000000
[605148.234373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[605148.234373] CR2: 00000000000101f8 CR3: 0000000ff7c41005 CR4: 00000000001606f0
[605148.234373] Call Trace:
[605148.234373]  nouveau_gem_object_close+0x162/0x1e0
[605148.234373]  drm_gem_object_release_handle+0x27/0x80
[605148.234373]  drm_gem_handle_delete+0x55/0x80
[605148.234373]  ? drm_gem_handle_create+0x30/0x30
[605148.234373]  drm_ioctl_kernel+0x51/0xa0
[605148.234373]  drm_ioctl+0x2cd/0x370
[605148.234373]  ? drm_gem_handle_create+0x30/0x30
[605148.234373]  nouveau_drm_ioctl+0x5c/0xb0
[605148.234373]  do_vfs_ioctl+0x96/0x610
[605148.234373]  ? __fget+0x62/0xa0
[605148.234373]  ? __fget+0x62/0xa0
[605148.234373]  SyS_ioctl+0x6f/0x80
[605148.234373]  entry_SYSCALL_64_fastpath+0x1e/0x81
[605148.234373] RIP: 0033:0x7f701b75e5d7
[605148.234373] RSP: 002b:00007ffc01902ad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[605148.234373] RAX: ffffffffffffffda RBX: 0000563abec8f1b0 RCX: 00007f701b75e5d7
[605148.234373] RDX: 00007ffc01902b10 RSI: 0000000040086409 RDI: 000000000000000b
[605148.234373] RBP: 00007ffc01902b10 R08: 0000000000000002 R09: 0000000000000001
[605148.234373] R10: 0000563abe7257e0 R11: 0000000000000246 R12: 0000000040086409
[605148.234373] R13: 000000000000000b R14: 0000563abe72b890 R15: 0000563abec8f1b0
[605148.234373] Code: b0 34 c8 81 e8 55 92 03 00 c6 05 01 bd 2e 01 01 e9 68 ff ff ff 0f 1f 40 00 53 9c 58 0f 1f 44 00 00 48 89 c3 fa 66 0f 1f 44 00 00 <f0> 0f ba 2a 00 73 10 31 c9 48 89 df 57 9d 0f 1f 44 00 00 89 c8
[605148.234373] RIP: queue_work_on+0x12/0x40 RSP: ffffc90001f53cb0
[605148.234373] CR2: 00000000000101f8
[605148.234373] ---[ end trace 537ebb0d3ac8419d ]---
There's a pretty big memory leak in the GEM object handler right now, fixed by https://github.com/skeggsb/linux/commit/4ef928929987c19fff4d3c1650f139560ba1cc13. It doesn't seem identical to your issue, but perhaps the leak triggers this as a secondary effect.
I've built a kernel with that change and will give it some time, but I don't think it is the culprit. I believe there is a race in how nouveau handles scheduled work:

The work being scheduled has its func callback set to nouveau_gem_object_delete_work(), which does its job and then frees the work. Now, nouveau_cli_work_queue() sets some work properties, adds the work to the cli->worker queue, and schedules work->cli->work. But nouveau_cli_work() might run between the moment the work is added to the cli->worker queue and the moment schedule_work(&work->cli->work) is issued. In that case, the work dereference in nouveau_cli_work_fence() accesses the freed work memory, reads garbage (0x00010000 in this case), adds the offsetof(cli, work) of 0x1f8, and passes this garbage pointer into schedule_work(). I'm not entirely sure what the expectations for the work are, but perhaps the work should be added to the cli->worker list from nouveau_cli_work_fence() rather than by nouveau_cli_work_queue().

The regression seems to have been introduced by 814a23243bd2aaa9aafe03d85d0502a73be42b58, "drm/nouveau: implement per-client delayed workqueue with fence support".
Created attachment 261285 [details]
Bandaid to prevent processing of works not fully set up

I've attached a bandaid that seems to fix the crash on my system.
Hello,

This should be fixed by https://github.com/skeggsb/nouveau/commit/b372d7330c41be5b751187821ab81d91295c847d. Could you please confirm whether it fixes it for you?
Yes, it fixes the crash. I am not sure what impact holding the lock for a longer time has, but I didn't notice anything bad.