Bug 198221

Summary: nouveau DRM driver scheduling invalid work
Summary: nouveau DRM driver scheduling invalid work
Product: Drivers
Component: Video (DRI - non Intel)
Reporter: Petr Vandrovec (petr)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
Priority: P1
CC: imirkin, pierre.morrow
Hardware: All
OS: Linux
Kernel Version: 4.15.0-rc3
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
- Full dmesg
- Bandaid to prevent processing of works not fully set up

Description Petr Vandrovec 2017-12-21 00:59:00 UTC
Created attachment 261283 [details]
Full dmesg

Hi,
   my X server just stopped responding, and ssh from another host revealed that nouveau driver triggered kernel oops in queue_work_on.  System is using this new kernel for less than a week.  No similar crash ever happened on the system before.

[605148.177397] BUG: unable to handle kernel paging request at 00000000000101f8
[605148.180006] IP: queue_work_on+0x12/0x40
[605148.186405] PGD 0 P4D 0
[605148.186405] Oops: 0002 [#1] PREEMPT SMP
[605148.186405] Modules linked in: netlink_diag vmnet(OE) parport_pc vmw_vsock_vmci_transport vmw_vmci vmmon(OE) loop tcp_diag inet_diag nfsv3 ppdev parport xt_REDIRECT nf_nat_redirect arc4 md4 nls_iso8859_2 deflate cifs twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic ccm rpcsec_gss_krb5 nfsv4 fuse nfs fscache vsock camellia_aesni_avx_x86_64 camellia_x86_64 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic lrw blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic cast_common ablk_helper des_generic cmac xcbc rmd160 af_key xfrm_algo binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel coretemp hwmon
[605148.234373]  snd_hda_codec x86_pkg_temp_thermal crct10dif_pclmul snd_hwdep crc32_pclmul crc32c_intel uas ghash_clmulni_intel snd_hda_core dcdbas pcbc snd_pcm_oss e1000e usb_storage aesni_intel snd_mixer_oss sr_mod snd_pcm aes_x86_64 crypto_simd glue_helper input_leds cryptd snd_timer cdrom mei_me i2c_i801 ptp tpm_tis sg snd mei pps_core tpm_tis_core tpm iTCO_wdt iTCO_vendor_support lpc_ich mfd_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables ipv6 crc_ccitt autofs4 [last unloaded: vmnet]
[605148.234373] CPU: 0 PID: 8021 Comm: Xorg Tainted: G           OE    4.15.0-rc3-64-00039-g59217b88e0e1 #7
[605148.234373] Hardware name: Dell Inc. Precision T3610/09M8Y8, BIOS A12 09/09/2016
[605148.234373] RIP: 0010:queue_work_on+0x12/0x40
[605148.234373] RSP: 0018:ffffc90001f53cb0 EFLAGS: 00010002
[605148.234373] RAX: 0000000000000202 RBX: 0000000000000202 RCX: 0000000000000000
[605148.234373] RDX: 00000000000101f8 RSI: ffff88106e40c000 RDI: 0000000000000020
[605148.234373] RBP: ffff88012ba31140 R08: 0000000000000000 R09: 0000000000000000
[605148.234373] R10: ffff880be3fe3f30 R11: ffff88102716ab20 R12: ffff8810264f00a0
[605148.234373] R13: ffff880df5af3600 R14: ffff881020c95e38 R15: ffffffff81a9b978
[605148.234373] FS:  00007f701e290a40(0000) GS:ffff88106e800000(0000) knlGS:0000000000000000
[605148.234373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[605148.234373] CR2: 00000000000101f8 CR3: 0000000ff7c41005 CR4: 00000000001606f0
[605148.234373] Call Trace:
[605148.234373]  nouveau_gem_object_close+0x162/0x1e0
[605148.234373]  drm_gem_object_release_handle+0x27/0x80
[605148.234373]  drm_gem_handle_delete+0x55/0x80
[605148.234373]  ? drm_gem_handle_create+0x30/0x30
[605148.234373]  drm_ioctl_kernel+0x51/0xa0
[605148.234373]  drm_ioctl+0x2cd/0x370
[605148.234373]  ? drm_gem_handle_create+0x30/0x30
[605148.234373]  nouveau_drm_ioctl+0x5c/0xb0
[605148.234373]  do_vfs_ioctl+0x96/0x610
[605148.234373]  ? __fget+0x62/0xa0
[605148.234373]  ? __fget+0x62/0xa0
[605148.234373]  SyS_ioctl+0x6f/0x80
[605148.234373]  entry_SYSCALL_64_fastpath+0x1e/0x81
[605148.234373] RIP: 0033:0x7f701b75e5d7
[605148.234373] RSP: 002b:00007ffc01902ad8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[605148.234373] RAX: ffffffffffffffda RBX: 0000563abec8f1b0 RCX: 00007f701b75e5d7
[605148.234373] RDX: 00007ffc01902b10 RSI: 0000000040086409 RDI: 000000000000000b
[605148.234373] RBP: 00007ffc01902b10 R08: 0000000000000002 R09: 0000000000000001
[605148.234373] R10: 0000563abe7257e0 R11: 0000000000000246 R12: 0000000040086409
[605148.234373] R13: 000000000000000b R14: 0000563abe72b890 R15: 0000563abec8f1b0
[605148.234373] Code: b0 34 c8 81 e8 55 92 03 00 c6 05 01 bd 2e 01 01 e9 68 ff ff ff 0f 1f 40 00 53 9c 58 0f 1f 44 00 00 48 89 c3 fa 66 0f 1f 44 00 00 <f0> 0f ba 2a 00 73 10 31 c9 48 89 df 57 9d 0f 1f 44 00 00 89 c8
[605148.234373] RIP: queue_work_on+0x12/0x40 RSP: ffffc90001f53cb0
[605148.234373] CR2: 00000000000101f8
[605148.234373] ---[ end trace 537ebb0d3ac8419d ]---
Comment 1 Ilia Mirkin 2017-12-21 01:21:42 UTC
There's a pretty big memory leak in the GEM object handler right now, fixed by

https://github.com/skeggsb/linux/commit/4ef928929987c19fff4d3c1650f139560ba1cc13

Doesn't seem identical to your issue, but perhaps the leak triggers this as a secondary issue.
Comment 2 Petr Vandrovec 2017-12-21 04:32:00 UTC
I've built a kernel with the change and will give it some time, but I don't think that it is the culprit.

I think that there is a race in how nouveau handles scheduled work:

The work being scheduled has its func callback set to nouveau_gem_object_delete_work(), which does its job and then frees the work.

Now, nouveau_cli_work_queue sets some work properties, adds the work to the cli->worker queue, and schedules work->cli->work.

But nouveau_cli_work might run between the moment the work is added to the cli->worker queue and the moment schedule_work(&work->cli->work) is issued.

In that case, the work dereference in nouveau_cli_work_fence will access freed work memory, read garbage (0x00010000 in this case), add offsetof(cli, work) of 0x1f8, and pass the resulting garbage pointer to schedule_work().

I'm not entirely sure what the expectations are for the work, but perhaps the work should be added to the cli->worker list from nouveau_cli_work_fence rather than from nouveau_cli_work_queue.

The regression seems to have been introduced by commit 814a23243bd2aaa9aafe03d85d0502a73be42b58, "drm/nouveau: implement per-client delayed workqueue with fence support".
Comment 3 Petr Vandrovec 2017-12-21 07:17:08 UTC
Created attachment 261285 [details]
Bandaid to prevent processing of works not fully set up

I've attached a bandaid that seems to fix the crash on my system.
Comment 4 Pierre Moreau 2017-12-29 21:08:40 UTC
Hello,

This should be fixed by https://github.com/skeggsb/nouveau/commit/b372d7330c41be5b751187821ab81d91295c847d. Could you please confirm whether it does fix it for you?
Comment 5 Petr Vandrovec 2017-12-29 21:45:08 UTC
Yes, it fixes the crash. I am not sure what impact holding the lock for a longer time has, but I didn't notice anything bad.