Bug 195581

Summary: NULL pointer dereference when amdgpu driver calls drm_load_edid_firmware
Product: Drivers
Reporter: Lutz Vieweg (lvml)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEEDINFO
Severity: normal
CC: mario.limonciello
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.10.11-1-ARCH
Subsystem:
Regression: No
Bisected commit-id:

Description Lutz Vieweg 2017-04-25 19:57:56 UTC
I have experienced multiple kernel crashes when drm_load_edid_firmware was called from the amdgpu driver, for example when the external display shut down after a long period of non-use.

Since the same operation usually works, I assume there is some sort of race condition when the kernel parameter drm_kms_helper.edid_firmware=my_edid_file is in use. I am opening this report to document the observation:

[172643.308167] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[172643.309348] IP: set_root+0x24/0xb0
[172643.310516] PGD 0 
[172643.312844] Oops: 0000 [#1] PREEMPT SMP
[172643.314013] Modules linked in: blowfish_generic blowfish_x86_64 blowfish_common des3_ede_x86_64 des_generic cast5_avx_x86_64 cast5_generic cast_common cbc twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic lrw ablk_helper xts gf128mul fuse joydev mousedev hid_generic hidp hid arc4 md4 nls_utf8 cifs dns_resolver fscache xt_tcpudp ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_owner xt_mark iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_filter cmac bnep cpufreq_ondemand msr nls_iso8859_1 eeepc_wmi nls_cp437 asus_wmi sparse_keymap vfat video mxm_wmi snd_hda_codec_generic fat edac_mce_amd snd_hda_codec_hdmi edac_core btusb snd_hda_intel
[172643.321641]  btrtl igb sp5100_tco btbcm ptp kvm_amd pps_core btintel snd_hda_codec kvm bluetooth evdev snd_hda_core input_leds led_class snd_hwdep irqbypass mac_hid snd_pcm rfkill pcspkr dca crc16 i2c_piix4 snd_timer snd soundcore shpchp fjes wmi i2c_designware_platform 8250_dw i2c_designware_core acpi_cpufreq button tpm_tis tpm_tis_core tpm sch_fq_codel usbip_host usbip_core sg it87(O) hwmon_vid ip_tables x_tables algif_skcipher af_alg sd_mod uas usb_storage serio_raw atkbd libps2 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd ahci ccp libahci rng_core xhci_pci xhci_hcd libata usbcore scsi_mod usb_common i8042 serio amdgpu i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm xfs libcrc32c crc32c_generic crc32c_intel
[172643.327265]  dm_crypt dm_mod nvme nvme_core i2c_dev
[172643.328089] CPU: 4 PID: 1071 Comm: Xorg Tainted: G           O    4.10.11-1-ARCH #1
[172643.328920] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 0604 04/06/2017
[172643.329759] task: ffff8807f6038e40 task.stack: ffffc9000809c000
[172643.330595] RIP: 0010:set_root+0x24/0xb0
[172643.331426] RSP: 0018:ffffc9000809f7e0 EFLAGS: 00010202
[172643.332256] RAX: ffff8807f6038e40 RBX: ffffc9000809f910 RCX: ffff8807f72e7200
[172643.333084] RDX: ffffffff81d383c8 RSI: 0000000000000000 RDI: ffffc9000809f910
[172643.333914] RBP: ffffc9000809f7f8 R08: ffff8807f72e7200 R09: ffff8807f72e7200
[172643.334741] R10: 00000000ffffffea R11: 0000000000000000 R12: 0000000000000000
[172643.335558] R13: ffff88002a67d01c R14: ffffc9000809f910 R15: 0000000000004650
[172643.336365] FS:  00007f4604912940(0000) GS:ffff88081ed00000(0000) knlGS:0000000000000000
[172643.337172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[172643.337981] CR2: 0000000000000008 CR3: 0000000005a09000 CR4: 00000000003406e0
[172643.338794] Call Trace:
[172643.339604]  path_init+0x1e3/0x350
[172643.340419]  path_openat+0x7c/0x1180
[172643.341220]  ? default_wake_function+0x12/0x20
[172643.342007]  ? __wake_up_common+0x4d/0x80
[172643.342777]  ? ep_poll_callback+0xef/0x1e0
[172643.343538]  do_filp_open+0x91/0x100
[172643.344297]  ? platform_match+0x29/0xa0
[172643.345046]  ? getname_kernel+0x32/0xe0
[172643.345782]  ? kmem_cache_alloc+0xdb/0x1b0
[172643.346510]  file_open_name+0x112/0x140
[172643.347234]  filp_open+0x33/0x60
[172643.347947]  kernel_read_file_from_path+0x36/0x70
[172643.348654]  _request_firmware+0x287/0xa70
[172643.349357]  request_firmware+0x37/0x50
[172643.350047]  drm_load_edid_firmware+0x316/0x530 [drm_kms_helper]
[172643.350728]  drm_helper_probe_single_connector_modes+0x16b/0x520 [drm_kms_helper]
[172643.351403]  drm_setup_crtcs+0x7b/0x9c0 [drm_kms_helper]
[172643.352068]  drm_fb_helper_hotplug_event+0xd2/0xf0 [drm_kms_helper]
[172643.352732]  drm_fb_helper_restore_fbdev_mode_unlocked+0x57/0x80 [drm_kms_helper]
[172643.353388]  amdgpu_fbdev_restore_mode+0x1a/0x40 [amdgpu]
[172643.354029]  amdgpu_driver_lastclose_kms+0x12/0x20 [amdgpu]
[172643.354662]  drm_lastclose+0x39/0xf0 [drm]
[172643.355292]  drm_release+0x2bc/0x370 [drm]
[172643.355929]  __fput+0xa2/0x1f0
[172643.356559]  ____fput+0xe/0x10
[172643.357175]  task_work_run+0x80/0xa0
[172643.357775]  do_exit+0x2b9/0xb40
[172643.358358]  ? __do_page_fault+0x2dc/0x510
[172643.358930]  do_group_exit+0x3b/0xb0
[172643.359483]  SyS_exit_group+0x14/0x20
[172643.360023]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[172643.360550] RIP: 0033:0x7f460274c868
[172643.361061] RSP: 002b:00007ffc82da85a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[172643.361571] RAX: ffffffffffffffda RBX: 000000000157f060 RCX: 00007f460274c868
[172643.362072] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[172643.362561] RBP: 00007ffc82da85a0 R08: 00000000000000e7 R09: fffffffffffffd68
[172643.363036] R10: 00007f45ef381178 R11: 0000000000000246 R12: 00007f45ef380db8
[172643.363500] R13: 00007ffc82da8518 R14: 00007f4604a10000 R15: 0000000000000000
[172643.363949] Code: c3 66 0f 1f 44 00 00 0f 1f 44 00 00 55 65 48 8b 04 25 00 d3 00 00 48 89 e5 41 55 41 54 53 f6 47 38 40 4c 8b a0 70 06 00 00 74 3b <41> 8b 4c 24 08 f6 c1 01 75 76 49 8b 54 24 20 49 8b 44 24 18 48 
[172643.364894] RIP: set_root+0x24/0xb0 RSP: ffffc9000809f7e0
[172643.365353] CR2: 0000000000000008
[172643.365804] ---[ end trace 04c82782dfd786e5 ]---
[172643.366251] Fixing recursive fault but reboot is needed!


(For the time being I will compile a non-preempting version of the kernel to get a more stable system.)
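
When comparing this oops with another report, it can help to reduce the call trace above to the frames the kernel printed without a leading "?" (the "?" marks entries the stack unwinder could not confirm). The following is only a minimal sketch, assuming the dmesg line format shown in this report; the helper name reliable_frames and the regular expression are illustrative and not part of any kernel tool.

```python
import re

# Matches trace lines such as:
# "[172643.350047]  drm_load_edid_firmware+0x316/0x530 [drm_kms_helper]"
# The optional "? " prefix marks a frame the unwinder could not confirm.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def reliable_frames(log_text):
    """Return (symbol, module) pairs for frames not prefixed with '?'.

    module is None for symbols that belong to the core kernel image.
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None or match.group("unreliable"):
            continue  # not a trace frame, or an unconfirmed one
        frames.append((match.group("symbol"), match.group("module")))
    return frames


# A few lines copied from the trace in this report.
sample = """\
[172643.338794] Call Trace:
[172643.339604]  path_init+0x1e3/0x350
[172643.341220]  ? default_wake_function+0x12/0x20
[172643.349357]  request_firmware+0x37/0x50
[172643.350047]  drm_load_edid_firmware+0x316/0x530 [drm_kms_helper]
[172643.357775]  do_exit+0x2b9/0xb40
"""

for symbol, module in reliable_frames(sample):
    print(symbol, module or "(core kernel)")
```

Running the same helper over the trace from a second report gives two short lists that can be compared frame by frame.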
Comment 1 Michel Dänzer 2017-04-26 01:34:18 UTC
Looks like https://bugs.freedesktop.org/show_bug.cgi?id=100375 . Is this a regression from older kernel versions?
Comment 2 Lutz Vieweg 2017-04-26 07:53:52 UTC
Indeed, https://bugs.freedesktop.org/show_bug.cgi?id=100375 looks like it is the same issue.

I cannot say whether this is a regression from an older kernel, because I only recently put this AMD RX-460 GPU into the system, and significantly older kernels don't work that well with Ryzen CPUs...
Comment 3 Lutz Vieweg 2017-04-26 08:02:34 UTC
BTW: Is there a reason why the EDID is read anew every time one uses "xrandr --output ... --mode ... --rate ..." just to switch the refresh rate, and also when switching between X11 and the text console?

(I would have thought that reading the EDID is required only when a new monitor connection is made.)
Comment 4 Mario Limonciello (AMD) 2022-07-09 13:54:28 UTC
Some patches were posted that might help this issue.  You might want to give them a try.

https://lore.kernel.org/lkml/CFXOER.OW6JFDCNUAT32@gmail.com/T/#m434c4f24f01e06f747fdc6c7f41b12babd4fb764