I have hit kernel crashes multiple times when drm_load_edid_firmware was called from the amdgpu driver, for example when the external display shut down after extended non-use. Since the same operation often works fine, I assume there is some sort of race condition when using the kernel parameter drm_kms_helper.edid_firmware=my_edid_file. I am opening this report to document the observation:

[172643.308167] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[172643.309348] IP: set_root+0x24/0xb0
[172643.310516] PGD 0
[172643.312844] Oops: 0000 [#1] PREEMPT SMP
[172643.314013] Modules linked in: blowfish_generic blowfish_x86_64 blowfish_common des3_ede_x86_64 des_generic cast5_avx_x86_64 cast5_generic cast_common cbc twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic lrw ablk_helper xts gf128mul fuse joydev mousedev hid_generic hidp hid arc4 md4 nls_utf8 cifs dns_resolver fscache xt_tcpudp ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_owner xt_mark iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_filter cmac bnep cpufreq_ondemand msr nls_iso8859_1 eeepc_wmi nls_cp437 asus_wmi sparse_keymap vfat video mxm_wmi snd_hda_codec_generic fat edac_mce_amd snd_hda_codec_hdmi edac_core btusb snd_hda_intel
[172643.321641] btrtl igb sp5100_tco btbcm ptp kvm_amd pps_core btintel snd_hda_codec kvm bluetooth evdev snd_hda_core input_leds led_class snd_hwdep irqbypass mac_hid snd_pcm rfkill pcspkr dca crc16 i2c_piix4 snd_timer snd soundcore shpchp fjes wmi i2c_designware_platform 8250_dw i2c_designware_core acpi_cpufreq button tpm_tis tpm_tis_core tpm sch_fq_codel usbip_host usbip_core sg it87(O) hwmon_vid ip_tables x_tables algif_skcipher af_alg sd_mod uas usb_storage serio_raw atkbd libps2 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd ahci ccp libahci rng_core xhci_pci xhci_hcd libata usbcore scsi_mod usb_common i8042 serio amdgpu i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm xfs libcrc32c crc32c_generic crc32c_intel
[172643.327265] dm_crypt dm_mod nvme nvme_core i2c_dev
[172643.328089] CPU: 4 PID: 1071 Comm: Xorg Tainted: G O 4.10.11-1-ARCH #1
[172643.328920] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 0604 04/06/2017
[172643.329759] task: ffff8807f6038e40 task.stack: ffffc9000809c000
[172643.330595] RIP: 0010:set_root+0x24/0xb0
[172643.331426] RSP: 0018:ffffc9000809f7e0 EFLAGS: 00010202
[172643.332256] RAX: ffff8807f6038e40 RBX: ffffc9000809f910 RCX: ffff8807f72e7200
[172643.333084] RDX: ffffffff81d383c8 RSI: 0000000000000000 RDI: ffffc9000809f910
[172643.333914] RBP: ffffc9000809f7f8 R08: ffff8807f72e7200 R09: ffff8807f72e7200
[172643.334741] R10: 00000000ffffffea R11: 0000000000000000 R12: 0000000000000000
[172643.335558] R13: ffff88002a67d01c R14: ffffc9000809f910 R15: 0000000000004650
[172643.336365] FS: 00007f4604912940(0000) GS:ffff88081ed00000(0000) knlGS:0000000000000000
[172643.337172] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[172643.337981] CR2: 0000000000000008 CR3: 0000000005a09000 CR4: 00000000003406e0
[172643.338794] Call Trace:
[172643.339604] path_init+0x1e3/0x350
[172643.340419] path_openat+0x7c/0x1180
[172643.341220] ? default_wake_function+0x12/0x20
[172643.342007] ? __wake_up_common+0x4d/0x80
[172643.342777] ? ep_poll_callback+0xef/0x1e0
[172643.343538] do_filp_open+0x91/0x100
[172643.344297] ? platform_match+0x29/0xa0
[172643.345046] ? getname_kernel+0x32/0xe0
[172643.345782] ? kmem_cache_alloc+0xdb/0x1b0
[172643.346510] file_open_name+0x112/0x140
[172643.347234] filp_open+0x33/0x60
[172643.347947] kernel_read_file_from_path+0x36/0x70
[172643.348654] _request_firmware+0x287/0xa70
[172643.349357] request_firmware+0x37/0x50
[172643.350047] drm_load_edid_firmware+0x316/0x530 [drm_kms_helper]
[172643.350728] drm_helper_probe_single_connector_modes+0x16b/0x520 [drm_kms_helper]
[172643.351403] drm_setup_crtcs+0x7b/0x9c0 [drm_kms_helper]
[172643.352068] drm_fb_helper_hotplug_event+0xd2/0xf0 [drm_kms_helper]
[172643.352732] drm_fb_helper_restore_fbdev_mode_unlocked+0x57/0x80 [drm_kms_helper]
[172643.353388] amdgpu_fbdev_restore_mode+0x1a/0x40 [amdgpu]
[172643.354029] amdgpu_driver_lastclose_kms+0x12/0x20 [amdgpu]
[172643.354662] drm_lastclose+0x39/0xf0 [drm]
[172643.355292] drm_release+0x2bc/0x370 [drm]
[172643.355929] __fput+0xa2/0x1f0
[172643.356559] ____fput+0xe/0x10
[172643.357175] task_work_run+0x80/0xa0
[172643.357775] do_exit+0x2b9/0xb40
[172643.358358] ? __do_page_fault+0x2dc/0x510
[172643.358930] do_group_exit+0x3b/0xb0
[172643.359483] SyS_exit_group+0x14/0x20
[172643.360023] entry_SYSCALL_64_fastpath+0x1a/0xa9
[172643.360550] RIP: 0033:0x7f460274c868
[172643.361061] RSP: 002b:00007ffc82da85a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[172643.361571] RAX: ffffffffffffffda RBX: 000000000157f060 RCX: 00007f460274c868
[172643.362072] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[172643.362561] RBP: 00007ffc82da85a0 R08: 00000000000000e7 R09: fffffffffffffd68
[172643.363036] R10: 00007f45ef381178 R11: 0000000000000246 R12: 00007f45ef380db8
[172643.363500] R13: 00007ffc82da8518 R14: 00007f4604a10000 R15: 0000000000000000
[172643.363949] Code: c3 66 0f 1f 44 00 00 0f 1f 44 00 00 55 65 48 8b 04 25 00 d3 00 00 48 89 e5 41 55 41 54 53 f6 47 38 40 4c 8b a0 70 06 00 00 74 3b <41> 8b 4c 24 08 f6 c1 01 75 76 49 8b 54 24 20 49 8b 44 24 18 48
[172643.364894] RIP: set_root+0x24/0xb0 RSP: ffffc9000809f7e0
[172643.365353] CR2: 0000000000000008
[172643.365804] ---[ end trace 04c82782dfd786e5 ]---
[172643.366251] Fixing recursive fault but reboot is needed!

(For the time being I will compile a non-preempting version of the kernel to get a more stable system.)
Looks like https://bugs.freedesktop.org/show_bug.cgi?id=100375 . Is this a regression from older kernel versions?
Indeed, https://bugs.freedesktop.org/show_bug.cgi?id=100375 looks like it is the same issue. I cannot say whether this is a regression from an older kernel, because I only recently put this AMD RX-460 GPU into the system, and significantly older kernels don't work that well with Ryzen CPUs...
BTW: Is there a reason why the EDID is read anew every time one uses "xrandr --output ... --mode ... --rate ..." just to switch the refresh rate, and also when switching between X11 and the text console? (I would have thought that reading the EDID is required only when a new monitor connection is made.)
Some patches were posted that might help with this issue. You might want to give them a try: https://lore.kernel.org/lkml/CFXOER.OW6JFDCNUAT32@gmail.com/T/#m434c4f24f01e06f747fdc6c7f41b12babd4fb764