Bug 211337 - BUG: kernel NULL pointer dereference + PCI Bus reset on display power event when Thunderbolt plugged in
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: IOMMU
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_iommu
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-01-25 08:55 UTC by François Guerraz
Modified: 2021-01-30 15:37 UTC
0 users

See Also:
Kernel Version: 5.10.10
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (31.64 KB, application/zstd)
2021-01-25 08:55 UTC, François Guerraz
kernel log from 5.11.0-rc5-1-drm-tip-git-g1e0161c5128a (30.23 KB, application/zstd)
2021-01-26 17:58 UTC, François Guerraz

Description François Guerraz 2021-01-25 08:55:17 UTC
Created attachment 294847
dmesg

Posting as an IOMMU bug as it reminds me of bug 206571 and bug 208363 on the same device (Dell XPS 9300).

The device works fine with external displays attached to USB-C docking stations.

The Caldigit TS3 Plus docking station also works fine on its own as long as no external display is attached.

However, if I connect an external display while the Caldigit TS3 Plus docking station is connected, either via the docking station itself or via another adapter, then every time a "power event" happens on the external monitor (such as connecting, disconnecting, or rearranging the displays in GNOME), the entire PCI bus seems to disappear, with these messages in the logs:


`
Jan 24 22:20:07 XPS-20 kernel: pcieport 0000:00:07.0: pciehp: Slot(0): Link Down
Jan 24 22:20:07 XPS-20 kernel: pcieport 0000:00:07.0: pciehp: Slot(0): Card not present
Jan 24 22:20:07 XPS-20 kernel: pcieport 0000:02:04.0: can't change power state from D3cold to D0 (config space inaccessible)
Jan 24 22:20:07 XPS-20 kernel: igb 0000:06:00.0: removed PHC on eth0
Jan 24 22:20:07 XPS-20 kernel: igb 0000:06:00.0 eth0: PCIe link lost
Jan 24 22:20:07 XPS-20 kernel: i2c_hid i2c-WCOM4941:00: supply vdd not found, using dummy regulator
Jan 24 22:20:07 XPS-20 kernel: i2c_hid i2c-WCOM4941:00: supply vddl not found, using dummy regulator
`
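To pick these hotplug events out of a longer log, a minimal sketch along the following lines could be used. The regular expression is an assumption derived only from the line format in the excerpt above, and the name `find_pciehp_events` is hypothetical; neither is a kernel-defined interface.

```python
import re

# Assumed from the excerpt above:
#   "pcieport <domain:bus:dev.fn>: pciehp: Slot(<n>): <event>"
PCIEHP_RE = re.compile(
    r"pcieport (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"pciehp: Slot\((?P<slot>[^)]+)\): (?P<event>.+)$"
)


def find_pciehp_events(lines):
    """Return (bdf, slot, event) tuples for every pciehp slot message."""
    events = []
    for line in lines:
        match = PCIEHP_RE.search(line.rstrip("\n"))
        if match:
            events.append(
                (match.group("bdf"), match.group("slot"), match.group("event"))
            )
    return events


if __name__ == "__main__":
    # Two lines copied from the log excerpt in this report.
    sample = [
        "Jan 24 22:20:07 XPS-20 kernel: pcieport 0000:00:07.0: pciehp: Slot(0): Link Down",
        "Jan 24 22:20:07 XPS-20 kernel: igb 0000:06:00.0 eth0: PCIe link lost",
    ]
    for event in find_pciehp_events(sample):
        print(event)
```

Feeding it the output of `journalctl -k` line by line would list each port that reported a slot event, which may help correlate link drops with display power events.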

System and docking station firmware is up to date. I also tried with the IOMMU disabled (but I guess Thunderbolt forces it on anyway, or it is handled by UEFI).
Works fine on Windows 10.
Also, interestingly, external displays just plain don't work at all if I connect a TB device and disable secure boot.
Comment 1 François Guerraz 2021-01-25 17:45:36 UTC
I opened a similar bug report on the DRM bug tracker:
https://gitlab.freedesktop.org/drm/intel/-/issues/2999

I will also mention that my device has 32 GB of RAM, as I find some similarities in the logs with this bug:
https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1827042
Comment 2 François Guerraz 2021-01-26 17:58:09 UTC
Created attachment 294869
kernel log from 5.11.0-rc5-1-drm-tip-git-g1e0161c5128a

Tested today with 5.11.0-rc5-1-drm-tip-git-g1e0161c5128a; waking up the screen results in a system crash.
Comment 3 François Guerraz 2021-01-27 08:40:26 UTC
BUG: kernel NULL pointer dereference, address: 00000000000009c9
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0 
Oops: 0002 [#2] PREEMPT SMP NOPTI
CPU: 3 PID: 113 Comm: irq/122-pciehp Tainted: G     UD W         5.11.0-rc5-1-drm-tip-git-g1e0161c5128a #3
Hardware name: Dell Inc. XPS 13 9300/0PP9G2, BIOS 1.4.1 11/23/2020
RIP: 0010:mutex_lock+0x10/0x20
Code: 03 31 c0 c3 eb d4 0f 1f 40 00 0f 1f 44 00 00 be 02 00 00 00 e9 91 fa ff ff 90 0f 1f 44 00 00 31 c0 65 48 8b 14 25 c0 7b 01 00 <f0> 48 0f b1 17 75 01 c3 eb d6 66 0f 1f 44 00 00 0f 1f 44 00 00 41
RSP: 0018:ffffb428803b7e30 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: ffff8a95c12025c0 RSI: 0000000000001bc1 RDI: 00000000000009c9
RBP: 00000000000009c9 R08: 0000000000000000 R09: ffffb428803b7888
R10: ffffb428803b7880 R11: ffffffff864cb908 R12: ffff8a95c1202e24
R13: 0000000000000001 R14: ffff8a95c12025c0 R15: 0000000000000006
FS:  0000000000000000(0000) GS:ffff8a9d2f8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000009c9 CR3: 00000001662e0002 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
 perf_event_exit_task+0x30/0x440
 ? pciehp_ist+0x10a/0x110
 do_exit+0x38f/0xa80
 ? task_work_run+0x5c/0x90
 ? do_exit+0x37f/0xa80
 ? kthread+0x133/0x150
 ? rewind_stack_do_exit+0x17/0x17
Modules linked in: igb dca rfcomm ccm wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha libblake2s_generic typec_displayport cmac algif_hash algif_skcipher af_alg snd_hda_codec_hdmi cdc_ether usbnet snd_usb>
 dell_laptop mei_hdcp dell_wmi wacom blake2b_generic snd_sof_pci xor dell_smbios snd_sof_intel_byt intel_rapl_msr wmi_bmof intel_wmi_thunderbolt raid6_pq snd_sof_intel_ipc snd_sof_intel_hda_common libcrc32c dell_wmi_descriptor snd_soc_hdac_hda dcdbas snd_sof_xtensa_dsp snd_sof_intel_hda snd_sof snd_hda_ext_core snd_s>
 processor_thermal_mbox intel_gtt v4l2_fwnode processor_thermal_rapl syscopyarea sysfillrect intel_rapl_common sysimgblt intel_soc_dts_iosf videodev fb_sys_fops ucsi_acpi typec_ucsi tpm_tis typec tpm_tis_core wmi mac_hid intel_hid int3403_thermal i2c_hid mc int340x_thermal_zone video sparse_keymap int3400_thermal acp>
CR2: 00000000000009c9
---[ end trace b3179a9d4a3ea8ba ]---
RIP: 0010:kfree+0x3a4/0x460
Code: 55 18 e9 da fe ff ff 49 8b 07 31 ed a9 00 00 01 00 74 05 41 0f b6 6f 51 49 8b 07 a9 00 00 01 00 75 0a 49 8b 47 08 a8 01 75 02 <0f> 0b 49 8b 07 89 e9 be 06 00 00 00 48 c7 c2 00 f0 ff ff 48 d3 e2
RSP: 0000:ffffb428803b7bc0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a966083d000 RCX: 000000008080001c
RDX: 0000000000000000 RSI: ffffffff85493aa7 RDI: ffff8a966083d000
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8a95dd69b020 R11: 0000000000000001 R12: dead000000000100
R13: ffff8a966083d000 R14: ffff8a95c67a8ae8 R15: ffffe785c6820f40
FS:  0000000000000000(0000) GS:ffff8a9d2f8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000009c9 CR3: 00000001662e0002 CR4: 0000000000770ee0
PKRU: 55555554
Fixing recursive fault but reboot is needed!
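
For readers unfamiliar with the `#PF: error_code(0x0002)` line in the oops above, the value can be decoded from the x86 page-fault error code bit layout. The following is a minimal sketch; the constant names and `decode_pf_error_code` are illustrative and only cover the five low bits.

```python
# Low bits of the x86 page-fault error code.
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: supervisor-mode access, 1: user-mode access
PF_RSVD = 1 << 3   # 1: reserved bit set in a paging-structure entry
PF_INSTR = 1 << 4  # 1: fault caused by an instruction fetch


def decode_pf_error_code(code):
    """Translate a page-fault error code into a small descriptive dict."""
    return {
        "cause": "protection violation" if code & PF_PROT else "not-present page",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "supervisor",
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }


if __name__ == "__main__":
    # 0x0002 is the value reported in the oops above.
    print(decode_pf_error_code(0x0002))
```

For `0x0002` this yields a supervisor-mode write to a not-present page, which agrees with the two `#PF:` lines in the trace.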
Comment 4 François Guerraz 2021-01-30 15:37:49 UTC
Fixed by https://patchwork.freedesktop.org/series/86402/
