Bug 219532 - Crash in RIP: 0010:xhci_handle_stopped_cmd_ring
Summary: Crash in RIP: 0010:xhci_handle_stopped_cmd_ring
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: All Linux
: P3 normal
Assignee: Default virtual assignee for Drivers/USB
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-11-26 01:53 UTC by James.Dutton
Modified: 2024-12-01 22:07 UTC (History)
2 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
The log output (6.72 KB, text/plain)
2024-11-26 01:53 UTC, James.Dutton
Details
xhci-ring.c (130.41 KB, text/plain)
2024-11-26 01:55 UTC, James.Dutton
Details
The compiled xhci-ring.o file (903.09 KB, application/x-object)
2024-11-26 01:56 UTC, James.Dutton
Details
try to fix the bug and gather confirmation (711 bytes, patch)
2024-12-01 22:07 UTC, Michał Pecio
Details | Diff

Description James.Dutton 2024-11-26 01:53:03 UTC
Created attachment 307279 [details]
The log output

I am not sure how to reproduce this, but it halted the system.
kernel is 6.12.1  mainline.


The system halted and this was written to the log:
<4>1 2024-11-26T00:31:54+00:00 nvme3 kernel - - -  xhci_hcd 0000:c1:00.3: Timeout while waiting for setup device command
<1>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  BUG: kernel NULL pointer dereference, address: 0000000000000030
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  xhci_hcd 0000:c1:00.3: Timeout while waiting for setup device command
<1>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  #PF: supervisor read access in kernel mode
<1>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  #PF: error_code(0x0000) - not-present page
<6>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  PGD 0 P4D 0 
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  CPU: 6 UID: 0 PID: 30684 Comm: kworker/6:1 Not tainted 6.12.1 #126
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  Hardware name: Framework Laptop 16 (AMD Ryzen 7040 Series)/FRANMZCP07, BIOS 03.05 11/13/2024
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  Workqueue: events xhci_handle_command_timeout
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  RIP: 0010:xhci_handle_stopped_cmd_ring+0xf3/0x150
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  Code: 49 8b 45 00 48 8b 4b 18 48 c7 c2 d0 af 6b a6 48 c7 c7 88 26 81 a7 48 8b 30 e8 b9 cd bc ff e9 76 ff ff ff 4d 89 a5 28 01 00 00 <41> 8b 7c 24 30 e8 d3 7c 4a ff 48 8b 35 6c e4 da 00 bf 00 20 00 00
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  RSP: 0018:ffff9d8c8e39fbe0 EFLAGS: 00010046
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  RAX: ffff928904c35500 RBX: ffff9289057e82e8 RCX: 0000000000000000
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9d8c8089dff0
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  RBP: ffff9d8c8e39fc00 R08: 0000000000000000 R09: 0000000000000000
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  R13: ffff9289057e8250 R14: ffff9289057e82e8 R15: 00000000000001f4
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  FS:  0000000000000000(0000) GS:ffff9297dfd00000(0000) knlGS:0000000000000000
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  CR2: 0000000000000030 CR3: 0000000e16852000 CR4: 0000000000f50ef0
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  PKRU: 55555554
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  Call Trace:
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   <TASK>
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? show_regs+0x6c/0x80
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? __die+0x24/0x80
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? page_fault_oops+0x175/0x5c0
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? finish_task_switch.isra.0+0x92/0x2d0
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? do_user_addr_fault+0x4b2/0x870
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? srso_alias_return_thunk+0x5/0xfbef5
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? lock_timer_base+0x30/0xf0
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? exc_page_fault+0x85/0x1c0
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? asm_exc_page_fault+0x27/0x30
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? xhci_handle_stopped_cmd_ring+0xf3/0x150
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   xhci_handle_command_timeout+0x544/0x5a0
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   process_one_work+0x178/0x3d0
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   worker_thread+0x2de/0x410
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? __pfx_worker_thread+0x10/0x10
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   kthread+0xe1/0x110
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? __pfx_kthread+0x10/0x10
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ret_from_fork+0x44/0x70
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ? __pfx_kthread+0x10/0x10
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   ret_from_fork_asm+0x1a/0x30
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   </TASK>
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  Modules linked in: typec_displayport cdc_acm ftdi_sio usbserial ccm rfcomm snd_seq_dummy snd_hrtimer xt_CHECKSUM xt_MASQUERADE nft_chain_nat nf_nat bridge stp llc cmac algif_hash algif_skcipher af_alg qrtr ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG bnep nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables binfmt_misc intel_rapl_msr amd_atl intel_rapl_common nls_iso8859_1 amdgpu snd_sof_amd_rembrandt snd_sof_amd_renoir snd_hda_codec_realtek snd_sof_amd_acp snd_sof_pci snd_hda_codec_generic snd_sof_xtensa_dsp snd_hda_scodec_component snd_sof snd_hda_codec_hdmi snd_sof_utils snd_pci_ps snd_amd_sdw_acpi soundwire_amd soundwire_generic_allocation snd_hda_intel soundwire_bus snd_intel_dspcfg leds_cros_ec cros_usbpd_charger cros_ec_sysfs cros_ec_debugfs led_class_multicolor cros_ec_chardev snd_intel_sdw_acpi cros_usbpd_logger cros_ec_hwmon cros_charge_control gpio_cros_ec cros_usbpd_notify snd_soc_core
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   snd_hda_codec btusb edac_mce_amd mt7921e btrtl snd_compress snd_hda_core cros_ec_dev ac97_bus mt7921_common btintel snd_pcm_dmaengine mt792x_lib snd_hwdep snd_rpl_pci_acp6x amdxcp mt76_connac_lib btbcm drm_exec snd_acp_pci snd_seq_midi btmtk kvm_amd sch_fq_codel snd_seq_midi_event mt76 gpu_sched cros_ec_lpcs snd_acp_legacy_common spd5118 drm_buddy snd_pci_acp6x bluetooth cros_ec hid_sensor_als snd_rawmidi kvm hid_sensor_trigger snd_pcm drm_suballoc_helper mac80211 snd_seq drm_ttm_helper snd_pci_acp5x industrialio_triggered_buffer ttm kfifo_buf amd_pmf snd_seq_device hid_sensor_iio_common rapl wmi_bmof drm_display_helper industrialio snd_rn_pci_acp3x snd_timer amdtee snd_acp_config cfg80211 input_leds snd_soc_acpi snd i2c_piix4 drm_kms_helper libarc4 soundcore i2c_smbus snd_pci_acp3x ccp k10temp i2c_algo_bit amd_sfh tee amd_pmc platform_profile joydev mac_hid msr parport_pc ppdev lp parport nvme_fabrics efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -   dm_crypt hid_multitouch hid_sensor_hub crct10dif_pclmul crc32_pclmul hid_generic polyval_clmulni polyval_generic ucsi_acpi ghash_clmulni_intel i2c_hid_acpi nvme sha256_ssse3 usbhid i2c_hid typec_ucsi hid sha1_ssse3 thunderbolt nvme_core typec video drm wmi aesni_intel crypto_simd cryptd
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  CR2: 0000000000000030
<4>1 2024-11-26T00:32:00+00:00 nvme3 kernel - - -  ---[ end trace 0000000000000000 ]---
Comment 1 James.Dutton 2024-11-26 01:55:38 UTC
Created attachment 307280 [details]
xhci-ring.c

The kernel source file where the crash happens.
xhci-ring.c:423
       /* ring command ring doorbell to restart the command ring */
        if ((xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) &&
            !(xhci->xhc_state & XHCI_STATE_DYING)) {
                xhci->current_cmd = cur_cmd;     <- Crashes here. NULL pointer dereference.
                xhci_mod_cmd_timer(xhci);
                xhci_ring_cmd_db(xhci);
        }
Comment 2 James.Dutton 2024-11-26 01:56:08 UTC
Created attachment 307281 [details]
The compiled xhci-ring.o file
Comment 3 James.Dutton 2024-11-26 02:00:04 UTC
I think xhci variable must be NULL entering the xhci_handle_stopped_cmd_ring()
One could put a null check at the top of the function, but I don't know what one should do in that case. 
Its a void function, so no error value can be returned.
Comment 4 James.Dutton 2024-11-26 02:05:52 UTC
c1:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b9 (prog-if 30 [XHCI])
        Subsystem: Framework Computer Inc. Device 0005
        Flags: bus master, fast devsel, latency 0, IRQ 45, IOMMU group 17
        Memory at 90200000 (64-bit, non-prefetchable) [size=1M]
        Capabilities: <access denied>
        Kernel driver in use: xhci_hcd
Comment 5 Michał Pecio 2024-11-26 23:09:34 UTC
It looks like some device doesn't respond to address assignment after connection. If you weren't connecting anything at the time, it's possible that a device is buggy and had disconnected by itself a moment earlier, but the log is too short to tell.

Not sure why it crashed. It looks like there were two attempts 6 seconds apart and no crash on the first attempt.

It's unlikely that xhci was NULL near the end of the function. If it were, there were several opportunities to crash earlier. The call to xhci_mod_cmd_timer() in the next line dereferences xhci->cur_cmd, which is perhaps more suspicious.
Comment 6 James.Dutton 2024-11-28 00:19:02 UTC
It might be dealing with a buggy device. My question is how should one do error recovery here when xhci is null but the function has no error return value as its a void function.
Comment 7 Michał Pecio 2024-12-01 22:06:59 UTC
No sensible way to handle it and it should never happen. All we could do is print an error and return immediately, but in any such case the xHCI driver is likely already FUBAR anyway.

I *hope* that you are mistaken and your crash was caused by dereferencing xhci->current_cmd in the next line, due to cur_cmd being NULL. This is not supposed to happen either, because the check for (xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) is there exactly to catch cases when no commands are pending and cur_cmd is expected to be NULL.

But it doesn't work for one in 255 commands, namely when the aborted command was the last one in its ring segment. Then enqueue points at the subsequent link TRB, while dequeue is already in the next segment. Until recently, such command abort would have failed due to a different bug (and caused different problems), but that other bug has just been fixed and it looks like we may start seeing those NULL dereferences now.

This patch should keep your system from crashing *if* this is the problem that you are running into. The driver should print "cur_cmd bug detected, 0 fff" and continue working normally. (Which means, keep printing more of those "setup device timed out".)
Comment 8 Michał Pecio 2024-12-01 22:07:45 UTC
Created attachment 307305 [details]
try to fix the bug and gather confirmation

Note You need to log in before you can comment on or make changes to this bug.