Bug 120241
Comment 1
Greg Kroah-Hartman
2016-06-14 15:49:29 UTC
Any more details on this one? lsusb, dmesg? info about the usb key? I tried plugging/unplugging several times on a Dell xps 15 with USB 3.1 Gen2 type-c hdd, but could not reproduce it I couldn't find any mail on the list either for this The issue only happens when you use udisk to automount/umount the disk, and it hangs during umount. The USB key Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: New USB device found, idVendor=13fe, idProduct=5500 Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3 Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: Product: OTG Type C Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: Manufacturer: Verbatim Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: SerialNumber: 071057B7CFA22230 Created attachment 221051 [details]
kern.log remove some sensitive data
Created attachment 221061 [details]
lsusb -vvv
The log show the graphics driver hit a NULL pointer dereference after disconnected and connected hte usb device a couple times before. Are the logs always simiar to this one? ending with NULL pointer dereference showing i915 driver in the call trance? The log shows Alpine ridge found: [ 45.512706] xhci_hcd 0000:39:00.0: xHCI Host Controller [ 45.846031] usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd [ 45.863347] usb 4-1: SerialNumber: 071057B7CFA22230 Device disconnected (physically, or "ejected" with udisk): this was the last device so alpine ridge xhci is removed from bus. [ 55.650331] xhci_hcd 0000:39:00.0: remove, state 1 [ 55.650340] usb usb4: USB disconnect, device number 1 [ 55.650341] usb 4-1: USB disconnect, device number 2 [ 56.551596] xhci_hcd 0000:39:00.0: USB bus 4 deregistered [ 56.551603] xhci_hcd 0000:39:00.0: remove, state 4 Something connected, alpine ridge xhci appeared on PCI bus again: [ 65.931026] xhci_hcd 0000:39:00.0: xHCI Host Controller [ 66.262717] usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd [ 66.279565] usb 4-1: SerialNumber: 071057B7CFA22230 Device disconnected (physically, or "ejected" with udisk), alpine ridge xhci controller removed from PCI bus: [ 68.825526] usb 4-1: USB disconnect, device number 2 [ 71.026744] xhci_hcd 0000:39:00.0: remove, state 4 [ 71.026771] usb usb4: USB disconnect, device number 1 [ 71.027384] xhci_hcd 0000:39:00.0: Host not halted after 16000 microseconds. [ 71.028561] xhci_hcd 0000:39:00.0: USB bus 4 deregistered Device, and xhci controller detected again: [ 81.313791] xhci_hcd 0000:39:00.0: xHCI Host Controller [ 81.634807] usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd [ 81.655734] usb 4-1: SerialNumber: 071057B7CFA22230 And disconnected: [ 86.400422] xhci_hcd 0000:39:00.0: remove, state 1 [ 86.400447] usb usb4: USB disconnect, device number 1 [ 86.400449] usb 4-1: USB disconnect, device number 2 [ 86.427127] xhci_hcd 0000:39:00.0: Host not halted after 16000 microseconds. [ 86.428312] xhci_hcd 0000:39:00.0: USB bus 4 deregistered And null pointer dereference hit in graphics driver: [ 93.466939] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 Jun 22 21:32:36 u-XPS-13-9xxx kernel: [ 93.466973] IP: [<ffffffff811ec887>] __kmalloc+0x97/0x250 Call Trace: [ 93.467896] [<ffffffff813f05d1>] ? radix_tree_insert+0x41/0xe0 [ 93.467933] [<ffffffffc01c6c3f>] alloc_gen8_temp_bitmaps+0x2f/0x80 [i915_bpo] [ 93.467968] [<ffffffffc01ca290>] gen8_alloc_va_range_3lvl+0xa0/0x9d0 [i915_bpo] [ 93.467992] [<ffffffff811a9a28>] ? shmem_getpage_gfp+0x6b8/0x830 [ 93.468012] [<ffffffff8141423d>] ? swiotlb_map_sg_attrs+0x6d/0x130 [ 93.468043] [<ffffffffc01cadff>] gen8_alloc_va_range+0x23f/0x4a0 [i915_bpo] [ 93.468077] [<ffffffffc01cce8b>] i915_vma_bind+0x9b/0x180 [i915_bpo] [ 93.468109] [<ffffffffc01d434a>] i915_gem_object_do_pin+0x89a/0xb10 [i915_bpo] [ 93.468144] [<ffffffffc01d45ed>] i915_gem_object_pin+0x2d/0x30 [i915_bpo] [ 93.468177] [<ffffffffc01c1c29>] i915_gem_execbuffer_reserve_vma.isra.18+0x99/0x160 [i915_bpo] [ 93.468213] [<ffffffffc01c207b>] i915_gem_execbuffer_reserve.isra.19+0x38b/0x3c0 [i915_bpo] [ 93.468213] [<ffffffffc01c207b>] cbuffer_reserve.isra.19+0x38b/0x3c0 [i915_bpo] [ 93.468289] [<ffffffffc01c34a7>] i915_gem_do_execbuffer.isra.22+0x6c7/0x12a0 [i915_bpo] [ 93.468336] [<ffffffff81221320>] ? poll_select_copy_remaining+0x140/0x140 [ 93.468377] [<ffffffff817c126f>] ? unix_stream_recvmsg+0x4f/0x70 [ 93.468412] [<ffffffff817bf050>] ? unix_ioctl+0x60/0x60 [ 93.468462] [<ffffffffc01c4d42>] i915_gem_execbuffer2+0xb2/0x250 [i915_bpo] Created attachment 221101 [details]
kern log 2
It's not necessarily to happen in i915, it's very randomly.
But we just found there is a new version of BIOS worth to try, will update the result here soon.
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.476758] xhci_hcd 0000:39:00.0: xHCI Host Controller
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.476767] xhci_hcd 0000:39:00.0: new USB bus registered, assigned bus number 3
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.480642] general protection fault: 0000 [#1] SMP
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.480682] Modules linked in: drbg ansi_cprng ctr ccm arc4 rfcomm uas usb_storage nvram msr uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core v4l2_common videodev media btusb btrtl hid_multitouch(OE) bnep i2c_designware_platform i2c_designware_core dell_wmi dell_laptop dcdbas nls_iso8859_1 x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_hdmi(OE) irqbypass crct10dif_pclmul dell_led crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_realtek(OE) snd_hda_codec_generic(OE) joydev input_leds serio_raw snd_hda_intel(OE) snd_hda_codec(OE) snd_hda_core(OE) ath10k_pci snd_hwdep ath10k_core snd_pcm ath snd_seq_midi snd_seq_midi_event mac80211 snd_rawmidi snd_seq snd_seq_device snd_timer cfg80211 rtsx_pci_ms snd memstick soundcore mei_me shpchp mei idma64 virt_dma processor_thermal_device intel_lpss_pci intel_soc_dts_iosf hci_uart btbcm btqca btintel bluetooth soc_button_array int3403_thermal tpm_crb int3400_thermal intel_lpss_acpi intel_hid(OE) int340x_thermal_zone acpi_als acpi_thermal_rel intel_lpss sparse_keymap acpi_pad mac_hid kfifo_buf industrialio parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq dm_mirror dm_region_hash dm_log rtsx_pci_sdmmc i915_bpo intel_ips psmouse i2c_algo_bit drm_kms_helper ahci rtsx_pci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops drm wmi i2c_hid hid pinctrl_sunrisepoint video pinctrl_intel fjes
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.481651] CPU: 3 PID: 163 Comm: kworker/u8:3 Tainted: G OE 4.4.0-24-generic #43
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.481696] Hardware name: Dell Inc. XPS 13 9360/ , BIOS 0.1.4 06/02/2016
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.481739] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.481771] task: ffff880174962c40 ti: ffff8801744c0000 task.ti: ffff8801744c0000
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.481810] RIP: 0010:[<ffffffff811eb20b>] [<ffffffff811eb20b>] kmem_cache_alloc_trace+0x7b/0x1f0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.481862] RSP: 0018:ffff8801744c3780 EFLAGS: 00010206
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.481891] RAX: 0000000000000000 RBX: 00000000024000c0 RCX: 000000000001a7be
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.481929] RDX: 000000000001a7bd RSI: 00000000024000c0 RDI: 0000000000019f80
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.481967] RBP: ffff8801744c37c0 R08: ffff88017e599f80 R09: ffff880179801b00
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482006] R10: 3946b525ade39c6f R11: ffffffff81ccc535 R12: 00000000024000c0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482043] R13: ffffffff811d9bdf R14: ffff880176156700 R15: ffff880179801b00
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482081] FS: 0000000000000000(0000) GS:ffff88017e580000(0000) knlGS:0000000000000000
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482124] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482155] CR2: 00007ffceaef0aa0 CR3: 0000000172c62000 CR4: 00000000003406e0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482193] Stack:
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482206] ffff8801744c37d8 024000c079801c00 0000000000000028 ffff880176156700
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482255] 00000000024000c0 ffff880176156710 ffff880176156700 00000000024000c0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482301] ffff8801744c3828 ffffffff811d9bdf 00000000024080c0 ffff8801744c3828
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482347] Call Trace:
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482367] [<ffffffff811d9bdf>] dma_pool_alloc+0x1ff/0x240
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482401] [<ffffffff8165456d>] xhci_alloc_container_ctx.isra.26+0x7d/0xe0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482440] [<ffffffff81656192>] xhci_alloc_command+0xb2/0x140
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482474] [<ffffffff816575f5>] xhci_mem_init+0x395/0xbf0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482506] [<ffffffff81663300>] ? xhci_pci_suspend+0x70/0x70
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482539] [<ffffffff8164edac>] xhci_init+0x7c/0x170
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482569] [<ffffffff8164f646>] xhci_gen_setup+0x276/0x4c0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482600] [<ffffffff81663651>] xhci_pci_setup+0x41/0x120
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482633] [<ffffffff8160f348>] usb_add_hcd+0x288/0xa00
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482665] [<ffffffff81620d41>] usb_hcd_pci_probe+0x181/0x4d0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482697] [<ffffffff816637bc>] xhci_pci_probe+0x2c/0x220
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482729] [<ffffffff8143acb5>] local_pci_probe+0x45/0xa0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482760] [<ffffffff8143c0f3>] pci_device_probe+0x103/0x150
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482793] [<ffffffff8154d7c2>] driver_probe_device+0x222/0x4a0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482827] [<ffffffff8154db41>] __device_attach_driver+0x71/0xa0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482861] [<ffffffff8154dad0>] ? __driver_attach+0x90/0x90
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482894] [<ffffffff8154b4c7>] bus_for_each_drv+0x67/0xb0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482926] [<ffffffff8154d49c>] __device_attach+0xdc/0x170
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482957] [<ffffffff8154d540>] device_attach+0x10/0x20
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482988] [<ffffffff8143090d>] pci_bus_add_device+0x3d/0x70
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483020] [<ffffffff81430d69>] pci_bus_add_devices+0x39/0x80
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483053] [<ffffffff81430d8a>] pci_bus_add_devices+0x5a/0x80
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483085] [<ffffffff81430d8a>] pci_bus_add_devices+0x5a/0x80
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483117] [<ffffffff81453cd4>] enable_slot+0x264/0x2f0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483146] [<ffffffff814537c3>] ? get_slot_status+0xa3/0xe0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483177] [<ffffffff81453f8b>] acpiphp_check_bridge.part.8+0xbb/0xf0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483213] [<ffffffff814547f4>] acpiphp_hotplug_notify+0x174/0x250
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483249] [<ffffffff81454680>] ? free_bridge+0x110/0x110
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483279] [<ffffffff81483037>] acpi_device_hotplug+0x3a2/0x3fe
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483314] [<ffffffff8147bbf2>] acpi_hotplug_work_fn+0x1e/0x29
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483349] [<ffffffff8109a175>] process_one_work+0x165/0x480
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483381] [<ffffffff8109a4db>] worker_thread+0x4b/0x4c0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483413] [<ffffffff8109a490>] ? process_one_work+0x480/0x480
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483446] [<ffffffff8109a490>] ? process_one_work+0x480/0x480
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483479] [<ffffffff810a06a8>] kthread+0xd8/0xf0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483507] [<ffffffff810a05d0>] ? kthread_create_on_node+0x1e0/0x1e0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483548] [<ffffffff8182600f>] ret_from_fork+0x3f/0x70
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483578] [<ffffffff810a05d0>] ? kthread_create_on_node+0x1e0/0x1e0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483612] Code: 08 65 4c 03 05 9f ef e1 7e 49 83 78 10 00 4d 8b 10 0f 84 21 01 00 00 4d 85 d2 0f 84 18 01 00 00 49 63 41 20 48 8d 4a 01 49 8b 39 <49> 8b 1c 02 4c 89 d0 65 48 0f c7 0f 0f 94 c0 84 c0 74 bb 49 63
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483812] RIP [<ffffffff811eb20b>] kmem_cache_alloc_trace+0x7b/0x1f0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.483852] RSP <ffff8801744c3780>
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.494588] ---[ end trace e81a65484491bbfc ]---
Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.494653] BUG: unable to handle kernel paging request at ffffffffffffffd8
Created attachment 221111 [details]
kern log 3
After upgrading to latest BIOS, it still hangs.
It happens when unplug and plug-in the USB key quickly.
This does show some issue when allocating memory for the xhci->lpm_command
when xhci driver is being loaded
>
> Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482440] [<ffffffff81656192>]
> xhci_alloc_command+0xb2/0x140
> Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482474] [<ffffffff816575f5>]
> xhci_mem_init+0x395/0xbf0
> Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482506] [<ffffffff81663300>] ?
> xhci_pci_suspend+0x70/0x70
> Jun 23 21:15:07 u-XPS-13-9360 kernel: [ 162.482539] [<ffffffff8164edac>]
> xhci_init+0x7c/0x170
And seems something tries to suspend xhci in the middle of this (xci_pci_suspend is called).
Created attachment 221331 [details]
kern log 4, disable xhci autosuspend
I tried disabling the xhci autosuspend("echo on" all "contorl" under "/sys/devices/pci0000:00/0000:00:14.0")
It still failed.
I always re-plugin the USB key quickly after un-plugining it, it more easily to reproduce the issue.
could you try adding init tracing to xhci can be done with: mount -t debugfs none /sys/kernel/debug echo xhci-hcd:xhci_dbg_init >> /sys/kernel/debug/tracing/set_event then trigger the issue, and show the output of /sys/kernel/debug/tracing/trace Does turning off default pagealloc debug make it work better? in kernel konfig: Enable debug page memory allocations by default? (DEBUG_PAGEALLOC_ENABLE_DEFAULT = n) I could reproduce a hang after enabling it and disconnecting the type-c usb device Created attachment 221431 [details]
kern log 5, tracing xhci-hcd:xhci_dbg_init event
Looks like I didn't enable that
$ grep PAGEALLOC /boot/config-4.4.0-24-generic
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
# CONFIG_DEBUG_PAGEALLOC is not set
And for the trace file, it's pretty hard to sync the trace file, since I can't copy the file after system hangs. I'm not sure if this file is useful.
Created attachment 221441 [details]
trace log 5, tracing xhci-hcd:xhci_dbg_init event
On my system I was only able to trigger this after adding some specific memory debugging options. I'n my case I tracked the fault down to dereferencing a xhci pointer after device was removed, freed and disabled. In short, xhci_pci_remove() touhces xhci->quirks after calling usb_hcd_pci_remove() usb_hcd_pci_remove() frees the hcd, with the xhci structure, and disables the whole pci device. I'll write a testpach in a few minutes Created attachment 221531 [details]
Patch for fixing paging errors on xhci removal
This attached patch works best if applied on top of
commit 27a41a83ec54d0edfcaf079310244e7f013a7701
xhci: Cleanup only when releasing primary hcd
Created attachment 221561 [details]
kern log 8, it hangs when I unplug the USB key
I've cherry-picked the commit
commit 27a41a83ec54d0edfcaf079310244e7f013a7701
xhci: Cleanup only when releasing primary hcd
and apply your patch, now I can reproduce this issue easily.
Created attachment 221571 [details]
kern log 9, it hangs when I unplug the USB key
Created attachment 221581 [details]
kern log 10, it hangs when I re-plugin the USB key
Created attachment 221591 [details]
kern log 11, it hangs when I re-plugin the USB key
Created attachment 221611 [details]
kern log 12
I also tried applying your patch on top of 4.7-rc5, and it still hangs
Created attachment 221621 [details]
kern log 13
Ok, so not the same issue then, but similar If you have gdb, and the kernel image with symbols could you check exactly where in the code we get the page fault. For example kernel used with log 12: find the instruction pointer from the oops: [ 410.775676] IP: [<ffffffff811fe4b7>] __kmalloc+0x97/0x230 Then check where that is exactly in the code. gdb vmlinux (gdb) li *( __kmalloc+0x97) It looks like it fails when we try to allocate memory after acpi pci hotplug thread took care of removing the pci devices. Could you add memory debugging to see if it detects where it goes wrong. under Kernel Hacking --> Memory Debugging enable: Debug page memory allocations Enable debug page memory allocations by default? Poison pages after freeing SLUB debugging on by default Kernel memory leak detector Debug VM (includig all options under it) Debug VM translations Check for stack overflows (it should set at least the following options) DEBUG_PAGEALLOC DEBUG_PAGEALLOC_ENABLE_DEFAULT PAGE_POISONING SLUB_DEBUG_ON DEBUG_KMEMLEAK DEBUG_VM DEBUG_VIRTUAL This will slowdown the machine a bit. Created attachment 221951 [details]
kern log 14
Created attachment 221961 [details]
gdb log for kern log 14
This is the gdb log for kern log 14
I didn't enable those memory options yet.
Created attachment 221971 [details]
attachment-21662-0.html
I will be out of office the week of the fourth of July. I will not be checking mail.
Created attachment 221981 [details]
kern log 15, enable page and mem related kernel options
This starts to involve more PCI hotplug, ACPI and memory management than just xhci. Asking for some assistance from the experts in those areas. https://marc.info/?l=linux-kernel&m=146764193131981&w=2 We can try to split this issue into smaller parts. Can this be reproduced by just unbinding the xhci the driver from the device with, keeping the usb key pluggen in? This way the xhci controller and PCI bridges will remain on the PCI bus and not interfere. Try several times (with patch in comment 16) # echo 0000:39:00.0 > /sys/bus/pci/drivers/xhci_hcd/unbind # (optinoally check with lspci -t that bridges and xhci are still present) # echo 0000:39:00.0 > /sys/bus/pci/drivers/xhci_hcd/bind Does this trigger the hang? If yes, then we can rule out the PCI hotplug parts and focus on xhci. Created attachment 222081 [details]
DMA memory debugging for xhci
This patch adds extra printk's for every dma memory function
With this patch added we can check that the freed dma memory matches the
one we allocated, and that the destroyed pools match the created ones.
Can you reproduce the issue with this patch added as well
Created attachment 222151 [details]
kern log 16
Applied the 2 patches on top of 4.7.0-rc5
and reproduce this issue by unbind / bind the usb port(0000:05:00.0)
(keep usb key plug-in in the port)
Thanks. I managed to reproduce this 2-3 times with another usb flashkey if I timed the physical disconnect correctly. Once memory debugging warned about double free before hang with call trace pointing to slub.c again. I removed all dma related memory freeing from xhci driver and was still able to reproduce it. Now I added extra prints to all kfree()s used in xhci, but suddely I can't reproduce this anymore. I've ordered a similar usb memory key as in this bug but have not received it yet. I'll be away for 3 weeks, back on Aug 1 Created attachment 222491 [details]
Patch to debug kfree in xhci
This is a debugging patch I would like to have included when triggering the hang.
If the double free warning is correct then this should show where it happens in xhci
Created attachment 222691 [details] kern log 17 I still can reproduce this issue after applied the patch on comment 34(all the 3 patches on top of 4.7-rc5) It emits OOPS while I plug-in the USB key at 206~207 sec, but it doesn't hang. And I unplug the USB key at 291, it hangs. Add a sleep while calling usb_remove_hcd seems work I can't trigger the hang now. diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c index 34b837a..1ad73a7 100644 --- a/drivers/usb/core/hcd.c +++ b/drivers/usb/core/hcd.c @@ -2952,6 +2952,8 @@ void usb_remove_hcd(struct usb_hcd *hcd) { struct usb_device *rhdev = hcd->self.root_hub; +printk("[%s:%d]\n", __FILE__, __LINE__); +msleep(1000); dev_info(hcd->self.controller, "remove, state %x\n", hcd->state); usb_get_dev(rhdev); The delay helped for me as well. The msleep() will allow other pending work to complete. It could be that there is some pending completions, or hub thread that doesn't get a chance to run before everything is freed. Once it finally can run it will dereference freed data. Even if complete() is called it doesnt mean that the work will be run synchronously. I wrote yet another testpsatch that checks if host is about to be removed and prevents queuing work, or letting the hub thread mess too much with xhci at removal. Created attachment 224041 [details]
Patch to prevent xhci work and hub thread at host removal
Try together with patch 221531,
With this I could not repoduce hang myself, but I was not able
to reproduce the issue reliably in the first place. need more testing
Created attachment 224361 [details]
kern log 18
Sorry to late to respond, I'm on a trip this week and do not bring the machine with me, so I ask my colleague to verify the patch.
No good news, we still can reproduce this issue after applying the 2 patches you mentioned above.
Status update Last place where msleep() helps is in hcd-pci.c at: +++ b/drivers/usb/core/hcd-pci.c @@ -368,6 +368,7 @@ void usb_hcd_pci_remove(struct pci_dev *dev) release_region(hcd->rsrc_start, hcd->rsrc_len); } + msleep(); usb_put_hcd(hcd); pci_disable_device(dev); } Here everything allocated by xhci is already freed, and the first hcd is also freed (xhci has two hcd's) only the last hcd is not yet freed. I figure there is still some thread/wq/timer function using hcd structure during the sleep. usb_put_hcd() will then free the hcd. It's likely that the async function touching the already freed hcd is using bus_to_hcd() to get the hcd struture so i modified bus_to_hcd() to return NULL if the bus roothub was already freed. This helped, and problem disappears. I'll add more debugging to this case to find the caller of bus_to_hcd() for that case to find the violating caller. The sleep works for me, too. Do you have any news about this issue? Found something touching the hcd structure when it should no longer happend. During the msleep(), just before usb_put_hcd(), bus_to_hcd() was called. Without the msleep the whole hcd would already be freed and we would corrupt memory and could hang the machine. Tracing back the caller of bus_to_hcd looks like this: BUS_TO_HCD with freed roothub called from device_release+0x32/0x90 from kobject_release+0x7a/0x190 from kobject_put+0x27/0x50 from put_device+0x17/0x20 from usb_put_dev+0x1a/0x20 [usbcore] from usb_release_interface+0x40/0x60 [usbcore] from device_release+0x32/0x90 from kobject_release+0x7a/0x190 from kobject_put+0x27/0x50 from put_device+0x17/0x20 from scsi_host_dev_release+0xdd/0x130 [scsi_mod] from device_release+0x32/0x90 from kobject_release+0x7a/0x190 from kobject_put+0x27/0x50 from put_device+0x17/0x20 from scsi_target_dev_release+0x1f/0x30 [scsi_mod] from device_release+0x32/0x90 from kobject_release+0x7a/0x190 from kobject_put+0x27/0x50 from put_device+0x17/0x20 from scsi_device_dev_release_usercontext+0x113/0x130 [scsi_mod] execute_in_process_context+0x85/0x90 from scsi_device_dev_release+0x1c/0x30 [scsi_mod] from device_release+0x32/0x90 from kobject_release+0x7a/0x190 from kobject_put+0x27/0x50 from put_device+0x17/0x20 from scsi_device_put+0x2b/0x30 [scsi_mod] from scsi_disk_put+0x32/0x50 [sd_mod] from sd_release+0x4d/0xb0 [sd_mod] from __blkdev_put+0x2a6/0x360 from __blkdev_put+0x23e/0x360 from blkdev_put+0x4e/0x170 from kill_block_super+0x41/0x70 from deactivate_locked_super+0x3e/0x70 from deactivate_super+0x5c/0x60 This far down the call chain it looks like its the device release from block device working its way up the parent devices up to scsi device -> scsi host dev -> usb interface. Either this chain becomes asynchronous at some point (like possible in the execute_in_process_context() call), or then it's asynch from the beginning, Need to dig deeper still. This commit seems can fix this issue ab2a4bf USB: don't free bandwidth_mutex too early Seems to work for me as well. Alan Stern who wrote that patch also commented that device remove is synchronous, but device release is asynchronous, and refcounting should take care of hcd memory getting freed when the last reference is removed. So right now my understanding what happened is: xhci uses two hcd's one for usb2 and one for usb3, these however share bandwidth and address mutex. xhci driver first removes the usb3 hcd, and then usb2 hcd (primary). The async _release_ for the hcd will however end up in reverse order, usb2 hcd (primary) first, and usb3 hcd second. As there is no usb2 device connected and it takes some time for the mass storage interface device to release its referece, the usb2 (primary) relase will free the mutexes that the usb3 hcd might use, and will free again (double free) The sleeping helped because it allowed the interface device to be released in time, not changing the order of how hcds were relased. This issue has been fixed by this commit ab2a4bf USB: don't free bandwidth_mutex too early |