Bug 120241 - System hang when plug/un-plug USB 3.1 key via thunderbolt port on Dell XPS 13
Summary: System hang when plug/un-plug USB 3.1 key via thunderbolt port on Dell XPS 13
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: Intel Linux
: P1 normal
Assignee: Greg Kroah-Hartman
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-06-14 11:49 UTC by AceLan Kao
Modified: 2016-10-07 00:21 UTC (History)
4 users (show)

See Also:
Kernel Version: 4.7-rc2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kern.log remove some sensitive data (37.50 KB, text/x-log)
2016-06-23 01:47 UTC, AceLan Kao
Details
lsusb -vvv (29.48 KB, text/x-log)
2016-06-23 01:48 UTC, AceLan Kao
Details
kern log 2 (43.00 KB, text/x-log)
2016-06-24 01:22 UTC, AceLan Kao
Details
kern log 3 (52.50 KB, text/x-log)
2016-06-24 03:09 UTC, AceLan Kao
Details
kern log 4, disable xhci autosuspend (55.00 KB, text/x-log)
2016-06-28 03:20 UTC, AceLan Kao
Details
kern log 5, tracing xhci-hcd:xhci_dbg_init event (29.45 KB, text/x-log)
2016-06-29 02:20 UTC, AceLan Kao
Details
trace log 5, tracing xhci-hcd:xhci_dbg_init event (15.59 KB, text/x-log)
2016-06-29 02:21 UTC, AceLan Kao
Details
Patch for fixing paging errors on xhci removal (1.13 KB, patch)
2016-06-30 14:33 UTC, Mathias Nyman
Details | Diff
kern log 8, it hangs when I unplug the USB key (24.46 KB, text/x-log)
2016-07-01 03:33 UTC, AceLan Kao
Details
kern log 9, it hangs when I unplug the USB key (39.21 KB, text/x-log)
2016-07-01 03:33 UTC, AceLan Kao
Details
kern log 10, it hangs when I re-plugin the USB key (27.74 KB, text/x-log)
2016-07-01 03:41 UTC, AceLan Kao
Details
kern log 11, it hangs when I re-plugin the USB key (52.94 KB, text/x-log)
2016-07-01 03:42 UTC, AceLan Kao
Details
kern log 12 (37.08 KB, text/x-log)
2016-07-01 09:14 UTC, AceLan Kao
Details
kern log 13 (43.75 KB, text/x-log)
2016-07-01 09:15 UTC, AceLan Kao
Details
kern log 14 (26.06 KB, text/x-log)
2016-07-04 04:57 UTC, AceLan Kao
Details
gdb log for kern log 14 (770 bytes, text/x-log)
2016-07-04 04:58 UTC, AceLan Kao
Details
attachment-21662-0.html (1.62 KB, text/html)
2016-07-04 05:07 UTC, Mario Limonciello
Details
kern log 15, enable page and mem related kernel options (59.74 KB, text/x-log)
2016-07-04 09:29 UTC, AceLan Kao
Details
DMA memory debugging for xhci (10.63 KB, patch)
2016-07-05 15:52 UTC, Mathias Nyman
Details | Diff
kern log 16 (66.38 KB, text/x-log)
2016-07-06 04:05 UTC, AceLan Kao
Details
Patch to debug kfree in xhci (10.85 KB, patch)
2016-07-08 15:55 UTC, Mathias Nyman
Details | Diff
kern log 17 (1.27 MB, text/x-log)
2016-07-11 02:26 UTC, AceLan Kao
Details
Patch to prevent xhci work and hub thread at host removal (4.46 KB, patch)
2016-07-16 00:12 UTC, Mathias Nyman
Details | Diff
kern log 18 (108.97 KB, text/x-log)
2016-07-20 07:31 UTC, AceLan Kao
Details

Comment 1 Greg Kroah-Hartman 2016-06-14 15:49:29 UTC
On Tue, Jun 14, 2016 at 11:49:42AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=120241
> 
>             Bug ID: 120241
>            Summary: System hang when plug/un-plug USB 3.1 key via
>                     thunderbolt port on Dell XPS 13
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 4.7-rc2

Please send to the linux-usb@vger.kernel.org mailing list.

Also be sure you update to the latest BIOS update for this hardware
platform, it fixed a number of thunderbolt issues.
Comment 2 Mathias Nyman 2016-06-22 11:59:50 UTC
Any more details on this one?
lsusb, dmesg? info about the usb key?

I tried plugging/unplugging several times on a Dell xps 15 with USB 3.1 Gen2 type-c hdd,  but could not reproduce it

I couldn't find any mail on the list either for this
Comment 3 AceLan Kao 2016-06-23 01:43:25 UTC
The issue only happens when you use udisk to automount/umount the disk, and it hangs during umount.

The USB key
Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd
Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: New USB device found, idVendor=13fe, idProduct=5500
Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: Product: OTG Type C
Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: Manufacturer: Verbatim
Jun 03 21:26:53 u-XPS-13-9360 kernel: usb 4-1: SerialNumber: 071057B7CFA22230
Comment 4 AceLan Kao 2016-06-23 01:47:28 UTC
Created attachment 221051 [details]
kern.log remove some sensitive data
Comment 5 AceLan Kao 2016-06-23 01:48:43 UTC
Created attachment 221061 [details]
lsusb -vvv
Comment 6 Mathias Nyman 2016-06-23 08:35:34 UTC
The log show the graphics driver hit a NULL pointer dereference after disconnected and connected hte usb device a couple times before.

Are the logs always simiar to this one? ending with 
NULL pointer dereference showing i915 driver in the call trance?

The log shows 
Alpine ridge found:
[   45.512706] xhci_hcd 0000:39:00.0: xHCI Host Controller
[   45.846031] usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd
[   45.863347] usb 4-1: SerialNumber: 071057B7CFA22230

Device disconnected (physically, or "ejected" with udisk):
this was the last device so alpine ridge xhci is removed from bus.
[   55.650331] xhci_hcd 0000:39:00.0: remove, state 1
[   55.650340] usb usb4: USB disconnect, device number 1
[   55.650341] usb 4-1: USB disconnect, device number 2
[   56.551596] xhci_hcd 0000:39:00.0: USB bus 4 deregistered
[   56.551603] xhci_hcd 0000:39:00.0: remove, state 4

Something connected, alpine ridge xhci appeared on PCI bus again:
[   65.931026] xhci_hcd 0000:39:00.0: xHCI Host Controller
[   66.262717] usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd
[   66.279565] usb 4-1: SerialNumber: 071057B7CFA22230

Device disconnected (physically, or "ejected" with udisk),
alpine ridge xhci controller removed from PCI bus:
[   68.825526] usb 4-1: USB disconnect, device number 2
[   71.026744] xhci_hcd 0000:39:00.0: remove, state 4
[   71.026771] usb usb4: USB disconnect, device number 1
[   71.027384] xhci_hcd 0000:39:00.0: Host not halted after 16000 microseconds.
[   71.028561] xhci_hcd 0000:39:00.0: USB bus 4 deregistered

Device, and xhci controller detected again:
[   81.313791] xhci_hcd 0000:39:00.0: xHCI Host Controller
[   81.634807] usb 4-1: new SuperSpeed USB device number 2 using xhci_hcd
[   81.655734] usb 4-1: SerialNumber: 071057B7CFA22230

And disconnected:
[   86.400422] xhci_hcd 0000:39:00.0: remove, state 1
[   86.400447] usb usb4: USB disconnect, device number 1
[   86.400449] usb 4-1: USB disconnect, device number 2
[   86.427127] xhci_hcd 0000:39:00.0: Host not halted after 16000 microseconds.
[   86.428312] xhci_hcd 0000:39:00.0: USB bus 4 deregistered

And null pointer dereference hit in graphics driver:
 [   93.466939] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
Jun 22 21:32:36 u-XPS-13-9xxx kernel: [   93.466973] IP: [<ffffffff811ec887>] __kmalloc+0x97/0x250

 Call Trace:
[   93.467896]  [<ffffffff813f05d1>] ? radix_tree_insert+0x41/0xe0
[   93.467933]  [<ffffffffc01c6c3f>] alloc_gen8_temp_bitmaps+0x2f/0x80 [i915_bpo]
[   93.467968]  [<ffffffffc01ca290>] gen8_alloc_va_range_3lvl+0xa0/0x9d0 [i915_bpo]
[   93.467992]  [<ffffffff811a9a28>] ? shmem_getpage_gfp+0x6b8/0x830
[   93.468012]  [<ffffffff8141423d>] ? swiotlb_map_sg_attrs+0x6d/0x130
[   93.468043]  [<ffffffffc01cadff>] gen8_alloc_va_range+0x23f/0x4a0 [i915_bpo]
[   93.468077]  [<ffffffffc01cce8b>] i915_vma_bind+0x9b/0x180 [i915_bpo]
[   93.468109]  [<ffffffffc01d434a>] i915_gem_object_do_pin+0x89a/0xb10 [i915_bpo]
[   93.468144]  [<ffffffffc01d45ed>] i915_gem_object_pin+0x2d/0x30 [i915_bpo]
[   93.468177]  [<ffffffffc01c1c29>] i915_gem_execbuffer_reserve_vma.isra.18+0x99/0x160 [i915_bpo]
[   93.468213]  [<ffffffffc01c207b>] i915_gem_execbuffer_reserve.isra.19+0x38b/0x3c0 [i915_bpo]
[   93.468213]  [<ffffffffc01c207b>] cbuffer_reserve.isra.19+0x38b/0x3c0 [i915_bpo]
[   93.468289]  [<ffffffffc01c34a7>] i915_gem_do_execbuffer.isra.22+0x6c7/0x12a0 [i915_bpo]
[   93.468336]  [<ffffffff81221320>] ? poll_select_copy_remaining+0x140/0x140
[   93.468377]  [<ffffffff817c126f>] ? unix_stream_recvmsg+0x4f/0x70
[   93.468412]  [<ffffffff817bf050>] ? unix_ioctl+0x60/0x60
[   93.468462]  [<ffffffffc01c4d42>] i915_gem_execbuffer2+0xb2/0x250 [i915_bpo]
Comment 7 AceLan Kao 2016-06-24 01:22:13 UTC
Created attachment 221101 [details]
kern log 2

It's not necessarily to happen in i915, it's very randomly.
But we just found there is a new version of BIOS worth to try, will update the result here soon.

Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.476758] xhci_hcd 0000:39:00.0: xHCI Host Controller
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.476767] xhci_hcd 0000:39:00.0: new USB bus registered, assigned bus number 3
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.480642] general protection fault: 0000 [#1] SMP 
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.480682] Modules linked in: drbg ansi_cprng ctr ccm arc4 rfcomm uas usb_storage nvram msr uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core v4l2_common videodev media btusb btrtl hid_multitouch(OE) bnep i2c_designware_platform i2c_designware_core dell_wmi dell_laptop dcdbas nls_iso8859_1 x86_pkg_temp_thermal coretemp kvm_intel kvm snd_hda_codec_hdmi(OE) irqbypass crct10dif_pclmul dell_led crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_codec_realtek(OE) snd_hda_codec_generic(OE) joydev input_leds serio_raw snd_hda_intel(OE) snd_hda_codec(OE) snd_hda_core(OE) ath10k_pci snd_hwdep ath10k_core snd_pcm ath snd_seq_midi snd_seq_midi_event mac80211 snd_rawmidi snd_seq snd_seq_device snd_timer cfg80211 rtsx_pci_ms snd memstick soundcore mei_me shpchp mei idma64 virt_dma processor_thermal_device intel_lpss_pci intel_soc_dts_iosf hci_uart btbcm btqca btintel bluetooth soc_button_array int3403_thermal tpm_crb int3400_thermal intel_lpss_acpi intel_hid(OE) int340x_thermal_zone acpi_als acpi_thermal_rel intel_lpss sparse_keymap acpi_pad mac_hid kfifo_buf industrialio parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq dm_mirror dm_region_hash dm_log rtsx_pci_sdmmc i915_bpo intel_ips psmouse i2c_algo_bit drm_kms_helper ahci rtsx_pci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops drm wmi i2c_hid hid pinctrl_sunrisepoint video pinctrl_intel fjes
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.481651] CPU: 3 PID: 163 Comm: kworker/u8:3 Tainted: G           OE   4.4.0-24-generic #43
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.481696] Hardware name: Dell Inc. XPS 13 9360/      , BIOS 0.1.4 06/02/2016
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.481739] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.481771] task: ffff880174962c40 ti: ffff8801744c0000 task.ti: ffff8801744c0000
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.481810] RIP: 0010:[<ffffffff811eb20b>]  [<ffffffff811eb20b>] kmem_cache_alloc_trace+0x7b/0x1f0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.481862] RSP: 0018:ffff8801744c3780  EFLAGS: 00010206
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.481891] RAX: 0000000000000000 RBX: 00000000024000c0 RCX: 000000000001a7be
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.481929] RDX: 000000000001a7bd RSI: 00000000024000c0 RDI: 0000000000019f80
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.481967] RBP: ffff8801744c37c0 R08: ffff88017e599f80 R09: ffff880179801b00
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482006] R10: 3946b525ade39c6f R11: ffffffff81ccc535 R12: 00000000024000c0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482043] R13: ffffffff811d9bdf R14: ffff880176156700 R15: ffff880179801b00
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482081] FS:  0000000000000000(0000) GS:ffff88017e580000(0000) knlGS:0000000000000000
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482124] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482155] CR2: 00007ffceaef0aa0 CR3: 0000000172c62000 CR4: 00000000003406e0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482193] Stack:
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482206]  ffff8801744c37d8 024000c079801c00 0000000000000028 ffff880176156700
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482255]  00000000024000c0 ffff880176156710 ffff880176156700 00000000024000c0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482301]  ffff8801744c3828 ffffffff811d9bdf 00000000024080c0 ffff8801744c3828
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482347] Call Trace:
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482367]  [<ffffffff811d9bdf>] dma_pool_alloc+0x1ff/0x240
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482401]  [<ffffffff8165456d>] xhci_alloc_container_ctx.isra.26+0x7d/0xe0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482440]  [<ffffffff81656192>] xhci_alloc_command+0xb2/0x140
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482474]  [<ffffffff816575f5>] xhci_mem_init+0x395/0xbf0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482506]  [<ffffffff81663300>] ? xhci_pci_suspend+0x70/0x70
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482539]  [<ffffffff8164edac>] xhci_init+0x7c/0x170
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482569]  [<ffffffff8164f646>] xhci_gen_setup+0x276/0x4c0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482600]  [<ffffffff81663651>] xhci_pci_setup+0x41/0x120
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482633]  [<ffffffff8160f348>] usb_add_hcd+0x288/0xa00
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482665]  [<ffffffff81620d41>] usb_hcd_pci_probe+0x181/0x4d0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482697]  [<ffffffff816637bc>] xhci_pci_probe+0x2c/0x220
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482729]  [<ffffffff8143acb5>] local_pci_probe+0x45/0xa0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482760]  [<ffffffff8143c0f3>] pci_device_probe+0x103/0x150
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482793]  [<ffffffff8154d7c2>] driver_probe_device+0x222/0x4a0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482827]  [<ffffffff8154db41>] __device_attach_driver+0x71/0xa0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482861]  [<ffffffff8154dad0>] ? __driver_attach+0x90/0x90
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482894]  [<ffffffff8154b4c7>] bus_for_each_drv+0x67/0xb0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482926]  [<ffffffff8154d49c>] __device_attach+0xdc/0x170
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482957]  [<ffffffff8154d540>] device_attach+0x10/0x20
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482988]  [<ffffffff8143090d>] pci_bus_add_device+0x3d/0x70
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483020]  [<ffffffff81430d69>] pci_bus_add_devices+0x39/0x80
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483053]  [<ffffffff81430d8a>] pci_bus_add_devices+0x5a/0x80
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483085]  [<ffffffff81430d8a>] pci_bus_add_devices+0x5a/0x80
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483117]  [<ffffffff81453cd4>] enable_slot+0x264/0x2f0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483146]  [<ffffffff814537c3>] ? get_slot_status+0xa3/0xe0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483177]  [<ffffffff81453f8b>] acpiphp_check_bridge.part.8+0xbb/0xf0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483213]  [<ffffffff814547f4>] acpiphp_hotplug_notify+0x174/0x250
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483249]  [<ffffffff81454680>] ? free_bridge+0x110/0x110
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483279]  [<ffffffff81483037>] acpi_device_hotplug+0x3a2/0x3fe
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483314]  [<ffffffff8147bbf2>] acpi_hotplug_work_fn+0x1e/0x29
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483349]  [<ffffffff8109a175>] process_one_work+0x165/0x480
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483381]  [<ffffffff8109a4db>] worker_thread+0x4b/0x4c0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483413]  [<ffffffff8109a490>] ? process_one_work+0x480/0x480
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483446]  [<ffffffff8109a490>] ? process_one_work+0x480/0x480
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483479]  [<ffffffff810a06a8>] kthread+0xd8/0xf0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483507]  [<ffffffff810a05d0>] ? kthread_create_on_node+0x1e0/0x1e0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483548]  [<ffffffff8182600f>] ret_from_fork+0x3f/0x70
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483578]  [<ffffffff810a05d0>] ? kthread_create_on_node+0x1e0/0x1e0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483612] Code: 08 65 4c 03 05 9f ef e1 7e 49 83 78 10 00 4d 8b 10 0f 84 21 01 00 00 4d 85 d2 0f 84 18 01 00 00 49 63 41 20 48 8d 4a 01 49 8b 39 <49> 8b 1c 02 4c 89 d0 65 48 0f c7 0f 0f 94 c0 84 c0 74 bb 49 63 
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483812] RIP  [<ffffffff811eb20b>] kmem_cache_alloc_trace+0x7b/0x1f0
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.483852]  RSP <ffff8801744c3780>
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.494588] ---[ end trace e81a65484491bbfc ]---
Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.494653] BUG: unable to handle kernel paging request at ffffffffffffffd8
Comment 8 AceLan Kao 2016-06-24 03:09:21 UTC
Created attachment 221111 [details]
kern log 3

After upgrading to latest BIOS, it still hangs.

It happens when unplug and plug-in the USB key quickly.
Comment 9 Mathias Nyman 2016-06-27 14:14:44 UTC
This does show some issue when allocating memory for the xhci->lpm_command 
when xhci driver is being loaded

> 
> Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482440]  [<ffffffff81656192>]
> xhci_alloc_command+0xb2/0x140
> Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482474]  [<ffffffff816575f5>]
> xhci_mem_init+0x395/0xbf0
> Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482506]  [<ffffffff81663300>] ?
> xhci_pci_suspend+0x70/0x70
> Jun 23 21:15:07 u-XPS-13-9360 kernel: [  162.482539]  [<ffffffff8164edac>]
> xhci_init+0x7c/0x170

And seems something tries to suspend xhci in the middle of this (xci_pci_suspend is called).
Comment 10 AceLan Kao 2016-06-28 03:20:28 UTC
Created attachment 221331 [details]
kern log 4, disable xhci autosuspend

I tried disabling the xhci autosuspend("echo on" all "contorl" under "/sys/devices/pci0000:00/0000:00:14.0")
It still failed.

I always re-plugin the USB key quickly after un-plugining it, it more easily to reproduce the issue.
Comment 11 Mathias Nyman 2016-06-28 10:58:55 UTC
could you try adding init tracing to xhci  
can be done with:

mount -t debugfs none /sys/kernel/debug
echo xhci-hcd:xhci_dbg_init >> /sys/kernel/debug/tracing/set_event

then trigger the issue, and show the output of

/sys/kernel/debug/tracing/trace
Comment 12 Mathias Nyman 2016-06-28 13:46:00 UTC
Does turning off default pagealloc debug make it work better?

in kernel konfig:
Enable debug page memory allocations by default?
 (DEBUG_PAGEALLOC_ENABLE_DEFAULT = n)

I could reproduce a hang after enabling it and disconnecting the
type-c usb device
Comment 13 AceLan Kao 2016-06-29 02:20:25 UTC
Created attachment 221431 [details]
kern log 5, tracing xhci-hcd:xhci_dbg_init event

Looks like I didn't enable that
$ grep PAGEALLOC /boot/config-4.4.0-24-generic 
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
# CONFIG_DEBUG_PAGEALLOC is not set

And for the trace file, it's pretty hard to sync the trace file, since I can't copy the file after system hangs. I'm not sure if this file is useful.
Comment 14 AceLan Kao 2016-06-29 02:21:06 UTC
Created attachment 221441 [details]
trace log 5, tracing xhci-hcd:xhci_dbg_init event
Comment 15 Mathias Nyman 2016-06-30 14:20:27 UTC
On my system I was only able to trigger this after adding some specific
memory debugging options. I'n my case I tracked the fault down to 
dereferencing a xhci pointer after device was removed, freed and disabled.

In short,  xhci_pci_remove() touhces xhci->quirks after
calling usb_hcd_pci_remove()

usb_hcd_pci_remove() frees the hcd, with the xhci structure, and disables
the whole pci device.

I'll write a testpach in a few minutes
Comment 16 Mathias Nyman 2016-06-30 14:33:15 UTC
Created attachment 221531 [details]
Patch for fixing paging errors on xhci removal

This attached patch works best if applied on top of
commit 27a41a83ec54d0edfcaf079310244e7f013a7701
    xhci: Cleanup only when releasing primary hcd
Comment 17 AceLan Kao 2016-07-01 03:33:24 UTC
Created attachment 221561 [details]
kern log 8, it hangs when I unplug the USB key

I've cherry-picked the commit
commit 27a41a83ec54d0edfcaf079310244e7f013a7701
    xhci: Cleanup only when releasing primary hcd
and apply your patch, now I can reproduce this issue easily.
Comment 18 AceLan Kao 2016-07-01 03:33:56 UTC
Created attachment 221571 [details]
kern log 9, it hangs when I unplug the USB key
Comment 19 AceLan Kao 2016-07-01 03:41:49 UTC
Created attachment 221581 [details]
kern log 10, it hangs when I re-plugin the USB key
Comment 20 AceLan Kao 2016-07-01 03:42:14 UTC
Created attachment 221591 [details]
kern log 11, it hangs when I re-plugin the USB key
Comment 21 AceLan Kao 2016-07-01 09:14:53 UTC
Created attachment 221611 [details]
kern log 12

I also tried applying your patch on top of 4.7-rc5, and it still hangs
Comment 22 AceLan Kao 2016-07-01 09:15:19 UTC
Created attachment 221621 [details]
kern log 13
Comment 23 Mathias Nyman 2016-07-01 13:32:09 UTC
Ok, so not the same issue then, but similar

If you have gdb, and the kernel image with symbols could you 
check exactly where in the code we get the page fault.

For example kernel used with log 12:
find the instruction pointer from the oops:
[  410.775676] IP: [<ffffffff811fe4b7>] __kmalloc+0x97/0x230

Then check where that is exactly in the code.

gdb vmlinux
(gdb) li *( __kmalloc+0x97)

It looks like it fails when we try to allocate memory after 
acpi pci hotplug thread took care of removing the pci devices.
Comment 24 Mathias Nyman 2016-07-01 14:15:04 UTC
Could you add memory debugging to see if it detects where it goes wrong.

under Kernel Hacking --> Memory Debugging
enable:

Debug page memory allocations
  Enable debug page memory allocations by default?
Poison pages after freeing
SLUB debugging on by default
Kernel memory leak detector
Debug VM (includig all options under it)
Debug VM translations
Check for stack overflows

(it should set at least the following options)

DEBUG_PAGEALLOC
DEBUG_PAGEALLOC_ENABLE_DEFAULT 
PAGE_POISONING
SLUB_DEBUG_ON
DEBUG_KMEMLEAK
DEBUG_VM
DEBUG_VIRTUAL

This will slowdown the machine a bit.
Comment 25 AceLan Kao 2016-07-04 04:57:14 UTC
Created attachment 221951 [details]
kern log 14
Comment 26 AceLan Kao 2016-07-04 04:58:21 UTC
Created attachment 221961 [details]
gdb log for kern log 14

This is the gdb log for kern log 14

I didn't enable those memory options yet.
Comment 27 Mario Limonciello 2016-07-04 05:07:02 UTC
Created attachment 221971 [details]
attachment-21662-0.html

 I will be out of office the week of the fourth of July.  I will not be checking mail.
Comment 28 AceLan Kao 2016-07-04 09:29:55 UTC
Created attachment 221981 [details]
kern log 15, enable page and mem related kernel options
Comment 29 Mathias Nyman 2016-07-04 14:29:52 UTC
This starts to involve more PCI hotplug, ACPI and memory management than
just xhci. Asking for some assistance from the experts in those areas.   

https://marc.info/?l=linux-kernel&m=146764193131981&w=2
Comment 30 Mathias Nyman 2016-07-05 11:17:15 UTC
We can try to split this issue into smaller parts.
Can this be reproduced by just unbinding the xhci the driver from the
device with, keeping the usb key pluggen in? This way the xhci controller and PCI bridges will remain on the PCI bus and not interfere.

Try several times (with patch in comment 16)

# echo 0000:39:00.0 > /sys/bus/pci/drivers/xhci_hcd/unbind  
# (optinoally check with lspci -t that bridges and xhci are still present)
# echo 0000:39:00.0 > /sys/bus/pci/drivers/xhci_hcd/bind  

Does this trigger the hang?
If yes, then we can rule out the PCI hotplug parts and focus on xhci.
Comment 31 Mathias Nyman 2016-07-05 15:52:29 UTC
Created attachment 222081 [details]
DMA memory debugging for xhci

This patch adds extra printk's for every dma memory function

With this patch added we can check that the freed dma memory matches the
one we allocated, and that the destroyed pools match the created ones.

Can you reproduce the issue with this patch added as well
Comment 32 AceLan Kao 2016-07-06 04:05:12 UTC
Created attachment 222151 [details]
kern log 16

Applied the 2 patches on top of 4.7.0-rc5
and reproduce this issue by unbind / bind the usb port(0000:05:00.0)
(keep usb key plug-in in the port)
Comment 33 Mathias Nyman 2016-07-08 15:22:04 UTC
Thanks.

I managed to reproduce this 2-3 times with another usb flashkey if I timed
the physical disconnect correctly.

Once memory debugging warned about double free before hang with call trace pointing to slub.c again.

I removed all dma related memory freeing from xhci driver and was still able to reproduce it. 

Now I added extra prints to all kfree()s used in xhci, but suddely I can't reproduce this anymore. 

I've ordered a similar usb memory key as in this bug but have not received it yet.

I'll be away for 3 weeks, back on Aug 1
Comment 34 Mathias Nyman 2016-07-08 15:55:10 UTC
Created attachment 222491 [details]
Patch to debug kfree in xhci

This is a debugging patch I would like to have included when triggering the hang.
If the double free warning is correct then this should show where it happens in xhci
Comment 35 AceLan Kao 2016-07-11 02:26:23 UTC
Created attachment 222691 [details]
kern log 17

I still can reproduce this issue after applied the patch on comment 34(all the 3 patches on top of 4.7-rc5)
It emits OOPS while I plug-in the USB key at 206~207 sec, but it doesn't hang.
And I unplug the USB key at 291, it hangs.
Comment 36 AceLan Kao 2016-07-12 08:57:04 UTC
Add a sleep while calling usb_remove_hcd seems work
I can't trigger the hang now.

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 34b837a..1ad73a7 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -2952,6 +2952,8 @@ void usb_remove_hcd(struct usb_hcd *hcd)
 {
        struct usb_device *rhdev = hcd->self.root_hub;
 
+printk("[%s:%d]\n", __FILE__, __LINE__);
+msleep(1000);
        dev_info(hcd->self.controller, "remove, state %x\n", hcd->state);
 
        usb_get_dev(rhdev);
Comment 37 Mathias Nyman 2016-07-16 00:08:29 UTC
The delay helped for me as well.
The msleep() will allow other pending work to complete.
It could be that there is some pending completions,  or hub thread
that doesn't get a chance to run before everything is freed. Once
it finally can run it will dereference freed data. 
Even if complete() is called it doesnt mean that the work will be
run synchronously.

I wrote yet another testpsatch that checks if host is about to be removed
and prevents queuing work, or letting the hub thread mess too much with xhci
at removal.
Comment 38 Mathias Nyman 2016-07-16 00:12:51 UTC
Created attachment 224041 [details]
Patch to prevent xhci work and hub thread at host removal

Try together with patch 221531,
With this I could not repoduce hang myself, but I was not able
to reproduce the issue reliably in the first place. need more testing
Comment 39 AceLan Kao 2016-07-20 07:31:32 UTC
Created attachment 224361 [details]
kern log 18

Sorry to late to respond, I'm on a trip this week and do not bring the machine with me, so I ask my colleague to verify the patch.
No good news, we still can reproduce this issue after applying the 2 patches you mentioned above.
Comment 40 Mathias Nyman 2016-08-05 14:30:05 UTC
Status update

Last place where msleep() helps is in hcd-pci.c at:

+++ b/drivers/usb/core/hcd-pci.c
@@ -368,6 +368,7 @@ void usb_hcd_pci_remove(struct pci_dev *dev)
                release_region(hcd->rsrc_start, hcd->rsrc_len);
        }
 
+       msleep();
        usb_put_hcd(hcd);
        pci_disable_device(dev);
 }

Here everything allocated by xhci is already freed, and the first hcd is also freed (xhci has two hcd's) only the last hcd is not yet freed.

I figure there is still some thread/wq/timer function using hcd structure
during the sleep. usb_put_hcd() will then free the hcd.

It's likely that the async function touching the already freed hcd is using
bus_to_hcd() to get the hcd struture so i modified bus_to_hcd() to return NULL if the bus roothub was already freed.

This helped, and problem disappears.

I'll add more debugging to this case to find the caller of bus_to_hcd() for that case to find the violating caller.
Comment 41 AceLan Kao 2016-08-12 07:44:56 UTC
The sleep works for me, too.
Do you have any news about this issue?
Comment 42 Mathias Nyman 2016-08-16 16:51:30 UTC
Found something touching the hcd structure when it should no longer happend.

During the msleep(), just before usb_put_hcd(), bus_to_hcd() was called.
Without the msleep the whole hcd would already be freed and we would corrupt memory and could hang the machine.

Tracing back the caller of bus_to_hcd looks like this:

BUS_TO_HCD with freed roothub called from 
                   device_release+0x32/0x90 from
                   kobject_release+0x7a/0x190 from
                   kobject_put+0x27/0x50 from
                   put_device+0x17/0x20 from
                   usb_put_dev+0x1a/0x20 [usbcore] from
                   usb_release_interface+0x40/0x60 [usbcore] from
                   device_release+0x32/0x90 from
                   kobject_release+0x7a/0x190 from
                   kobject_put+0x27/0x50 from
                   put_device+0x17/0x20 from
                   scsi_host_dev_release+0xdd/0x130 [scsi_mod] from
                   device_release+0x32/0x90 from
                   kobject_release+0x7a/0x190 from
                   kobject_put+0x27/0x50 from
                   put_device+0x17/0x20 from
                   scsi_target_dev_release+0x1f/0x30 [scsi_mod] from
                   device_release+0x32/0x90 from
                   kobject_release+0x7a/0x190 from
                   kobject_put+0x27/0x50 from
                   put_device+0x17/0x20 from
                   scsi_device_dev_release_usercontext+0x113/0x130 [scsi_mod]
                   execute_in_process_context+0x85/0x90 from
                   scsi_device_dev_release+0x1c/0x30 [scsi_mod] from
                   device_release+0x32/0x90 from
                   kobject_release+0x7a/0x190 from
                   kobject_put+0x27/0x50 from
                   put_device+0x17/0x20 from
                   scsi_device_put+0x2b/0x30 [scsi_mod] from
                   scsi_disk_put+0x32/0x50 [sd_mod] from
                   sd_release+0x4d/0xb0 [sd_mod] from
                   __blkdev_put+0x2a6/0x360
                   from __blkdev_put+0x23e/0x360 from
                   blkdev_put+0x4e/0x170 from
                   kill_block_super+0x41/0x70 from
                   deactivate_locked_super+0x3e/0x70 from
                   deactivate_super+0x5c/0x60

This far down the call chain it looks like its the device release from
block device working its way up the parent devices up to scsi device -> scsi host dev -> usb interface.  Either this chain becomes asynchronous at some point (like possible in the execute_in_process_context() call), or then it's asynch from the beginning, 

Need to dig deeper still.
Comment 43 AceLan Kao 2016-08-24 05:11:03 UTC
This commit seems can fix this issue
ab2a4bf USB: don't free bandwidth_mutex too early
Comment 44 Mathias Nyman 2016-08-24 13:42:35 UTC
Seems to work for me as well.

Alan Stern who wrote that patch also commented that device remove is synchronous, but device release is asynchronous, and refcounting should take care of hcd memory getting freed when the last reference is removed.

So right now my understanding what happened is:

xhci uses two hcd's one for usb2 and one for usb3, these however share bandwidth and address mutex. xhci driver first removes the usb3 hcd, and then usb2 hcd (primary). The async _release_ for the hcd will however end up in reverse order,
usb2 hcd (primary) first, and usb3 hcd second. As there is no usb2 device connected and it takes some time for the mass storage interface device to release its referece, the usb2 (primary) relase will free the mutexes that the usb3 hcd
might use, and will free again (double free)

The sleeping helped because it allowed the interface device to be released in time, not changing the order of how hcds were relased.
Comment 45 AceLan Kao 2016-10-07 00:21:16 UTC
This issue has been fixed by this commit
ab2a4bf USB: don't free bandwidth_mutex too early

Note You need to log in before you can comment on or make changes to this bug.