Bug 85741 - xhci crashes kernel with: BUG: unable to handle kernel NULL pointer dereference
Summary: xhci crashes kernel with: BUG: unable to handle kernel NULL pointer dereference
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB
Hardware: All Linux
Importance: P1 high
Assignee: XHCI bugs virtual user
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-07 06:38 UTC by rocko
Modified: 2014-12-01 11:22 UTC

See Also:
Kernel Version: 3.17.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Log showing xhci oops at xhci_check_streams_endpoint (30.84 KB, application/octet-stream)
2014-10-07 06:38 UTC, rocko
Log from 3.17-rc7 showing xhci oops at xhci_alloc_streams (27.45 KB, text/plain)
2014-10-07 06:39 UTC, rocko

Description rocko 2014-10-07 06:38:23 UTC
Created attachment 152721
Log showing xhci oops at xhci_check_streams_endpoint

Since I updated the firmware on my WD 1230 USB3 external hard drive, whenever the kernel tries to wake it from standby (e.g. by running fdisk -l or just browsing to the drive in nautilus), the kernel locks up completely with this error, and I have to hard-reset both the laptop and the hard drive.

The log shows uas_eh_abort_handler being called, and then uas_eh_bus_reset_handler, and then shortly afterwards the oops occurs.

Often the lockup is catastrophic enough that the log isn't even persisted to disc, but I've managed to capture a couple of logs showing it crashing in xhci_alloc_streams and xhci_check_streams_endpoint.

I can manually put the hard drive into standby mode with hdparm -y, and it wakes up successfully after that. The crash only seems to happen after the drive puts itself into standby following the default 10 minutes of idle time. However, note that when it wakes up after being manually put into standby, there is a delay of a few seconds before the drive starts to spin up; normally the drive should start to spin up immediately.

I tried this once with the drive plugged into a USB2 port, and it didn't lock up the kernel, but the drive was still not accessible.

This also happens with the 3.16 kernel.
Comment 1 rocko 2014-10-07 06:39:45 UTC
Created attachment 152731
Log from 3.17-rc7 showing xhci oops at xhci_alloc_streams
Comment 2 Greg Kroah-Hartman 2014-10-07 16:24:00 UTC
On Tue, Oct 07, 2014 at 06:38:23AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=85741
> 
>             Bug ID: 85741
>            Summary: xhci crashes kernel with: BUG: unable to handle kernel
>                     NULL pointer dereference

Please send to linux-usb@vger.kernel.org
Comment 3 rocko 2014-10-12 01:11:51 UTC
I did a bisect and found that the regression first appears between 3.15-rc4 and 3.15-rc5. The bisect shows:


 git bisect start
 # good: [89ca3b881987f5a4be4c5dbaa7f0df12bbdde2fd] Linux 3.15-rc4
 git bisect good 89ca3b881987f5a4be4c5dbaa7f0df12bbdde2fd
 # bad: [d6d211db37e75de2ddc3a4f979038c40df7cc79c] Linux 3.15-rc5
 git bisect bad d6d211db37e75de2ddc3a4f979038c40df7cc79c
 # bad: [c8ea5a22bd3b27d68ec2f95483ce8bfe7f114933] net: macb: Fix race between HW and driver
 git bisect bad c8ea5a22bd3b27d68ec2f95483ce8bfe7f114933
 # bad: [6d4596905b65bf4c63c1a008f50bf385fa49f19b] Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
 git bisect bad 6d4596905b65bf4c63c1a008f50bf385fa49f19b

 The merge base 6d4596905b65bf4c63c1a008f50bf385fa49f19b is bad.
 This means the bug has been fixed between 6d4596905b65bf4c63c1a008f50bf385fa49f19b and [89ca3b881987f5a4be4c5dbaa7f0df12bbdde2fd].


I don't know how to keep bisecting from here. The kernel compiled from 6d4596905b65bf4c63c1a008f50bf385fa49f19b says that it's an rc1 kernel (and rc2 doesn't exhibit the bug), so I guess there's some merge fun going on here like the message says.
Comment 4 rocko 2014-10-17 04:30:04 UTC
The regression is still present in 3.17.1.
Comment 5 rocko 2014-10-19 15:23:22 UTC
The initial crash appears to be a UAS issue. Disabling UAS for this particular device by adding the following line to /etc/modprobe.d/usb-storage.conf stops the initial uas error and the resulting kernel crash:

options usb-storage quirks=1058:1230:u

Instead, the drive wakes up as expected.
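
For anyone wanting to try the same workaround on a different device, here is a minimal sketch of how the option line is put together. The helper name make_quirk_line is hypothetical (it is not part of any tool); the vendor:product pair 1058:1230 and the u flag are the ones used above, and u is the usb-storage quirk flag that tells the kernel not to bind the uas driver.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not an existing tool):
# builds the modprobe option line for a usb-storage quirk from a USB
# vendor ID, product ID, and quirk flag letters.
make_quirk_line() {
    vid=$1
    pid=$2
    flags=$3
    printf 'options usb-storage quirks=%s:%s:%s\n' "$vid" "$pid" "$flags"
}

# Print the line for the WD drive in this report (1058:1230, flag u).
make_quirk_line 1058 1230 u
```

The printed line would then go into /etc/modprobe.d/usb-storage.conf as root. After replugging the drive, lsusb -t should list usb-storage rather than uas as the driver for the device. The same quirk can presumably also be passed on the kernel command line as usb-storage.quirks=1058:1230:u.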
Comment 6 Alan 2014-10-23 15:22:45 UTC
+xhci team
Comment 7 rocko 2014-12-01 11:22:04 UTC
3.18-rc7 is working somewhat better, in that it doesn't lock up the entire machine. However, the external drive does fall off the bus and both the PC and the external drive have to be power cycled before the drive can be read again.

Should this be reported as a UAS issue instead of an xhci issue?

The log is:

Dec  1 19:19:47 unicorn kernel: [  776.521627] sd 6:0:0:0: [sdb] uas_eh_abort_handler 0 tag 2 inflight: CMD IN 
Dec  1 19:19:47 unicorn kernel: [  776.521647] sd 6:0:0:0: [sdb] CDB: 
Dec  1 19:19:47 unicorn kernel: [  776.521650] Read(10): 28 00 00 02 e2 44 00 00 04 00
Dec  1 19:19:47 unicorn kernel: [  776.522067] scsi host6: uas_eh_bus_reset_handler start
Dec  1 19:19:48 unicorn kernel: [  777.413355] usb 2-2.1: device not accepting address 3, error -22
Dec  1 19:19:48 unicorn kernel: [  778.189078] usb 2-2.1: device not accepting address 3, error -22
Dec  1 19:19:49 unicorn kernel: [  778.964872] usb 2-2.1: device not accepting address 3, error -22
Dec  1 19:19:50 unicorn kernel: [  779.740619] usb 2-2.1: device not accepting address 3, error -22
Dec  1 19:19:50 unicorn kernel: [  779.760800] scsi host6: uas_post_reset: alloc streams error -19 after reset
Dec  1 19:19:50 unicorn kernel: [  779.772266] usb 2-2.1: USB disconnect, device number 3
Dec  1 19:19:50 unicorn kernel: [  779.777069] sd 6:0:0:0: [sdb] Synchronizing SCSI cache
Dec  1 19:19:50 unicorn udisksd[3296]: Cleaning up mount point /media/wdc-1230 (device 8:17 no longer exist)
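
To pull just the UAS error-handling sequence out of a longer log, something like the following rough sketch may help. The function name uas_events is hypothetical, and the pattern only covers the three uas function names that appear in the log above; other kernels may log different names.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration): reads kernel log
# lines on stdin and keeps only those that mention the UAS error-handling
# functions seen in this report.
uas_events() {
    grep -E 'uas_eh_abort_handler|uas_eh_bus_reset_handler|uas_post_reset'
}

# Example with three lines taken from the log above; the middle line
# (a plain usb enumeration error) is filtered out.
printf '%s\n' \
    'sd 6:0:0:0: [sdb] uas_eh_abort_handler 0 tag 2 inflight: CMD IN' \
    'usb 2-2.1: device not accepting address 3, error -22' \
    'scsi host6: uas_post_reset: alloc streams error -19 after reset' \
    | uas_events
```

On a live system the input would typically come from dmesg or the kernel log file instead, e.g. dmesg | uas_events.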
