Bug 197875 - Processes hang on attempted access of WDC WD30-EZRX 3TB HDD on HP Z420 Workstation
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: Intel Linux
Importance: P1 normal
Assignee: linux-scsi@vger.kernel.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-14 22:06 UTC by Chuck Burt
Modified: 2018-03-23 21:35 UTC

See Also:
Kernel Version: 4.13+
Tree: Mainline
Regression: No


Attachments
Output of requested command (74.76 KB, text/plain)
2017-11-29 14:07 UTC, Chuck Burt
Output of requested command as su (159.43 KB, text/plain)
2017-11-29 14:14 UTC, Chuck Burt
Output of requested command as su (136.14 KB, text/plain)
2017-11-29 14:34 UTC, Chuck Burt
Output of lspci on 4.10.x kernel (7.68 KB, text/plain)
2017-11-30 16:04 UTC, Chuck Burt
Git Bisect Log 1 - 20180323 (2.81 KB, text/plain)
2018-03-23 14:00 UTC, Chuck Burt
Git Bisect Log 2 - 20180323 (3.17 KB, text/plain)
2018-03-23 14:00 UTC, Chuck Burt

Description Chuck Burt 2017-11-14 22:06:16 UTC
When booted under kernel 4.13.x, processes such as parted, boot-info, and gparted always hang on startup when a Western Digital Green WD30-EZRX 3TB HDD is attached. The drive works as expected under kernel 4.10.x with everything else about the system unchanged; only the booted kernel differs.

Full specs of the machine if useful: https://www.support.hp.com/id-en/document/c03277050

More details on Ubuntu's LaunchPad (where they asked me to come here to file an upstream bug): https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730746
Comment 1 Bart Van Assche 2017-11-14 22:26:27 UTC
Please provide the output of the following command after having reproduced the hang:

    dmesg -c >/dev/null; echo w > /proc/sysrq-trigger; dmesg

Additionally, if you know how to build the kernel yourself, it would be helpful if you could bisect this issue. Documentation is available e.g. at https://git-scm.com/docs/git-bisect.
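
For anyone new to bisecting, here is a toy sketch of the workflow in a throwaway repository rather than the kernel tree. The function name `bisect_demo`, the eight-commit history, and the `state` file are all made up for illustration; only the `git bisect` subcommands themselves are real.

```shell
# Toy demonstration of git bisect (hypothetical history, not the kernel tree).
# Eight commits are created; the "regression" is introduced in commit 5.
bisect_demo() {
    local repo
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q . 2>/dev/null
        git config user.email "demo@example.invalid"
        git config user.name "Bisect Demo"
        for i in 1 2 3 4 5 6 7 8; do
            # "state" says good before commit 5 and bad from commit 5 onwards
            if [ "$i" -lt 5 ]; then echo good > state; else echo bad > state; fi
            echo "$i" > counter
            git add state counter
            git commit -q -m "commit $i"
        done
        first=$(git rev-list --max-parents=0 HEAD)
        # Bad revision first, then the known-good one
        git bisect start HEAD "$first" >/dev/null 2>&1
        # Exit status 0 marks a revision good, 1 marks it bad
        git bisect run grep -q good state >/dev/null 2>&1
        # refs/bisect/bad now points at the first bad commit
        git log -1 --format=%s refs/bisect/bad
        git bisect reset >/dev/null 2>&1
    )
    rm -rf "$repo"
}
```

Calling `bisect_demo` prints the subject of the first commit in which `state` no longer contains `good`, which is `commit 5` in this toy history. On a kernel tree the same steps would presumably begin with `git bisect start v4.13 v4.10`, and each step would be a manual build, boot, and `git bisect good` or `git bisect bad` instead of `git bisect run`.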
Comment 2 Chuck Burt 2017-11-29 14:07:59 UTC
Created attachment 260925
Output of requested command

I reproduced the hang and ran the command as requested.  See attached file output-20171129.txt

Building the kernel is something I could attempt tackling, but as a newbie I'm highly likely to mess something up.  Either way, it will be a few weeks before I can get to it (best case).  So I _really_ hope this provides the clue needed!
Comment 3 Chuck Burt 2017-11-29 14:14:44 UTC
Created attachment 260927
Output of requested command as su

After reading the first few lines of the last attachment, it occurred to me that running this command as su might be useful.  See attached.
Comment 4 Chuck Burt 2017-11-29 14:34:15 UTC
Created attachment 260929
Output of requested command as su
Comment 5 Bart Van Assche 2017-11-29 16:36:47 UTC
So command processing got stuck. Since there are two code paths in recent kernels, we need to know whether or not scsi-mq was used. Hence, please provide the output of the following command:

    for d in /sys/block/*; do sfx=""; [ -e "$d/mq" ] && sfx=" [mq]"; echo "$d$sfx"; done

If the above command reports that scsi-mq is being used for the WDC disk, please check whether the following command resolves the lockup:

    for d in /sys/kernel/debug/block/*/state; do echo kick > "$d"; done
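
The scsi-mq check in this comment can also be written as a small function, which is easier to rerun and to try against a fake directory. This is only a sketch: the name `list_block_queues` and its optional directory argument are assumptions added for illustration, and the logic is the same `mq` directory test as the one-liner.

```shell
# List block devices under a sysfs root, tagging those that have an "mq" directory.
# The optional argument (default: /sys/block) exists only to make the function testable.
list_block_queues() {
    local root=${1:-/sys/block} d sfx
    for d in "$root"/*; do
        [ -e "$d" ] || continue        # skip the literal pattern if nothing matched
        sfx=""
        [ -e "$d/mq" ] && sfx=" [mq]"  # an mq directory means the blk-mq path is in use
        echo "$d$sfx"
    done
}
```

Run without arguments, it prints one line per entry in `/sys/block`, with ` [mq]` appended where the multiqueue path is in use.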
Comment 6 Chuck Burt 2017-11-29 20:01:10 UTC
> # for d in /sys/block/*; do sfx=""; [ -e "$d/mq" ] && sfx=" [mq]"; echo "$d$sfx"; done
> /sys/block/loop0 [mq]
> /sys/block/loop1 [mq]
> /sys/block/loop2 [mq]
> /sys/block/loop3 [mq]
> /sys/block/loop4 [mq]
> /sys/block/loop5 [mq]
> /sys/block/loop6 [mq]
> /sys/block/loop7 [mq]
> /sys/block/sda
> /sys/block/sdb
> /sys/block/sdc
> /sys/block/sr0
Comment 7 Bart Van Assche 2017-11-29 20:16:11 UTC
That's weird: there are no known queue lockup bugs in the legacy block/SCSI core layers. Is the WDC harddisk perhaps controlled by an HBA? Can you provide the output of lspci (run as root)?
Comment 8 Chuck Burt 2017-11-30 16:04:40 UTC
Created attachment 260953
Output of lspci on 4.10.x kernel

I won't be able to boot into the newer kernel for about a week. However, since `lspci` is hardware-oriented, I'm sharing the output from the older kernel in case it's helpful.  Please let me know if you want me to run it on the new one instead and I'll get it when I can.
Comment 9 Bart Van Assche 2017-12-07 21:40:24 UTC
My hope was that the list of PCI devices would show a PCI HBA whose driver has been modified recently. Since that's not the case, I'm out of ideas about what the root cause of this bug could be. Unless someone else has an idea about how to find the root cause, I think your only option is to bisect the Linux kernel.
Comment 10 Chuck Burt 2018-03-23 14:00:20 UTC
Created attachment 274895
Git Bisect Log 1 - 20180323
Comment 11 Chuck Burt 2018-03-23 14:00:44 UTC
Created attachment 274897
Git Bisect Log 2 - 20180323
Comment 12 Chuck Burt 2018-03-23 14:04:37 UTC
I finally got around to bisecting.

I had to do it twice as I identified two issues here.

Git Bisect Log 1 - https://bugzilla.kernel.org/attachment.cgi?id=274895
This identifies the commit where processes started hanging outright as a result of the drive being connected.

Git Bisect Log 2 - https://bugzilla.kernel.org/attachment.cgi?id=274897
This identifies a separate issue (should I file a separate bug for this?) where mounting/unmounting caused this error:

> Device /dev/sdb3 is already mounted at `/media/temp/[identifier]`. 
> (udisks-error-quark, 6)
Comment 13 Bart Van Assche 2018-03-23 16:06:33 UTC
Thanks for having run a bisect, that really helps.

Recently the following commit went upstream:

commit c9f926000fe3b84135a81602a9f7e63a6a7898e2 (mkp-scsi/4.15/scsi-fixes)
Author: Hannes Reinecke <hare@suse.de>
Date:   Wed Jan 10 09:34:02 2018 +0100

    scsi: libsas: Disable asynchronous aborts for SATA devices
    
    Handling CD-ROM devices from libsas is decidedly odd, as libata relies
    on SCSI EH to be started to figure out that no medium is present.  So we
    cannot do asynchronous aborts for SATA devices.
    
    Fixes: 909657615d9 ("scsi: libsas: allow async aborts")
    Cc: <stable@vger.kernel.org> # 4.12+
    Signed-off-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Yves-Alexis Perez <corsac@debian.org>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

So you may want to try one of the kernel versions that includes that fix, e.g. v4.14.15 or v4.15.
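
A rough way to check a version string against the releases named above is sketched below. The helper name `kernel_has_libsas_fix` is hypothetical, and the thresholds (v4.14.15 within the 4.14 series, v4.15 otherwise) are taken only from this comment; distribution kernels may backport the fix to other versions, so treat a negative answer as a hint rather than proof. It relies on `sort -V` from GNU coreutils.

```shell
# Return success if the given version is at or above a release said to contain the fix.
# Assumption: thresholds are v4.14.15 for the 4.14 series and v4.15 for everything else.
kernel_has_libsas_fix() {
    local ver=${1%%-*}   # drop any suffix, e.g. "4.15.0-041500" -> "4.15.0"
    local series min
    series=$(printf '%s' "$ver" | cut -d. -f1,2)
    case $series in
        4.14) min=4.14.15 ;;
        *)    min=4.15 ;;
    esac
    # The version is new enough if the threshold sorts first (or the two are equal)
    [ "$(printf '%s\n' "$min" "$ver" | sort -V | head -n1)" = "$min" ]
}
```

For example, `kernel_has_libsas_fix "$(uname -r)" && echo "fix should be present"` checks the running kernel.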
Comment 14 Chuck Burt 2018-03-23 19:48:29 UTC
I tested with v4.15.0-041500.  Great news, the hang is resolved!

The second issue I found still exists, but that is not nearly as severe (it doesn't block my usage).  It also occurs on more drives.  Should I break that into a separate issue?

Thank you very very much for your help, Bart.
Comment 15 Bart Van Assche 2018-03-23 19:56:53 UTC
Sorry but I lost track. What was the second issue?
Comment 16 Chuck Burt 2018-03-23 20:35:23 UTC
Git Bisect Log 2 - https://bugzilla.kernel.org/attachment.cgi?id=274897
This identifies a separate issue (should I file a separate bug for this?) where mounting/unmounting caused this error:

> Device /dev/sdb3 is already mounted at `/media/temp/[identifier]`. 
> (udisks-error-quark, 6)

Unmounting gives a similar error about being unable to unmount (I can provide the exact error in a bit if you need it).


This mounting/unmounting error still exists in the v4.15 kernel and was introduced in the commit isolated in the above bisect (Git Bisect Log 2).
Comment 17 Bart Van Assche 2018-03-23 20:48:26 UTC
At the end of bisect log 2 I found the following:

first bad commit: [8d65b08debc7e62b2c6032d7fe7389d895b92cbc] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

It seems unlikely to me that any of the commits in the networking tree would cause mounting of a local filesystem to fail.
Comment 18 Chuck Burt 2018-03-23 20:56:02 UTC
Yet it appears there were numerous revisions in the `drivers/scsi` area...? 

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=8d65b08debc7e62b2c6032d7fe7389d895b92cbc

I'm a newbie, so... I could obviously be reading this completely wrong...
Comment 19 Chuck Burt 2018-03-23 20:57:16 UTC
(Also, to clarify, the mounting does not actually fail... it produces that error as a dialog in the GUI, but mounting does actually succeed.)
Comment 20 Bart Van Assche 2018-03-23 21:03:18 UTC
As far as I can see merging Dave's tree pulled in only the following three SCSI changes:
* qed*: Utilize Firmware 8.15.3.0
* qedf: fix wrong le16 conversion
* netlink: extended ACK reporting

Unless you are using the qedi or qedf driver, I think it's unlikely that these changes are related to the issue you reported.
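
One way to repeat this kind of check is to list the non-merge commits that a merge pulled in and that touch a given path. The following is a minimal sketch; `merge_changes_under` is a hypothetical helper name, and it assumes the first parent of the merge is the side being merged into.

```shell
# List non-merge commits brought in by a merge that touch a given path.
# Usage: merge_changes_under <merge-commit> <path>   (run inside a git work tree)
merge_changes_under() {
    local merge=$1 path=$2
    # "<merge>^1..<merge>" selects commits reachable from the merge
    # but not from its first parent, i.e. what the merge added.
    git log --no-merges --format='%h %s' "$merge^1..$merge" -- "$path"
}
```

In a kernel checkout, `merge_changes_under 8d65b08debc7e62b2c6032d7fe7389d895b92cbc drivers/scsi` would list the candidate SCSI changes from that merge.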
Comment 21 Chuck Burt 2018-03-23 21:32:05 UTC
Thank you again.  Should we close this issue as duplicate / resolved elsewhere?
Comment 22 Bart Van Assche 2018-03-23 21:35:04 UTC
This ticket has category IO/Storage; SCSI. That category does not cover mounting filesystems. I'm fine with closing this ticket and creating a new ticket for the mount issue if necessary.
