When booted under kernel 4.13.x, processes (such as parted, boot-info, gparted, etc.) always hang when run if a Western Digital Green WD30-EZRX 3TB HDD is connected. The drive works as expected under kernel 4.10.x (with everything else about the system unchanged; only the booted kernel differs).

Full specs of the machine, if useful: https://www.support.hp.com/id-en/document/c03277050

More details on Ubuntu's Launchpad (where they asked me to come here to file an upstream bug): https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730746
Please provide the output of the following command after having reproduced the hang:

dmesg -c >/dev/null; echo w > /proc/sysrq-trigger; dmesg

Additionally, if you know how to build the kernel yourself, it would be helpful if you could bisect this issue. Documentation is available e.g. at https://git-scm.com/docs/git-bisect.
Created attachment 260925 [details] Output of requested command I reproduced the hang and ran the command as requested. See attached file output-20171129.txt Building the kernel is something I could attempt tackling, but as a newbie I'm highly likely to mess something up. Either way, it will be a few weeks before I can get to it (best case). So I _really_ hope this provides the clue needed!
Created attachment 260927 [details] Output of requested command as su After reading the first few lines of the last attachment, it occurred to me that running this command as su might be useful. See attached.
Created attachment 260929 [details] Output of requested command as su
So command processing got stuck. Since there are two code paths in recent kernels we need to know whether or not scsi-mq was used. Hence please provide the output of the following command:

for d in /sys/block/*; do sfx=""; [ -e "$d/mq" ] && sfx=" [mq]"; echo "$d$sfx"; done

If the above command reports that scsi-mq is being used for the WDC disk, please check whether the following command resolves the lockup:

for d in /sys/kernel/debug/block/*/state; do echo kick >$d; done
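For anyone repeating this check, the one-liner above can be wrapped in a small function. This is only a sketch: the function name `list_block_queues` and its optional directory argument are illustrative assumptions, not part of any kernel tooling. It applies the same test as the command above, namely whether an `mq` entry exists under the device's sysfs directory.

```shell
#!/bin/sh
# list_block_queues [DIR]   (hypothetical helper, name chosen for illustration)
# Print every entry under DIR (default: /sys/block) and append " [mq]"
# to the entries that contain an "mq" entry, as in the one-liner above.
list_block_queues() {
    root="${1:-/sys/block}"
    for d in "$root"/*; do
        # Skip the literal pattern if the directory has no entries.
        [ -e "$d" ] || continue
        sfx=""
        [ -e "$d/mq" ] && sfx=" [mq]"
        echo "$d$sfx"
    done
}
```

Calling `list_block_queues` with no argument inspects `/sys/block`; passing a directory is mainly useful for trying the function out on a test tree.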
> # for d in /sys/block/*; do sfx=""; [ -e "$d/mq" ] && sfx=" [mq]"; echo "$d$sfx"; done
> /sys/block/loop0 [mq]
> /sys/block/loop1 [mq]
> /sys/block/loop2 [mq]
> /sys/block/loop3 [mq]
> /sys/block/loop4 [mq]
> /sys/block/loop5 [mq]
> /sys/block/loop6 [mq]
> /sys/block/loop7 [mq]
> /sys/block/sda
> /sys/block/sdb
> /sys/block/sdc
> /sys/block/sr0
That's weird: there are no known queue lockup bugs in the legacy block/SCSI core layers. Is the WDC hard disk perhaps controlled by an HBA? Can you provide the output of lspci (run as root)?
Created attachment 260953 [details] Output of lspci on 4.10.x kernel I won't be able to boot into the newer kernel for about a week, however since `lspci` is hardware-oriented, sharing the output under the older kernel in case it's helpful. Please let me know if you want me to run it on the new one instead and I'll get it when I can.
My hope was that the list of PCI devices would show a PCI HBA whose driver has been modified recently. Since that's not the case, I'm out of ideas about what could be the root cause of this bug. Unless someone else has an idea about how to find the root cause of this issue, I think your only option is to perform a bisect of the Linux kernel.
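A bisect of this kind could be started roughly as sketched below. The helper name `start_bisect` is hypothetical; the `git bisect` subcommands are the standard ones described in the git-bisect documentation linked earlier. It assumes the current directory is a kernel git checkout and that both a known-good and a known-bad revision are available.

```shell
#!/bin/sh
# start_bisect GOOD BAD   (hypothetical helper, name chosen for illustration)
# Begin a bisect between a revision known to work (GOOD) and one known
# to show the hang (BAD). Must be run inside a git work tree.
start_bisect() {
    good="$1"
    bad="$2"
    git bisect start &&
    git bisect bad "$bad" &&
    git bisect good "$good"
}

# After building and booting each revision that git checks out, record
# the result and let git pick the next revision to test:
#   git bisect good     # hang not reproduced
#   git bisect bad      # hang reproduced
# When git reports the first bad commit, save the log and clean up:
#   git bisect log > bisect.log
#   git bisect reset
```

Given the versions named in this report, one plausible invocation would be `start_bisect v4.10 v4.13`, assuming those mainline tags bracket the regression.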
Created attachment 274895 [details] Git Bisect Log 1 - 20180323
Created attachment 274897 [details] Git Bisect Log 2 - 20180323
I finally got around to bisecting. I had to do it twice as I identified two issues here.

Git Bisect Log 1 - https://bugzilla.kernel.org/attachment.cgi?id=274895
This identifies the commit where processes would full on hang as a result of the drive being connected.

Git Bisect Log 2 - https://bugzilla.kernel.org/attachment.cgi?id=274897
This identifies a separate issue (should I file a separate bug for this?) where mounting/unmounting caused error:
> Device /dev/sdb3 is already mounted at `/media/temp/[identifier]`.
> (udisks-error-quark, 6)
Thanks for having run a bisect, that really helps. Recently the following commit went upstream:

commit c9f926000fe3b84135a81602a9f7e63a6a7898e2 (mkp-scsi/4.15/scsi-fixes)
Author: Hannes Reinecke <hare@suse.de>
Date:   Wed Jan 10 09:34:02 2018 +0100

    scsi: libsas: Disable asynchronous aborts for SATA devices

    Handling CD-ROM devices from libsas is decidedly odd, as libata
    relies on SCSI EH to be started to figure out that no medium is
    present. So we cannot do asynchronous aborts for SATA devices.

    Fixes: 909657615d9 ("scsi: libsas: allow async aborts")
    Cc: <stable@vger.kernel.org> # 4.12+
    Signed-off-by: Hannes Reinecke <hare@suse.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Yves-Alexis Perez <corsac@debian.org>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

So you may want to try one of the kernel versions that includes that fix, e.g. v4.14.15 or v4.15.
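To check from a kernel git checkout whether a given release range carries this fix, one option is to search the history by commit subject. Searching by subject rather than by commit ID matters for stable releases, because a backported commit gets a new ID. The helper name `find_fix_by_subject` below is an assumption for illustration; `git log --grep` and `--fixed-strings` are standard git options.

```shell
#!/bin/sh
# find_fix_by_subject RANGE SUBJECT   (hypothetical helper)
# List commits in RANGE whose message contains SUBJECT as a literal
# string. Prints nothing when no commit in the range matches.
find_fix_by_subject() {
    range="$1"
    subject="$2"
    git log --oneline --fixed-strings --grep="$subject" "$range"
}
```

For example, in a checkout that has the stable tags, `find_fix_by_subject v4.14..v4.14.15 'libsas: Disable asynchronous aborts for SATA devices'` would be expected to list the backport if that release includes it.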
I tested with v4.15.0-041500. Great news, the hang is resolved! The second issue I found still exists, but that is not nearly as severe (it doesn't block my usage). It also occurs on more drives. Should I break that into a separate issue? Thank you very very much for your help, Bart.
Sorry but I lost track. What was the second issue?
Git Bisect Log 2 - https://bugzilla.kernel.org/attachment.cgi?id=274897
This identifies a separate issue (should I file a separate bug for this?) where mounting/unmounting caused error:
> Device /dev/sdb3 is already mounted at `/media/temp/[identifier]`.
> (udisks-error-quark, 6)

Unmounting gives a similar error about being unable to unmount (I can provide the exact error in a bit if you need it). This mounting/unmounting error still exists in the v4.15 kernel and was introduced in the commit isolated in the above bisect (Git Bisect Log 2).
At the end of bisect log 2 I found the following:

first bad commit: [8d65b08debc7e62b2c6032d7fe7389d895b92cbc] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next

It seems unlikely to me that any of the commits in the networking tree would cause mounting of a local filesystem to fail.
Yet it appears there were numerous revisions in the `drivers/scsi` area...? https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=8d65b08debc7e62b2c6032d7fe7389d895b92cbc I'm a newbie, so... I could obviously be reading this completely wrong...
(Also, to clarify, the mounting does not actually fail... it produces that error as a dialog in the GUI, but mounting does actually succeed.)
As far as I can see, merging Dave's tree pulled in only the following three SCSI changes:
* qed*: Utilize Firmware 8.15.3.0
* qedf: fix wrong le16 conversion
* netlink: extended ACK reporting

Unless you are using the qedi or qedf driver, I think it's unlikely that these changes are related to the issue you reported.
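A list like the one above can be produced from a kernel checkout by asking git which commits a merge brought in that touch a given path. The following is a sketch under that assumption; the helper name `merge_commits_touching` is made up for illustration, while the `A^1..A` range and the `-- path` limiter are standard git revision syntax.

```shell
#!/bin/sh
# merge_commits_touching MERGE PATH...   (hypothetical helper)
# List commits reachable from MERGE but not from its first parent,
# restricted to those that change files under the given PATHs.
merge_commits_touching() {
    merge="$1"
    shift
    git log --oneline "${merge}^1..${merge}" -- "$@"
}
```

For the merge identified in bisect log 2, the call would be `merge_commits_touching 8d65b08debc7e62b2c6032d7fe7389d895b92cbc drivers/scsi`.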
Thank you again. Should we close this issue as duplicate / resolved elsewhere?
This ticket has category IO/Storage; SCSI. That category does not cover mounting filesystems. I'm fine with closing this ticket and creating a new ticket for the mount issue if necessary.