Bug 200709 - QEMU's IDE hard disk device fails to work properly with 4.18 release candidates (regression vs. 4.17 and earlier)
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: IDE
Hardware: i386 Linux
Importance: P1 normal
Assignee: io_ide@kernel-bugs.osdl.org
Reported: 2018-08-02 02:38 UTC by David H. Gutteridge
Modified: 2018-11-13 01:50 UTC
CC List: 2 users

Kernel Version: 4.18-rc7
Tree: Mainline
Regression: No


Attachments
Specific KVM/QEMU configuration that triggers the issue (2.18 KB, text/plain)
2018-08-02 02:40 UTC, David H. Gutteridge
Kernel config reproducer (193.18 KB, text/plain)
2018-08-17 02:38 UTC, David H. Gutteridge

Description David H. Gutteridge 2018-08-02 02:38:56 UTC
With kernels 4.18-rc4 through 4.18-rc7 (rc4 is the earliest I've
tested), QEMU's IDE hard disk device fails to work properly (dmesg
excerpt follows), resulting in VMs that cannot boot. Stable kernel
releases are not affected, e.g. 4.17.11 boots fine with the very
same QEMU configuration.

Output like the following loops over and over until the dracut
emergency shell is spawned.

[  236.862646] ata_piix 0000:00:01.1: dma_direct_map_sg: overflow 0x0000000277583000+4096 of device mask ffffffff
[  236.874151] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[  236.876273] ata7.00: failed command: READ DMA
[  236.878413] ata7.00: cmd c8/00:08:00:08:20/00:00:00:00:00/e0 tag 0 dma 4096 in
[  236.878413]          res 50/00:00:00:00:00/00:00:00:00:00/b0 Emask 0x40 (internal error)
[  236.890747] ata7.00: status: { DRDY }
[  236.893063] ata7.00: configured for MWDMA2
[  236.895120] ata7.01: configured for MWDMA2
[  236.898289] sd 6:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  236.903840] sd 6:0:0:0: [sda] tag#0 Sense Key : Illegal Request [current] 
[  236.908104] sd 6:0:0:0: [sda] tag#0 Add. Sense: Unaligned write command
[  236.912203] sd 6:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 20 08 00 00 00 08 00
[  236.916444] print_req_error: I/O error, dev sda, sector 2099200
[  236.918649] ata7: EH complete
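
For readers decoding the excerpt above, here is a minimal Python sketch (not part of the original report; the helper names are hypothetical) of the two pieces of arithmetic in the log. It assumes the overflow message means the buffer's last byte lies beyond the device's DMA mask, and it reads the READ(10) CDB in the standard layout (opcode 0x28, 4-byte big-endian LBA at bytes 2-5, 2-byte transfer length at bytes 7-8).

```python
import re

# Lines copied from the dmesg excerpt above (timestamps dropped).
OVERFLOW_LINE = ("ata_piix 0000:00:01.1: dma_direct_map_sg: overflow "
                 "0x0000000277583000+4096 of device mask ffffffff")
CDB_LINE = "sd 6:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 20 08 00 00 00 08 00"


def parse_overflow(line):
    """Return (address, length, mask) from a dma_direct_map_sg overflow line."""
    m = re.search(
        r"overflow 0x([0-9a-fA-F]+)\+(\d+) of device mask ([0-9a-fA-F]+)", line)
    if m is None:
        raise ValueError("not an overflow line")
    return int(m.group(1), 16), int(m.group(2)), int(m.group(3), 16)


def fits_mask(addr, length, mask):
    """True if the last byte of the buffer is addressable under the mask."""
    return addr + length - 1 <= mask


def decode_read10(line):
    """Return (lba, blocks) from a READ(10) CDB printed as hex bytes."""
    cdb = bytes(int(b, 16) for b in line.split("Read(10)")[1].split())
    if cdb[0] != 0x28:  # 0x28 is the READ(10) opcode
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return lba, blocks


if __name__ == "__main__":
    addr, length, mask = parse_overflow(OVERFLOW_LINE)
    print("buffer fits 32-bit mask:", fits_mask(addr, length, mask))
    print("READ(10) lba, blocks:", decode_read10(CDB_LINE))
```

The buffer address is well above 4 GiB, so it cannot fit a 32-bit mask, and the decoded LBA matches the failing sector reported by print_req_error (2099200), with 8 blocks of 512 bytes accounting for the 4096-byte transfer.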

I don't know if I've filed this under the correct system: it occurs
specifically with QEMU's IDE hard disk device, but it may well be
an issue with DMA rather than IDE, or something else.

At the moment, I can't provide a full dmesg, as I'm unable to save
dmesg output when the VM gets stuck like this, and the serial console
in virt-manager doesn't have a large enough scrollback buffer to
capture everything before a lot of text has scrolled by.

I haven't tested with x86_64; it may well be an issue there too. (I
just happened to be testing i386 to validate the 32-bit PTI mitigations
that are under development, so I've filed it against that arch.)

I can follow up with further information as need be next week.
Comment 1 David H. Gutteridge 2018-08-02 02:40:08 UTC
Created attachment 277657
Specific KVM/QEMU configuration that triggers the issue

Here's my configuration that can reproduce the issue.
Comment 2 David H. Gutteridge 2018-08-17 02:36:53 UTC
I've found that I cannot reproduce this issue with the kernel config
produced by "make defconfig" for the 4.18 series. It occurs with a
specific config that I've been using across multiple kernel releases.
That config was originally created for 4.15 and has been carried
forward since, changing only where new options were added. Until
4.18, it worked fine.

With the specific kernel config I've attached, I can still reproduce
this with a more recent checkout of tip (a.k.a. 4.18.0_rc8+).

I haven't narrowed down the difference between the rather slimmed-down
config generated by "make defconfig" and mine that causes the
regression.
Comment 3 David H. Gutteridge 2018-08-17 02:38:26 UTC
Created attachment 277903
Kernel config reproducer

This is the kernel config I've been using to test with.
Comment 4 Nemo Inis 2018-10-08 22:58:55 UTC
The dma_direct_map_sg: overflow error seems to affect 32-bit installations from kernel 4.18 on. See these reports:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=908924

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1794922/
Comment 5 Daniel Reichelt 2018-10-14 11:12:21 UTC
I'm using this kvm command-line:

kvm \
  -append "root=/dev/sda ro elevator=noop net.ifnames=0 console=ttyS0" \
  -display none \
  -drive file=kvmsid32.qcow2,cache=writeback,detect-zeroes=unmap,discard=unmap \
  -initrd initrd.img \
  -k de \
  -kernel vmlinuz \
  -m $MEM \
  -monitor unix:/run/myutils/ckvm/3.monitor-socket,server,nowait,nodelay \
  -name kvmsid32 \
  -net nic,model=e1000 \
  -net tap,ifname=tap3,script=/etc/qemu-ifup-with-mac \
  -pidfile /run/myutils/ckvm/3.pid \
  -runas nobody \
  -serial stdio \
  -smp 4 \
  -snapshot \
  -usbdevice tablet \
  -vnc localhost:53


When using less than 3.5 GiB of RAM (-m 3583 or lower), the issue isn't triggered. Specifying -m 3584 or more triggers it.
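
The 3584 MiB threshold is exactly 0xE0000000 bytes. One possible explanation, which is an assumption and not something confirmed in this report, is that QEMU's default PC machine keeps all guest RAM below 4 GiB up to that size and relocates part of it above 4 GiB from that size on, at which point a 32-bit PAE guest can hand out pages that a 32-bit DMA mask cannot reach. The sketch below only models that assumption; the function names and the two split constants are hypothetical parameters, not values taken from the report.

```python
MIB = 1024 * 1024
DMA_MASK_32 = 0xFFFFFFFF

# ASSUMPTION: hypothetical model of the guest RAM split, not taken from the report.
SPLIT_THRESHOLD = 0xE0000000      # guest sizes at or above this get split
LOWMEM_WHEN_SPLIT = 0xC0000000    # RAM kept below 4 GiB once split


def ram_above_4g(mem_mib,
                 split_threshold=SPLIT_THRESHOLD,
                 lowmem_when_split=LOWMEM_WHEN_SPLIT):
    """Bytes of guest RAM placed above 4 GiB under the assumed split model."""
    total = mem_mib * MIB
    if total < split_threshold:
        return 0
    return total - lowmem_when_split


def can_exceed_32bit_mask(mem_mib):
    """True if some guest RAM lies beyond what a 32-bit DMA mask can address."""
    return ram_above_4g(mem_mib) > 0


if __name__ == "__main__":
    for mem in (3583, 3584):
        print(mem, "MiB ->", ram_above_4g(mem) // MIB, "MiB above 4 GiB")
```

Under this model, -m 3583 leaves no RAM above 4 GiB while -m 3584 places some there, which would be consistent with the observed threshold.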

git-bisect'ing led me to https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=21e07dba9fb1179148089d611fc9e6e70d1887c3 - with this reverted, the issue disappears even when I start kvm with -m 3584 or more.
Comment 6 David H. Gutteridge 2018-10-20 05:40:48 UTC
This commit addresses the problem for me:
https://lkml.org/lkml/2018/10/14/62

It was pulled into the tip tree (https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/) a day ago. I just built and tested a kernel from tip, and I'm no longer able to reproduce the issue.
Comment 7 Daniel Reichelt 2018-10-20 18:40:27 UTC
+1

Thanks, David!
Comment 8 David H. Gutteridge 2018-11-13 01:50:04 UTC
The fix has also been pulled into 4.18.18. I tested and confirmed that release addresses the issue for me as well. Closing ticket accordingly.
