Bug 216932
| Summary: | io_uring with libvirt causes kernel NULL pointer dereference | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Sergey V. (truesmb) |
| Component: | AIO | Assignee: | Badari Pulavarty (pbadari) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | aenigma1372, axboe, dburger |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 6.1.5 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | My dmesg, MainDesktop1-Win10.xml, MainDesktop1-Win10_QEMU_cmdline, qemu_hooks | | |
This started after the update from 6.1.4 to 6.1.5; rolling back to 6.1.4 solves the problem.

If you are building your own kernels, cherry-pick commit 613b14884b8595e20b9fac4126bf627313827fbe and it should work fine. This was missed in the backport:

axboe@m1max ~/gi/linux-block (6.1-stable)> git cherry-pick 613b14884b8595e20b9fac4126bf627313827fbe
Auto-merging block/blk-merge.c
Auto-merging block/blk-mq.c
Auto-merging drivers/block/drbd/drbd_req.c
Auto-merging drivers/md/dm.c
Auto-merging drivers/md/md.c
Auto-merging drivers/nvme/host/multipath.c
[6.1-stable fecb066b9366] block: handle bio_split_to_limits() NULL return
 Date: Wed Jan 4 08:51:19 2023 -0700
 8 files changed, 19 insertions(+), 2 deletions(-)

I've been running into the same issue as well. It definitely affects the 6.1.x series from 6.1.5 onward (up through and including 6.1.12). Downgrade to 6.1.4 and everything works fine; anything past that and it's broken. I'm running on 6.1.12 now, and if I disable io_uring in my VM config it also works fine, so it seems likely to be related to this issue (unfortunately, I just came across this report and don't believe I have the logs anymore). Does anybody know if this issue was fixed in the kernel 6.2 release?

Huh, this really should be fixed in 6.1.12, see the comment just above yours. I'm OOO until tomorrow, will take a look at it then. But 6.1.7 and newer have the patch; only 6.1.5/6.1.6 should be affected for the original report.

Darrin, please attach your oops here on 6.1.12, please.

(In reply to Jens Axboe from comment #5)
> Darrin, please attach your oops here on 6.1.12, please.

Let me go back through and see if I have the logs from that time. If not, I should be able to reproduce the issue again (not sure if the oops was the same and/or if it's a different issue, but the timing of the breakage was the same, and reverting to 6.1.4 and/or disabling io_uring in the VM config fixes it). Will go back and re-test shortly when I'm done with what I'm working on.

Just re-tested this. It's pretty frustrating, as I don't see any dmesg/kernel errors related to this, no errors in the libvirtd/qemu logs, etc. From what I can see from a log perspective, it looks like everything is working fine. However, something has definitely been broken since 6.1.5+. The behavior is slightly different from the original post: the Windows 10 VM boots fine, but when I try to access something on the volume that's using io_uring, that application hangs and it won't let me kill it (access denied error via Task Manager). If I try to shut down the VM, it hangs on shutdown (I end up having to force it off).

Relevant storage config for that specific volume (data storage volume on an NVMe via LVM) is as follows:

---
<iothread id="3" thread_pool_min="4" thread_pool_max="16"/>
<iothreadpin iothread="3" cpuset="6,14"/>
<disk type="block" device="disk">
  <driver name="qemu" type="raw" cache="none" io="io_uring"/>
  <source dev="/dev/mapper/vg_VMstore-lv_Win10VM"/>
  <target dev="sdc" bus="scsi"/>
  <address type="drive" controller="2" bus="0" target="0" unit="1"/>
</disk>
<controller type="scsi" index="2" model="virtio-scsi">
  <driver queues="4" iothread="3"/>
  <address type="pci" domain="0x0000" bus="0x0d" slot="0x00" function="0x0"/>
</controller>
---

There are two other volumes (boot + data), but they don't appear to be affected, as they're currently using io=threads due to io=io_uring breaking in the past and causing problems (especially on the boot volume). If I change the I/O setting on this volume from io_uring to threads, everything works fine on kernel 6.1.12.
If I leave the I/O setting as io_uring and revert back to kernel 6.1.4, everything works fine.

Version info:

[root@experior ~]# uname -r
6.1.12-arch1-1
[root@experior ~]# libvirtd --version
libvirtd (libvirt) 9.0.0
[root@experior ~]# qemu-system-x86_64 --version
QEMU emulator version 7.2.0

IIRC, I believe there were multiple io_uring related commits in kernel 6.1.5, but since I'm not getting any errors in the logs, it's definitely not giving me much to work with to get a general idea as to what's causing the breakage. I saw 6.1.13 was released yesterday or today, but I doubt it's going to make it into the Arch repos, as 6.2 is already in the testing repo. I was going to hold off for a bit before testing 6.2, but I might see if the issue is resolved there once it ends up in the standard repo.

How is your dm device setup on top of the nvme? I'm going to try and see if I can reproduce this.

[root@experior ~]# lshw -c storage
  *-nvme
       description: NVMe device
       product: Samsung SSD 970 EVO Plus 2TB
       vendor: Samsung Electronics Co Ltd
       physical id: 0
       bus info: pci@0000:01:00.0
       logical name: /dev/nvme0
       version: 2B2QEXM7
       serial: S59CNM0R905753H
       width: 64 bits
       clock: 33MHz
       capabilities: nvme pm msi pciexpress msix nvm_express bus_master cap_list
       configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:144d144dS59CNM0R905753H Samsung SSD 970 EVO Plus 2TB state=live
       resources: irq:98 memory:fcf00000-fcf03fff

[root@experior ~]# parted /dev/nvme0n1 print
Model: Samsung SSD 970 EVO Plus 2TB (nvme)
Disk /dev/nvme0n1: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  1000GB  1000GB  btrfs        primary
 2      1000GB  2000GB  1000GB               primary

[root@experior ~]# lsblk -t /dev/nvme0n1
NAME                      ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1                           0    512      0     512     512    0 none     1023 128    0B
├─nvme0n1p1                       0    512      0     512     512    0 none     1023 128    0B
└─nvme0n1p2                       0    512      0     512     512    0 none     1023 128    0B
  └─vg_VMstore-lv_Win10VM         0    512      0     512     512    0           128 128    0B

[root@experior ~]# pvs
  PV             VG         Fmt  Attr PSize   PFree
  /dev/nvme0n1p2 vg_VMstore lvm2 a--  931.50g    0

[root@experior ~]# vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  vg_VMstore   1   1   0 wz--n- 931.50g    0

[root@experior ~]# lvs
  LV         VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_Win10VM vg_VMstore -wi-ao---- 931.50g

[root@experior ~]# ls -alh /dev/disk/by-id | grep nvme
lrwxrwxrwx 1 root root 15 Feb 23 02:09 lvm-pv-uuid-TKF2kb-FUiD-elg1-3Oew-cCtc-xP2V-EjBesY -> ../../nvme0n1p2
lrwxrwxrwx 1 root root 13 Feb 23 02:09 nvme-eui.0025385911902877 -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb 23 02:09 nvme-eui.0025385911902877-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Feb 23 02:09 nvme-eui.0025385911902877-part2 -> ../../nvme0n1p2
lrwxrwxrwx 1 root root 13 Feb 23 02:09 nvme-Samsung_SSD_970_EVO_Plus_2TB_S59CNM0R905753H -> ../../nvme0n1
lrwxrwxrwx 1 root root 15 Feb 23 02:09 nvme-Samsung_SSD_970_EVO_Plus_2TB_S59CNM0R905753H-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root 15 Feb 23 02:09 nvme-Samsung_SSD_970_EVO_Plus_2TB_S59CNM0R905753H-part2 -> ../../nvme0n1p2

If you need any other info to attempt to reproduce, please do not hesitate to let me know. Thanks for looking into this.
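One quick way to confirm whether the LV is a plain linear device-mapper mapping (this question comes up just below) is to dump its dm table. A minimal sketch, assuming the vg_VMstore/lv_Win10VM names from the output above; exact sector counts and device numbers will differ:

```sh
# Show the dm table backing the LV: "start length target backing-dev offset".
# A single "linear" line means the LV is a plain linear mapping over nvme0n1p2.
dmsetup table vg_VMstore-lv_Win10VM

# Same information from the LVM side: segment type and backing device per LV.
lvs -o lv_name,segtype,devices vg_VMstore
```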
As a side note, I might do some other testing if/when I have some time, as I have recent backups of the other volume images (the primary boot volume is qcow2-backed on SSDs (BTRFS RAID-1) and the secondary data volume is qcow2-backed on HDDs (ZFS RAID-10)). Those broke with io_uring a while back as well (I believe when I transitioned from Ubuntu 20.04 over to Arch, IIRC) and I haven't re-tested them since.

Tried many things, cannot seem to reproduce it. Can you grab:

grep . /sys/block/nvme0n1/queue/*
grep . /sys/block/vg_VMstore-lv_Win10VM/queue/*

output for me? And the lv_Win10VM, is that using btrfs? Or some other fs?

Might also help if you can provide a full VM config that I can just run here. If I need to load up a Windows 10 to reproduce, I'll be happy to do that. Key here is just that it'd be great if it includes all the config I need and the command you run to boot it, so that we avoid any differences there.

Oh, and I'm assuming vg_VMstore-lv_Win10VM is just linear? Here's how I set up my test:

dmsetup create test-linear --table '0 41943040 linear /dev/nvme1n1 0'

Output requested:

[root@experior ~]# grep . /sys/block/nvme0n1/queue/*
/sys/block/nvme0n1/queue/add_random:0
/sys/block/nvme0n1/queue/chunk_sectors:0
/sys/block/nvme0n1/queue/dax:0
/sys/block/nvme0n1/queue/discard_granularity:512
/sys/block/nvme0n1/queue/discard_max_bytes:2199023255040
/sys/block/nvme0n1/queue/discard_max_hw_bytes:2199023255040
/sys/block/nvme0n1/queue/discard_zeroes_data:0
/sys/block/nvme0n1/queue/dma_alignment:3
/sys/block/nvme0n1/queue/fua:1
/sys/block/nvme0n1/queue/hw_sector_size:512
/sys/block/nvme0n1/queue/io_poll:0
/sys/block/nvme0n1/queue/io_poll_delay:-1
/sys/block/nvme0n1/queue/iostats:1
/sys/block/nvme0n1/queue/io_timeout:30000
/sys/block/nvme0n1/queue/logical_block_size:512
/sys/block/nvme0n1/queue/max_discard_segments:256
/sys/block/nvme0n1/queue/max_hw_sectors_kb:2048
/sys/block/nvme0n1/queue/max_integrity_segments:0
/sys/block/nvme0n1/queue/max_sectors_kb:1280
/sys/block/nvme0n1/queue/max_segments:127
/sys/block/nvme0n1/queue/max_segment_size:4294967295
/sys/block/nvme0n1/queue/minimum_io_size:512
/sys/block/nvme0n1/queue/nomerges:0
/sys/block/nvme0n1/queue/nr_requests:1023
/sys/block/nvme0n1/queue/nr_zones:0
/sys/block/nvme0n1/queue/optimal_io_size:0
/sys/block/nvme0n1/queue/physical_block_size:512
/sys/block/nvme0n1/queue/read_ahead_kb:128
/sys/block/nvme0n1/queue/rotational:0
/sys/block/nvme0n1/queue/rq_affinity:1
/sys/block/nvme0n1/queue/scheduler:[none] mq-deadline kyber bfq
/sys/block/nvme0n1/queue/stable_writes:0
/sys/block/nvme0n1/queue/throttle_sample_time:20
/sys/block/nvme0n1/queue/virt_boundary_mask:4095
/sys/block/nvme0n1/queue/wbt_lat_usec:2000
/sys/block/nvme0n1/queue/write_cache:write back
/sys/block/nvme0n1/queue/write_same_max_bytes:0
/sys/block/nvme0n1/queue/write_zeroes_max_bytes:2097152
/sys/block/nvme0n1/queue/zone_append_max_bytes:0
/sys/block/nvme0n1/queue/zoned:none
/sys/block/nvme0n1/queue/zone_write_granularity:0

[root@experior ~]# grep . /sys/block/vg_VMstore-lv_Win10VM/queue/*
grep: /sys/block/vg_VMstore-lv_Win10VM/queue/*: No such file or directory

Will respond re the questions in the next post, and will submit the full VM config in a separate follow-up post shortly as well. As a quick dumb side question, does Bugzilla support code blocks at all?

Re the LVM volume, it's on top of a partition, with no underlying filesystem on Linux (it's just passed raw to Win10, where it's formatted NTFS within Win10).
Original commands to set it up were as follows:

---
parted /dev/nvme0n1 mklabel gpt
parted -a optimal /dev/nvme0n1 mkpart primary 0% 50%
parted -a optimal /dev/nvme0n1 mkpart primary 50% 100%
pvcreate /dev/nvme0n1p2
vgcreate vg_VMstore /dev/nvme0n1p2
lvcreate -l +100%FREE vg_VMstore -n lv_Win10VM
---

Created attachment 303775 [details]
MainDesktop1-Win10.xml
VM config attached (currently have "io=threads" instead of "io=io_uring" configured for sdc as mentioned previously due to issues).
If you have any other questions or need any addl. details regarding any of the other config (e.g. host hardware specs, storage setup for the other 2 volumes, etc.), please do not hesitate to let me know.

Also, just as a heads-up re versions, the liburing version is as follows (as I forgot to include this previously):

c0de@experior:~$ yay -Q | grep liburing
liburing 2.3-1

And please include instructions on how to launch it too. I don't generally use libvirt or xml for my QEMU configurations.
Just out of curiosity, are you testing on Arch or a different distro?

On Arch, I'm using the qemu-full package, but I usually create and start/stop VMs via VMM (virt-manager), outside of custom parameters that need to be adjusted manually in the XML.

c0de@experior:~$ yay -Q | grep qemu-full
qemu-full 7.2.0-3
c0de@experior:~$ yay -Q | grep virt-manager
virt-manager 4.1.0-1

I believe the libvirt/QEMU logs show the full command line, so I'll see if I can grab that and throw it in an attachment in a follow-up post, although I suspect that you'll need to edit some of the devices/paths/etc. since they're probably different on your system than on mine.

Also, if you need to know exactly how the other two storage volumes are set up, I can dig back through my notes to see if I can put together a process and the relevant settings for them, given that one is backed on BTRFS on SSDs and the other is backed on ZFS on HDDs.

Created attachment 303776 [details]
MainDesktop1-Win10_QEMU_cmdline
As a quick note, the "MainDesktop1-Win10.xml" file would reside in the /etc/libvirt/qemu directory.
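Since launch instructions were requested above, here is a minimal sketch of starting the same guest without virt-manager, assuming the attached XML and a domain name of MainDesktop1-Win10 (libvirt manages /etc/libvirt/qemu itself, so defining through virsh is the usual route rather than copying the file in by hand):

```sh
# Register the domain from the attached XML, then boot it.
virsh define /path/to/MainDesktop1-Win10.xml
virsh start MainDesktop1-Win10

# Confirm which io= mode each disk ended up with in the live definition.
virsh dumpxml MainDesktop1-Win10 | grep -i 'io='
```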
Created attachment 303777 [details]
qemu_hooks
Also, while it's probably irrelevant to the current issue, I also have CPU isolation configured via "/etc/libvirt/hooks/qemu"; the contents are in the attachment.
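The attached hook isn't reproduced here, but for readers unfamiliar with libvirt hooks, a generic sketch of this kind of CPU-isolation hook follows (made-up CPU ranges and guest-name check; explicitly not the contents of the qemu_hooks attachment). libvirt invokes /etc/libvirt/hooks/qemu with the guest name and an operation such as prepare or release:

```sh
#!/bin/bash
# Generic illustration of a CPU-isolation hook, not the attached file.
guest="$1" operation="$2"

if [ "$guest" = "MainDesktop1-Win10" ]; then
    case "$operation" in
        prepare)
            # Restrict host slices to a subset of CPUs while the VM runs
            # (CPU ranges here are placeholders).
            systemctl set-property --runtime -- system.slice AllowedCPUs=0-3,8-11
            systemctl set-property --runtime -- user.slice AllowedCPUs=0-3,8-11
            ;;
        release)
            # Give all CPUs back to the host once the VM shuts down.
            systemctl set-property --runtime -- system.slice AllowedCPUs=0-15
            systemctl set-property --runtime -- user.slice AllowedCPUs=0-15
            ;;
    esac
fi
```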
Just as a quick follow-up, I'm still seeing the issue/behavior persist on kernel 6.2.2 as well.

Sorry, I've been slow on this, and I'm OOO this week. I'll get back on this on Monday so we can get to the bottom of wtf is going on here.

Jens, thank you for the update. All good, I appreciate the assistance with this. Once you have a chance to dig further into the issue, please do not hesitate to reach out if you need any additional information.
Created attachment 303605 [details]
My dmesg

After the 6.1.5 kernel update my VM locks up at boot, and dmesg says "BUG: kernel NULL pointer dereference, address: 0000000000000005" (more details in the attachment).

I have a few drives attached to the VM with io_uring:

<driver name="qemu" type="raw" cache="none" io="io_uring" discard="unmap" detect_zeroes="off"/>
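For anyone triaging the original oops: per the discussion above, the NULL pointer dereference came from a missed backport of commit 613b14884b85 ("block: handle bio_split_to_limits() NULL return"), so only 6.1.5/6.1.6 should hit it; 6.1.7 and newer carry the fix. A hedged sketch for checking and patching a self-built stable tree (the checkout path is an assumption):

```sh
# In a linux-stable checkout, see whether the range you build from already has the fix
# (subject line taken from the cherry-pick output quoted earlier in the thread).
cd ~/src/linux-stable
git log --oneline v6.1.4..v6.1.7 --grep='handle bio_split_to_limits() NULL return'

# If your branch is missing it (e.g. you build 6.1.5/6.1.6), cherry-pick it.
git cherry-pick 613b14884b8595e20b9fac4126bf627313827fbe
```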