Bug 216932

Summary: io_uring with libvirt causes kernel NULL pointer dereference
Product: IO/Storage
Component: AIO
Assignee: Badari Pulavarty (pbadari)
Reporter: Sergey V. (truesmb)
Status: NEW
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 6.1.5
Subsystem:
Regression: No
Bisected commit-id:
CC: aenigma1372, axboe, dburger
Attachments: My dmesg
             MainDesktop1-Win10.xml
             MainDesktop1-Win10_QEMU_cmdline
             qemu_hooks

Description Sergey V. 2023-01-15 06:41:03 UTC
Created attachment 303605 [details]
My dmesg

After the 6.1.5 kernel update, my VM locks up at boot, and dmesg says "BUG: kernel NULL pointer dereference, address: 0000000000000005" (more details in the attachment).

I have a few drives attached to the VM with io_uring:

<driver name="qemu" type="raw" cache="none" io="io_uring" discard="unmap" detect_zeroes="off"/>
Comment 1 Sergey V. 2023-01-15 06:51:54 UTC
This started after updating from 6.1.4 to 6.1.5; rolling back to 6.1.4 solves the problem.
Comment 2 Jens Axboe 2023-01-16 13:51:20 UTC
If you are building your own kernels, cherry-pick commit 613b14884b8595e20b9fac4126bf627313827fbe and it should work fine. This was missed in the backport:

axboe@m1max ~/gi/linux-block (6.1-stable)> git cherry-pick 613b14884b8595e20b9fac4126bf627313827fbe
Auto-merging block/blk-merge.c
Auto-merging block/blk-mq.c
Auto-merging drivers/block/drbd/drbd_req.c
Auto-merging drivers/md/dm.c
Auto-merging drivers/md/md.c
Auto-merging drivers/nvme/host/multipath.c
[6.1-stable fecb066b9366] block: handle bio_split_to_limits() NULL return
 Date: Wed Jan 4 08:51:19 2023 -0700
 8 files changed, 19 insertions(+), 2 deletions(-)
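
For anyone tracking a stable branch, a quick way to confirm whether a given release already carries that backport is to search the stable log for the commit subject; a sketch, assuming a linux-stable checkout with the release tags fetched (e.g. checking whether v6.1.7 has it):

git log --oneline v6.1.4..v6.1.7 --grep='bio_split_to_limits() NULL return'

If the backport is present in that range, this prints the corresponding stable commit.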
Comment 3 Darrin Burger 2023-02-21 21:09:04 UTC
I've been running into the same issue as well. It definitely affects the 6.1.x series from 6.1.5 onward (up through and including 6.1.12). Downgrading to 6.1.4 makes everything work fine; anything past that is broken. I'm running 6.1.12 now, and if I disable io_uring in my VM config it also works fine, so it seems likely to be related to this issue (unfortunately, I only just came across this report and don't believe I have the logs anymore).

Does anybody know if this issue was fixed in kernel 6.2 release?
Comment 4 Jens Axboe 2023-02-21 23:44:32 UTC
Huh, this really should be fixed in 6.1.12; see the comment just above yours. I'm OOO until tomorrow and will take a look at it then. But 6.1.7 and newer have the patch, so only 6.1.5/6.1.6 should be affected by the original report.
Comment 5 Jens Axboe 2023-02-21 23:47:12 UTC
Darrin, please attach your oops from 6.1.12 here.
Comment 6 Darrin Burger 2023-02-22 18:01:55 UTC
(In reply to Jens Axboe from comment #5)
> Darrin, please attach your oops here on 6.1.12, please.

Let me go back through and see if I have the logs from that time.  If not, I should be able to reproduce the issue again (not sure if the oops was the same or if it's a different issue, but the timing of the breakage was the same, and reverting to 6.1.4 and/or disabling io_uring in the VM config fixes it).  I'll go back and re-test shortly, once I'm done with what I'm working on.
Comment 7 Darrin Burger 2023-02-22 19:15:05 UTC
Just re-tested this.  It's pretty frustrating, as I don't see any dmesg/kernel errors related to this, no errors in the libvirtd/qemu logs, etc.  From a log perspective, it looks like everything is working fine.  However, something has definitely been broken since 6.1.5.  The behavior is slightly different from the original post: the Windows 10 VM boots fine, but when I try to access something on the volume that's using io_uring, that application hangs and I can't kill it (access denied error via Task Manager).  If I try to shut down the VM, it hangs on shutdown (I end up having to force it off).



Relevant storage config for that specific volume (data storage volume on an NVMe via LVM) is as follows:

---

    <iothread id="3" thread_pool_min="4" thread_pool_max="16"/>

    <iothreadpin iothread="3" cpuset="6,14"/>

    <disk type="block" device="disk">
      <driver name="qemu" type="raw" cache="none" io="io_uring"/>
      <source dev="/dev/mapper/vg_VMstore-lv_Win10VM"/>
      <target dev="sdc" bus="scsi"/>
      <address type="drive" controller="2" bus="0" target="0" unit="1"/>
    </disk>

    <controller type="scsi" index="2" model="virtio-scsi">
      <driver queues="4" iothread="3"/>
      <address type="pci" domain="0x0000" bus="0x0d" slot="0x00" function="0x0"/>
    </controller>

---



There are two other volumes (boot + data), but they don't appear to be affected by the issue, as they're currently using io=threads due to io=io_uring breaking in the past and causing problems (especially on the boot volume).


If I change the I/O setting on it from io_uring to threads, everything works fine on kernel 6.1.12.  If I leave the I/O setting as io_uring and revert to kernel 6.1.4, everything also works fine.



Version info:

[root@experior ~]# uname -r
6.1.12-arch1-1

[root@experior ~]# libvirtd --version
libvirtd (libvirt) 9.0.0

[root@experior ~]# qemu-system-x86_64 --version
QEMU emulator version 7.2.0



IIRC, there were multiple io_uring-related commits in kernel 6.1.5, but since I'm not getting any errors in the logs, there isn't much to work with to get a general idea of what's causing the breakage.
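
For what it's worth, a sketch of how those changes could be listed from a kernel source checkout (assuming a stable tree with the release tags available), limited to the block and io_uring paths:

git log --oneline v6.1.4..v6.1.5 -- block/ io_uring/

Per comment 2, the relevant missing piece in 6.1.5 was the bio_split_to_limits() NULL-return fix.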


I saw 6.1.13 was released yesterday or today, but I doubt it's going to make it into the Arch repos, as 6.2 is already in the testing repo.  I was going to hold off for a bit before testing 6.2, but I might see if the issue is resolved there once it ends up in the standard repo.
Comment 8 Jens Axboe 2023-02-22 20:41:54 UTC
How is your dm device set up on top of the nvme? I'm going to try and see if I can reproduce this.
Comment 9 Darrin Burger 2023-02-23 08:15:46 UTC
[root@experior ~]# lshw -c storage
  *-nvme                    
       description: NVMe device
       product: Samsung SSD 970 EVO Plus 2TB
       vendor: Samsung Electronics Co Ltd
       physical id: 0
       bus info: pci@0000:01:00.0
       logical name: /dev/nvme0
       version: 2B2QEXM7
       serial: S59CNM0R905753H
       width: 64 bits
       clock: 33MHz
       capabilities: nvme pm msi pciexpress msix nvm_express bus_master cap_list
       configuration: driver=nvme latency=0 nqn=nqn.2014.08.org.nvmexpress:144d144dS59CNM0R905753H     Samsung SSD 970 EVO Plus 2TB state=live
       resources: irq:98 memory:fcf00000-fcf03fff



[root@experior ~]# parted /dev/nvme0n1 print
Model: Samsung SSD 970 EVO Plus 2TB (nvme)
Disk /dev/nvme0n1: 2000GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags: 

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  1000GB  1000GB  btrfs        primary
 2      1000GB  2000GB  1000GB               primary

[root@experior ~]#



[root@experior ~]# lsblk -t /dev/nvme0n1
NAME                      ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1                           0    512      0     512     512    0 none     1023 128    0B
├─nvme0n1p1                       0    512      0     512     512    0 none     1023 128    0B
└─nvme0n1p2                       0    512      0     512     512    0 none     1023 128    0B
  └─vg_VMstore-lv_Win10VM         0    512      0     512     512    0           128 128    0B
[root@experior ~]#



[root@experior ~]# pvs
  PV             VG         Fmt  Attr PSize   PFree
  /dev/nvme0n1p2 vg_VMstore lvm2 a--  931.50g    0 
[root@experior ~]#


[root@experior ~]# vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  vg_VMstore   1   1   0 wz--n- 931.50g    0 
[root@experior ~]#


[root@experior ~]# lvs
  LV         VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_Win10VM vg_VMstore -wi-ao---- 931.50g                                                    
[root@experior ~]#



[root@experior ~]# ls -alh /dev/disk/by-id | grep nvme
lrwxrwxrwx 1 root root   15 Feb 23 02:09 lvm-pv-uuid-TKF2kb-FUiD-elg1-3Oew-cCtc-xP2V-EjBesY -> ../../nvme0n1p2
lrwxrwxrwx 1 root root   13 Feb 23 02:09 nvme-eui.0025385911902877 -> ../../nvme0n1
lrwxrwxrwx 1 root root   15 Feb 23 02:09 nvme-eui.0025385911902877-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root   15 Feb 23 02:09 nvme-eui.0025385911902877-part2 -> ../../nvme0n1p2
lrwxrwxrwx 1 root root   13 Feb 23 02:09 nvme-Samsung_SSD_970_EVO_Plus_2TB_S59CNM0R905753H -> ../../nvme0n1
lrwxrwxrwx 1 root root   15 Feb 23 02:09 nvme-Samsung_SSD_970_EVO_Plus_2TB_S59CNM0R905753H-part1 -> ../../nvme0n1p1
lrwxrwxrwx 1 root root   15 Feb 23 02:09 nvme-Samsung_SSD_970_EVO_Plus_2TB_S59CNM0R905753H-part2 -> ../../nvme0n1p2
[root@experior ~]#




If you need any other info to attempt to reproduce, please do not hesitate to let me know.  Thanks for looking into this.

As a side note, I might do some other testing if/when I have some time, as I have recent backups of the other volume images (the primary boot volume is qcow2-backed on SSDs (BTRFS RAID-1) and the secondary data volume is qcow2-backed on HDDs (ZFS RAID-10)).  Those broke with io_uring a while back as well (I believe when I transitioned from Ubuntu 20.04 over to Arch, IIRC) and I haven't re-tested them since.
Comment 10 Jens Axboe 2023-02-23 20:47:09 UTC
Tried many things, cannot seem to reproduce it. Can you grab:

grep . /sys/block/nvme0n1/queue/*
grep . /sys/block/vg_VMstore-lv_Win10VM/queue/*

output for me? And the lv_Win10VM, is that using btrfs? Or some other fs?

Might also help if you can provide a full VM config that I can just run here. If I need to load up a Windows 10 guest to reproduce, I'll be happy to do that. Key here is just that it'd be great if it includes all the config I need and the command you run to boot it, so that we avoid any differences there.
Comment 11 Jens Axboe 2023-02-23 20:49:14 UTC
Oh, and I'm assuming vg_VMstore-lv_Win10VM is just linear? Here's how I set up my test:

dmsetup create test-linear --table '0 41943040 linear /dev/nvme1n1 0'
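
For reference, a sketch of how that test mapping could be checked and removed afterwards (assuming it was created as above; 41943040 512-byte sectors is a 20 GiB span):

dmsetup table test-linear
dmsetup remove test-linear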
Comment 12 Darrin Burger 2023-02-23 21:00:12 UTC
Output requested:


[root@experior ~]# grep . /sys/block/nvme0n1/queue/*
/sys/block/nvme0n1/queue/add_random:0
/sys/block/nvme0n1/queue/chunk_sectors:0
/sys/block/nvme0n1/queue/dax:0
/sys/block/nvme0n1/queue/discard_granularity:512
/sys/block/nvme0n1/queue/discard_max_bytes:2199023255040
/sys/block/nvme0n1/queue/discard_max_hw_bytes:2199023255040
/sys/block/nvme0n1/queue/discard_zeroes_data:0
/sys/block/nvme0n1/queue/dma_alignment:3
/sys/block/nvme0n1/queue/fua:1
/sys/block/nvme0n1/queue/hw_sector_size:512
/sys/block/nvme0n1/queue/io_poll:0
/sys/block/nvme0n1/queue/io_poll_delay:-1
/sys/block/nvme0n1/queue/iostats:1
/sys/block/nvme0n1/queue/io_timeout:30000
/sys/block/nvme0n1/queue/logical_block_size:512
/sys/block/nvme0n1/queue/max_discard_segments:256
/sys/block/nvme0n1/queue/max_hw_sectors_kb:2048
/sys/block/nvme0n1/queue/max_integrity_segments:0
/sys/block/nvme0n1/queue/max_sectors_kb:1280
/sys/block/nvme0n1/queue/max_segments:127
/sys/block/nvme0n1/queue/max_segment_size:4294967295
/sys/block/nvme0n1/queue/minimum_io_size:512
/sys/block/nvme0n1/queue/nomerges:0
/sys/block/nvme0n1/queue/nr_requests:1023
/sys/block/nvme0n1/queue/nr_zones:0
/sys/block/nvme0n1/queue/optimal_io_size:0
/sys/block/nvme0n1/queue/physical_block_size:512
/sys/block/nvme0n1/queue/read_ahead_kb:128
/sys/block/nvme0n1/queue/rotational:0
/sys/block/nvme0n1/queue/rq_affinity:1
/sys/block/nvme0n1/queue/scheduler:[none] mq-deadline kyber bfq 
/sys/block/nvme0n1/queue/stable_writes:0
/sys/block/nvme0n1/queue/throttle_sample_time:20
/sys/block/nvme0n1/queue/virt_boundary_mask:4095
/sys/block/nvme0n1/queue/wbt_lat_usec:2000
/sys/block/nvme0n1/queue/write_cache:write back
/sys/block/nvme0n1/queue/write_same_max_bytes:0
/sys/block/nvme0n1/queue/write_zeroes_max_bytes:2097152
/sys/block/nvme0n1/queue/zone_append_max_bytes:0
/sys/block/nvme0n1/queue/zoned:none
/sys/block/nvme0n1/queue/zone_write_granularity:0
[root@experior ~]#


[root@experior ~]# grep . /sys/block/vg_VMstore-lv_Win10VM/queue/*
grep: /sys/block/vg_VMstore-lv_Win10VM/queue/*: No such file or directory
[root@experior ~]#
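
For reference, the LV isn't listed under /sys/block by its mapper name; it shows up as a dm-N node instead. A sketch of pulling the same queue attributes, assuming the mapper symlink resolves to a dm device:

dm=$(basename "$(readlink -f /dev/mapper/vg_VMstore-lv_Win10VM)")
grep . /sys/block/"$dm"/queue/*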


I'll respond to the questions in my next post, and will submit the full VM config in a separate follow-up post shortly as well.


As a quick dumb side question, does Bugzilla support code blocks at all?
Comment 13 Darrin Burger 2023-02-23 21:02:20 UTC
Re the LVM volume: it's on top of a partition, with no underlying filesystem on Linux (it's just passed raw to the VM, where it's formatted NTFS within Win10).



Original commands to setup were as follows:

---
parted /dev/nvme0n1 mklabel gpt

parted -a optimal /dev/nvme0n1 mkpart primary 0% 50%
parted -a optimal /dev/nvme0n1 mkpart primary 50% 100%


pvcreate /dev/nvme0n1p2
vgcreate vg_VMstore /dev/nvme0n1p2
lvcreate -l +100%FREE vg_VMstore -n lv_Win10VM
---
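
To answer the question from comment 11, a sketch of how the segment type could be verified (assuming the names above; a plain LV created this way should report "linear"):

lvs -o lv_name,segtype vg_VMstore
dmsetup table vg_VMstore-lv_Win10VM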
Comment 14 Darrin Burger 2023-02-23 21:07:28 UTC
Created attachment 303775 [details]
MainDesktop1-Win10.xml

VM config attached (currently have "io=threads" instead of "io=io_uring" configured for sdc as mentioned previously due to issues).
Comment 15 Darrin Burger 2023-02-23 21:11:28 UTC
If you have any other questions or need any additional details regarding any of the other config (e.g. host hardware specs, storage setup for the other two volumes, etc.), please do not hesitate to let me know.


Also, just as a heads-up on versions, the liburing version is as follows (I forgot to include this previously):

c0de@experior:~$ yay -Q | grep liburing
liburing 2.3-1
c0de@experior:~$
Comment 16 Jens Axboe 2023-02-23 21:31:21 UTC
And please include instructions on how to launch it too. I don't generally use libvirt or XML for my QEMU configurations.

Comment 17 Darrin Burger 2023-02-23 22:41:44 UTC
Just out of curiosity, are you testing on Arch or a different distro?  On Arch, I'm using the qemu-full package, but I usually create and start/stop VMs via virt-manager (VMM), apart from custom parameters that need to be adjusted manually in the XML.


c0de@experior:~$ yay -Q | grep qemu-full
qemu-full 7.2.0-3

c0de@experior:~$ yay -Q | grep virt-manager
virt-manager 4.1.0-1


I believe the libvirt/QEMU logs show the full command line, so I'll see if I can grab that and throw it in an attachment in a follow-up post, although I suspect you'll need to edit some of the devices/paths/etc., since they're probably different on your system than on mine.  Also, if you need to know exactly how the other two storage volumes are set up, I can dig back through my notes to see if I can put together a process and the relevant settings for them, given that one is backed on BTRFS on SSDs and the other on ZFS on HDDs.
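
In the meantime, a couple of ways the full command line could be pulled, assuming the domain name matches the attached XML (MainDesktop1-Win10): libvirt records it in the per-domain log at VM start, and virsh can also translate the domain XML directly:

grep -m1 qemu-system-x86_64 /var/log/libvirt/qemu/MainDesktop1-Win10.log
virsh domxml-to-native qemu-argv --domain MainDesktop1-Win10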
Comment 18 Darrin Burger 2023-02-23 22:44:43 UTC
Created attachment 303776 [details]
MainDesktop1-Win10_QEMU_cmdline

As a quick note, the "MainDesktop1-Win10.xml" file would reside in the /etc/libvirt/qemu directory.
Comment 19 Darrin Burger 2023-02-23 22:46:47 UTC
Created attachment 303777 [details]
qemu_hooks

Also, while it's probably irrelevant to the current issue, I also have CPU isolation configured via "/etc/libvirt/hooks/qemu"; the contents are in the attachment.
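
The attached script isn't reproduced here; as context, a minimal hypothetical sketch of the common pattern for such a hook (libvirt passes the guest name and the operation as the first two arguments; the CPU ranges below are placeholders, not the values from the attachment):

#!/bin/bash
# /etc/libvirt/hooks/qemu -- hypothetical sketch, not the attached script
GUEST="$1"
OP="$2"
if [ "$GUEST" = "MainDesktop1-Win10" ]; then
    case "$OP" in
        # shrink the host's CPU set while the guest runs (placeholder ranges)
        prepare) systemctl set-property --runtime -- system.slice AllowedCPUs=0-5,8-13 ;;
        # restore the full CPU set when the guest is torn down (placeholder ranges)
        release) systemctl set-property --runtime -- system.slice AllowedCPUs=0-15 ;;
    esac
fi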
Comment 20 Darrin Burger 2023-03-04 15:46:52 UTC
Just as a quick follow-up, I'm still seeing the issue/behavior persist on kernel 6.2.2 as well.
Comment 21 Jens Axboe 2023-03-07 17:25:11 UTC
Sorry, I've been slow on this, and I'm OOO this week. I'll get back to this on Monday so we can get to the bottom of wtf is going on here.
Comment 22 Darrin Burger 2023-03-07 20:15:08 UTC
Jens, thank you for the update.  All good, I appreciate the assistance with this.  Once you have a chance to dig further into the issue, please do not hesitate to reach out if you need any additional information.