Bug 218042 - queue/scheduler missing under nvmf block device
Summary: queue/scheduler missing under nvmf block device
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P3 low
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-23 15:28 UTC by michallinuxstuff
Modified: 2023-10-24 14:57 UTC
CC List: 3 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description michallinuxstuff 2023-10-23 15:28:01 UTC
Noticed that under 6.5.6 (Fedora build, 6.5.6-100.fc37.x86_64) the queue/scheduler attribute is not visible under a namespace block device connected over nvme-fabrics.

# readlink -f /sys/block/nvme0n1
/sys/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1
# grep . /sys/devices/virtual/nvme-subsystem/nvme-subsys0/*/transport
/sys/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0/transport:rdma
/sys/devices/virtual/nvme-subsystem/nvme-subsys0/nvme1/transport:rdma
# [[ -e /sys/block/nvme0n1/queue/scheduler ]] || echo oops
oops

What's a bit confusing is that each of the ctrls attached to this subsystem also exposes an nvme*c*n1 device. These are marked as hidden under sysfs, hence not available as actual block devices (i.e. not present under /dev/). That said, these devices do have the queue/scheduler attribute available under sysfs.

# readlink -f /sys/block/nvme0*c*
/sys/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0c0n1
/sys/devices/virtual/nvme-fabrics/ctl/nvme1/nvme0c1n1
# readlink -f  /sys/block/nvme0*c*/queue/scheduler
/sys/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0c0n1/queue/scheduler
/sys/devices/virtual/nvme-fabrics/ctl/nvme1/nvme0c1n1/queue/scheduler
# grep . /sys/block/nvme0*c*/queue/scheduler
/sys/block/nvme0c0n1/queue/scheduler:[none] mq-deadline kyber bfq
/sys/block/nvme0c1n1/queue/scheduler:[none] mq-deadline kyber bfq


I have a little test infra which normally, after the nvmf connection is established, takes the namespace device, sets some of its sysfs attributes to specific values (including queue/scheduler), and then runs fio against that namespace device.
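
Roughly, the flow looks like this (the transport details, NQN and fio options are made up for illustration, not the exact commands the infra runs):

# nvme connect -t rdma -a 192.168.0.1 -s 4420 -n nqn.2016-06.io.example:testsubsys
# echo mq-deadline > /sys/block/nvme0n1/queue/scheduler    <- the step that no longer works, the attribute is missing
# fio --name=test --filename=/dev/nvme0n1 --rw=randread --bs=4k --iodepth=32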

The only clue I found is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d85ebf95c44e, but I am not sure what to make of it. My initial thought was "ok, queue/scheduler is gone, so just don't touch it". But if the c*n* devices still have this attribute available, are they meant to be used instead of the actual namespace device to tweak these specific sysfs attributes?

The problem here is that I have two c*n* devices but only a single block device (multipath setup). Does that mean that changing either of those devices' attributes affects the actual namespace device, or is each path independent here?

Any hints would be appreciated. :)

Regards,
Michal
Comment 1 Bagas Sanjaya 2023-10-24 00:27:09 UTC
(In reply to michallinuxstuff from comment #0)
> Noticed that under 6.5.6 (Fedora build, 6.5.6-100.fc37.x86_64) the
> queue/scheduler attribute is not visible under a namespace block device
> connected over nvme-fabrics.
> 
> # readlink -f /sys/block/nvme0n1
> /sys/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1
> # grep . /sys/devices/virtual/nvme-subsystem/nvme-subsys0/*/transport
> /sys/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0/transport:rdma
> /sys/devices/virtual/nvme-subsystem/nvme-subsys0/nvme1/transport:rdma
> # [[ -e /sys/block/nvme0n1/queue/scheduler ]] || echo oops
> oops
> 
> What's a bit confusing is that each of the ctrls attached to this subsystem
> also exposes an nvme*c*n1 device. These are marked as hidden under sysfs,
> hence not available as actual block devices (i.e. not present under /dev/).
> That said, these devices do have the queue/scheduler attribute available
> under sysfs.
> 
> # readlink -f /sys/block/nvme0*c*
> /sys/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0c0n1
> /sys/devices/virtual/nvme-fabrics/ctl/nvme1/nvme0c1n1
> # readlink -f  /sys/block/nvme0*c*/queue/scheduler
> /sys/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0c0n1/queue/scheduler
> /sys/devices/virtual/nvme-fabrics/ctl/nvme1/nvme0c1n1/queue/scheduler
> # grep . /sys/block/nvme0*c*/queue/scheduler
> /sys/block/nvme0c0n1/queue/scheduler:[none] mq-deadline kyber bfq
> /sys/block/nvme0c1n1/queue/scheduler:[none] mq-deadline kyber bfq
> 
> 
> I have a little test infra which normally, after the nvmf connection is
> established, takes the namespace device, sets some of its sysfs attributes
> to specific values (including queue/scheduler), and then runs fio against
> that namespace device.
> 
> The only clue I found is
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=6d85ebf95c44e, but I am not sure what to make of it. My initial thought
> was "ok, queue/scheduler is gone, so just don't touch it". But if the c*n*
> devices still have this attribute available, are they meant to be used
> instead of the actual namespace device to tweak these specific sysfs
> attributes?
> 

Can you try reverting 6d85ebf95c44e on top of current mainline (v6.6-rc7)?
Comment 2 Hannes Reinecke 2023-10-24 07:55:45 UTC
Doesn't really help.

The native nvme multipath devices are bio-based devices, and as such don't have a scheduler attached to them. So even if you reverted the mentioned patch, that wouldn't change: the sysfs attribute might be visible again, but any modifications would be ignored.

BTW, the same thing happens when you switch to bio-based dm-multipathing.

Guess you need to fix up your tooling.
Comment 3 michallinuxstuff 2023-10-24 09:13:27 UTC
(In reply to Hannes Reinecke from comment #2)
> Doesn't really help.
> 
> The native nvme multipath devices are bio-based devices, and as such don't
> have a scheduler attached to them. So even if you reverted the mentioned
> patch, that wouldn't change: the sysfs attribute might be visible again, but
> any modifications would be ignored.
> 
> BTW, the same thing happens when you switch to bio-based dm-multipathing.
> 
> Guess you need to fix up your tooling.

Appreciate the feedback. :) 

Though I am still not clear on the purpose of the nvmeXcXnX devices. For the above case, when you say "native multipath devices", are you referring to the actual /dev/nvme0n1 device or the nvmeXcXnX ones? I'm asking since the latter still have the queue/scheduler attribute visible. What would be the effect of modifying it? Would it be ignored as well? From the user's standpoint, the write seems to complete successfully and afterwards the selected scheduler is marked as "in-use" ([scheduler]).
Comment 4 Keith Busch 2023-10-24 14:45:45 UTC
The 'nvmeXcYnZ' devices are individual paths to the multipath device 'nvmeXnZ'. The multipath device, nvmeXnZ, is bio-based, but schedulers operate on a different structure called a "request", which doesn't exist at the nvme multipath level. Even if you export the scheduler attribute and show what is in use, no requests will ever be allocated at this layer; the scheduler is bypassed here.

You should be able to set IO schedulers on the individual paths if you want to.
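
Something along these lines should do it (the scheduler name is just an example, using the path device names from comment #0):

# for sched in /sys/block/nvme0c*n1/queue/scheduler; do echo mq-deadline > "$sched"; done
# grep . /sys/block/nvme0c*n1/queue/scheduler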
Comment 5 michallinuxstuff 2023-10-24 14:57:47 UTC
(In reply to Keith Busch from comment #4)
> The 'nvmeXcYnZ' devices are individual paths to the multipath device
> 'nvmeXnZ'. The multipath device, nvmeXnZ, is bio-based, but schedulers
> operate on a different structure called a "request", which doesn't exist at
> the nvme multipath level. Even if you export the scheduler attribute and
> show what is in use, no requests will ever be allocated at this layer; the
> scheduler is bypassed here.
> 
> You should be able to set IO schedulers on the individual paths if you want
> to.

Thank you for the explanation, greatly appreciated!
