Bug 215679 - NVMe/writeback wb_workfn/blocked for more than 30 seconds
Summary: NVMe/writeback wb_workfn/blocked for more than 30 seconds
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: NVMe (show other bugs)
Hardware: x86-64 Linux
: P1 high
Assignee: IO/NVME Virtual Default Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-03-13 20:13 UTC by Imre Deak
Modified: 2022-09-13 20:42 UTC (History)
7 users (show)

See Also:
Kernel Version: 5.17.0-rc7
Tree: Mainline
Regression: Yes


Attachments
dmesg log after suspend resume, io stuck (80.18 KB, text/plain)
2022-03-13 20:13 UTC, Imre Deak
Details
dmesg log after suspend resume, io ok (68.33 KB, text/plain)
2022-03-13 20:15 UTC, Imre Deak
Details
attachment-31105-0.html (1.32 KB, text/html)
2022-05-06 12:33 UTC, keith.busch
Details
dmesg with the timeout on 5.18.0-rc5 (62.17 KB, patch)
2022-05-06 18:29 UTC, Imre Deak
Details | Diff
dmesg without the timeout on 5.18.0-rc5 (66.02 KB, patch)
2022-05-06 18:30 UTC, Imre Deak
Details | Diff
proposed fix (3.17 KB, patch)
2022-08-24 16:53 UTC, Keith Busch
Details | Diff

Description Imre Deak 2022-03-13 20:13:53 UTC
Created attachment 300564 [details]
dmesg log after suspend resume, io stuck

After system suspend/resume filesystem IO will stall, producing a 'kworker blocked for more than x sec" in dmesg, recovering after a long delay. See the attached dmesg-suspend-resume-nvme-stuck.txt. I also noticed the same issue happening right after booting or after runtime suspend transitions.

The same issue also happens on multiple SKL systems in the i915 team's CI farm, see:

https://gitlab.freedesktop.org/drm/intel/-/issues/4547

I bisected the problem to
commit 4f5022453acd0f7b28012e20b7d048470f129894
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Oct 18 08:45:39 2021 -0600

    nvme: wire up completion batching for the IRQ path

By reverting it on top of 5.17.0-rc7, I can't reproduce the problem. Attached   dmesg-suspend-resume-nvme-ok.txt with the revert, captured after a few suspend/resume.
Comment 1 Imre Deak 2022-03-13 20:15:00 UTC
Created attachment 300565 [details]
dmesg log after suspend resume, io ok
Comment 2 Keith Busch 2022-03-13 20:29:02 UTC
Thanks for the bisect, that helps a lot. It sounds like a completion is missed somewhere in the batch attempt, I'll take a look this week. I'm not sure if we're getting an rc8 or not, so any potential fix may have to wait for stable.
Comment 3 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-17 13:41:50 UTC
(In reply to Keith Busch from comment #2)
> Thanks for the bisect, that helps a lot. It sounds like a completion is
> missed somewhere in the batch attempt, I'll take a look this week. I'm not
> sure if we're getting an rc8 or not, so any potential fix may have to wait
> for stable.

So, we got a rc8; was any progress made to get this fixed? Otherwise I'll add it to the regression tracking so Linus gets aware of it.
Comment 4 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-04-09 13:06:47 UTC
(In reply to Keith Busch from comment #2)
> Thanks for the bisect, that helps a lot. It sounds like a completion is
> missed somewhere in the batch attempt, I'll take a look this week. I'm not
> sure if we're getting an rc8 or not, so any potential fix may have to wait
> for stable.

Keith, is this still on your todo list? It looks like nothing happend in about a month here.
Comment 5 Keith Busch 2022-04-14 16:03:42 UTC
Sorry for the delay. After not finding any issues from code inspection, I tried recreating this, but haven't had any luck. I'll post this to the mailing list since more people subscribe to that.
Comment 6 Keith Busch 2022-04-14 19:08:42 UTC
On Sun, Mar 13, 2022 at 08:13:53PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215679
> 
>             Bug ID: 215679
>            Summary: NVMe/writeback wb_workfn/blocked for more than 30
>                     seconds
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 5.17.0-rc7
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: NVMe
>           Assignee: io_nvme@kernel-bugs.kernel.org
>           Reporter: imre.deak@intel.com
>         Regression: No
> 
> Created attachment 300564 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=300564&action=edit
> dmesg log after suspend resume, io stuck
> 
> After system suspend/resume filesystem IO will stall, producing a 'kworker
> blocked for more than x sec" in dmesg, recovering after a long delay. See the
> attached dmesg-suspend-resume-nvme-stuck.txt. I also noticed the same issue
> happening right after booting or after runtime suspend transitions.
> 
> The same issue also happens on multiple SKL systems in the i915 team's CI
> farm,
> see:
> 
> https://gitlab.freedesktop.org/drm/intel/-/issues/4547
> 
> I bisected the problem to
> commit 4f5022453acd0f7b28012e20b7d048470f129894
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Mon Oct 18 08:45:39 2021 -0600
> 
>     nvme: wire up completion batching for the IRQ path
> 
> By reverting it on top of 5.17.0-rc7, I can't reproduce the problem. Attached 
> dmesg-suspend-resume-nvme-ok.txt with the revert, captured after a few
> suspend/resume.

Forwarding to linux-nvme for higher visibility.
Comment 7 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-04-21 11:27:48 UTC
Hi, this is your Linux kernel regression tracker. Top-posting for once,
to make this easily accessible to everyone.

Jens, this regression is now a month old and the culprit was already
identified when the issue was reported, nevertheless it seems no real
progress was made to address this, at least afaics. What's up there?
Should the culprit be reverted?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

#regzbot poke

On 14.04.22 21:08, Keith Busch wrote:
> On Sun, Mar 13, 2022 at 08:13:53PM +0000, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215679
>>
>>             Bug ID: 215679
>>            Summary: NVMe/writeback wb_workfn/blocked for more than 30
>>                     seconds
>>            Product: IO/Storage
>>            Version: 2.5
>>     Kernel Version: 5.17.0-rc7
>>           Hardware: x86-64
>>                 OS: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: high
>>           Priority: P1
>>          Component: NVMe
>>           Assignee: io_nvme@kernel-bugs.kernel.org
>>           Reporter: imre.deak@intel.com
>>         Regression: No
>>
>> Created attachment 300564 [details]
>>   --> https://bugzilla.kernel.org/attachment.cgi?id=300564&action=edit
>> dmesg log after suspend resume, io stuck
>>
>> After system suspend/resume filesystem IO will stall, producing a 'kworker
>> blocked for more than x sec" in dmesg, recovering after a long delay. See
>> the
>> attached dmesg-suspend-resume-nvme-stuck.txt. I also noticed the same issue
>> happening right after booting or after runtime suspend transitions.
>>
>> The same issue also happens on multiple SKL systems in the i915 team's CI
>> farm,
>> see:
>>
>> https://gitlab.freedesktop.org/drm/intel/-/issues/4547
>>
>> I bisected the problem to
>> commit 4f5022453acd0f7b28012e20b7d048470f129894
>> Author: Jens Axboe <axboe@kernel.dk>
>> Date:   Mon Oct 18 08:45:39 2021 -0600
>>
>>     nvme: wire up completion batching for the IRQ path
>>
>> By reverting it on top of 5.17.0-rc7, I can't reproduce the problem.
>> Attached  
>> dmesg-suspend-resume-nvme-ok.txt with the revert, captured after a few
>> suspend/resume.
> 
> Forwarding to linux-nvme for higher visibility.
>
Comment 8 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-05-06 11:55:41 UTC
Imre, I pretty unhappy how the maintainers handle this report. Sorry, things shouldn't take this long for a bisected regression. But before escalating the problem further, could you please confirm that the regression is still present in latest mainline, aka 5.18-rc5 or later?
Comment 9 keith.busch 2022-05-06 12:33:19 UTC
Created attachment 300895 [details]
attachment-31105-0.html

Take it to the list. As far as I can tell, this report isn't real, and I am
pretty unhappy with how you're handling this.

On Fri, May 6, 2022, 5:55 AM <bugzilla-daemon@kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=215679
>
> --- Comment #8 from The Linux kernel's regression tracker (Thorsten
> Leemhuis) (regressions@leemhuis.info) ---
> Imre, I pretty unhappy how the maintainers handle this report. Sorry,
> things
> shouldn't take this long for a bisected regression. But before escalating
> the
> problem further, could you please confirm that the regression is still
> present
> in latest mainline, aka 5.18-rc5 or later?
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You are on the CC list for the bug.
> You are watching the assignee of the bug.
Comment 10 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-05-06 14:30:53 UTC
(In reply to keith.busch from comment #9)

> Take it to the list.

Sorry, now I'm confused. Take if *off* the list or tracked regression reports? Or take it to the mailing list for discussion? But the latter would make sense if this is true:

> As far as I can tell, this report isn't real,

So I assume it's the former. A quick explanation on why you think this report isn't real (just a few words) would have been good (or is this somewhere mentioned in this bug report? sorry, then I missed it and can't see it now), but FWIW, if Imre wouldn't have replied I'd like have dropped the ball myself.

> and I am pretty unhappy with how you're handling this.

Why? From my point it looks a lot like this is a regression that is not handled appropriately. If there is a reason why this report is bogus then sorry, I just missed it.
Comment 11 Keith Busch 2022-05-06 16:12:13 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #10)
> (In reply to keith.busch from comment #9)
> 
> > Take it to the list.
> 
> Sorry, now I'm confused. Take if *off* the list or tracked regression
> reports? Or take it to the mailing list for discussion? But the latter would
> make sense if this is true:

The linux-nvme mailing list. bugzilla might be okay as a starting point, but the list is the appropriate place for nvme discussion.

> > As far as I can tell, this report isn't real,
> 
> So I assume it's the former. A quick explanation on why you think this
> report isn't real (just a few words) would have been good (or is this
> somewhere mentioned in this bug report? sorry, then I missed it and can't
> see it now), but FWIW, if Imre wouldn't have replied I'd like have dropped
> the ball myself.

Because no one reported it to the official mailing list.

But truthfully, I apologize for coming off as impatient in my previous message.
Comment 12 Imre Deak 2022-05-06 18:29:17 UTC
Created attachment 300897 [details]
dmesg with the timeout on 5.18.0-rc5

Fwiw, the problem is still reproducible on 5.18.0-rc5, and without the commit I mentioned it's not reproducible. Please find the logs attached.
Comment 13 Imre Deak 2022-05-06 18:30:21 UTC
Created attachment 300898 [details]
dmesg without the timeout on 5.18.0-rc5
Comment 14 Keith Busch 2022-05-06 21:07:23 UTC
(In reply to Imre Deak from comment #12)
> Fwiw, the problem is still reproducible on 5.18.0-rc5, and without the
> commit I mentioned it's not reproducible.

As far as I can tell, it looks like you're just booting with an nvme ext4 root partition. I have been unsuccessful at reproducing the observation. Do you have any other hints on how I might be able to recreate this?
Comment 15 Imre Deak 2022-05-06 21:43:00 UTC
I run 'rtcwake -m mem -s 7' right after booting to reproduce the problem. This leads to the timeout in the log already during the first or if not then during the second suspend/resume sequence; at least in some if not in all the cases during the suspend phase.

If it helps I can work on arranging remote access for you to this machine.
Comment 16 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-05-09 05:51:07 UTC
(In reply to Keith Busch from comment #11)
>
> The linux-nvme mailing list. bugzilla might be okay as a starting point, but
> the list is the appropriate place for nvme discussion.

My problem is: I have to deal with lot of subsystems and developers and some of them prefer to keep things in bugzilla. I can't keep track of who prefer what, so I just look and try to continue in the place where they interacted (to the best of my judgement, which sometimes is wrong).
 
> [...] 
> But truthfully, I apologize for coming off as impatient in my previous
> message.

No worries!
Comment 17 Keith Busch 2022-05-09 15:12:16 UTC
(In reply to Imre Deak from comment #15)
> If it helps I can work on arranging remote access for you to this machine.

I'm not having any luck running in virtual machines, but I don't have any real hardware that I could attempt this kind of test. If you want to send me login details privately, you can reach me at kbusch@kernel.org and I'll take a look.
Comment 18 Imre Deak 2022-05-10 10:49:27 UTC
Keith, ok. Before doing that could you check the stats below, if with that it's worth continuing debugging the issue, or we should just replace the module (since these particular available_spare etc. metrics could cause the problem)?

testrunner@skl-6600u:~$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0x1
temperature : 34 C
available_spare : 44%
available_spare_threshold : 50%
percentage_used : 39%
data_units_read : 153 585 846
data_units_written : 152 504 340
host_read_commands : 2 017 620 857
host_write_commands : 3 581 240 952
controller_busy_time : 67 613
power_cycles : 179 437
power_on_hours : 36 652
unsafe_shutdowns : 4 206
media_errors : 0
num_err_log_entries : 30
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 34 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
Comment 19 Keith Busch 2022-05-10 13:59:31 UTC
(In reply to Imre Deak from comment #18)
> Keith, ok. Before doing that could you check the stats below, if with that
> it's worth continuing debugging the issue, or we should just replace the
> module (since these particular available_spare etc. metrics could cause the
> problem)?

That doesn't sound related to your observation. Even if the hardware is becoming flakey as it ages, that shouldn't cause a stuck task.
Comment 20 Tomi Sarvela 2022-05-11 10:49:59 UTC
I'll take this hardware under changes. First thing to do is update the Dell BIOS. This might not be direct cause, but shouldn't hurt either, and it's more comfortable to work on newest BIOS.

If the issue persists after updating BIOS, I'll see if NVMe can be changed easily. This NVMe seems to be towards the end of its life, over half of the spare blocks are already used. This maybe shouldn't cause delays in writes, but very often it does. Reason could also relate to NVMe Firmware (I haven't checked if Toshiba offers updates). Changing the NVMe will clear these problems.

If BIOS update and NVMe change are not helping, we'll try to arrange remote access to this host.
Comment 21 Tomi Sarvela 2022-05-11 13:48:20 UTC
I've updated the BIOS and changed the NVMe: the BIOS was pretty vocal (audibly so) about the health of the NVMe.

This will probably fix the issue, as it seemed to be very NVMe related. I have the old half-spares-gone Toshiba NVMe drive if there's interest investigating driver functionality with faulty drive.
Comment 22 Tomi Sarvela 2022-05-13 12:34:32 UTC
Dell XPS 13 fi-skl-6600u is back in the farm, and looks all green:
https://intel-gfx-ci.01.org/tree/drm-tip/fi-skl-6600u.html

I tried to reproduce the issue with 'rtcwake -m mem -s 7' but no luck.

i915-CI shard-skl hosts mentioned in fd.o#4547 issue are Chromebooks with eMMC: the problem is probably different there.
Comment 23 Imre Deak 2022-05-13 14:31:05 UTC
Okay then, the BIOS update and NVMe module change seem to have fixed the issue, I'm fine to close this ticket.
Comment 24 Orange Kao 2022-08-24 02:02:51 UTC
Hello.

We had a similar observation (stuck workqueues with kcryptd_crypt
[dm_crypt]) unrelated to the problematic system suspend/resume
filesystem IO, which lead us to bisect to the very same problematic
commit

commit 4f5022453acd0f7b28012e20b7d048470f129894
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Oct 18 08:45:39 2021 -0600

    nvme: wire up completion batching for the IRQ path

Reverting commit 4f5022453acd0f7b28012e20b7d048470f129894 on kernel
versions >= v5.16-rc1 then resolves the problem, and the system does
not get stuck anymore.

Our customer's workloads are rather complex, but the steps shown below
can reproduce this problem, which appears to be prevalent on GCP
n2d-standard-2 machine types with local NVMe SSDs. It results in
instant high iowait and getting stuck on a simple fio load test
following the steps shown below.

Please let us know what you think or if you need more help reproducing
this issue.


== Hardware affected ==

GCP n2d-standard-2 machine type with local NVMe SSD attached


== Minimal steps to reproduce the problem ==

echo 128 > /sys/block/nvme0n1/queue/nr_requests

fio --filename=/dev/nvme0n1 --size=12G --direct=1 --rw=randrw \
    --bs=4k --ioengine=libaio --iodepth=256 --numjobs=4 \
    --name=iops-test-job --eta-newline=1


== Expected result ==

- fio benchmark can finish in 2 minutes.

- When fio is running, Ctrl+C can stop fio, CPU usage will drop
  immediately.


== Actual result ==

- fio benchmark will never finish, with ETA continue to increase.

- During the test, high iowait (50% or 99%) according to "mpstat 1".

- Ctrl+C and "kill -9 fio" cannot stop the task.

- Graceful reboot may not work. The only reliable way to reboot is
  running "echo b > /proc/sysrq-trigger"


== Unaffected kernel version ==

- v5.15 (8bb7eca972ad531c9b149c0a51ab43a417385813)


== Affected kernel versions ==

Running above steps can reproduce the issue with the following kernel
versions:

- v5.16 (df0cc57e057f18e44dac8e6c18aba47ab53202f9)
- v5.17.12
- v5.18.15
- v6.0-rc1-160-g4c2d0b039c5c


== Problematic commit ==

Commit 4f5022453acd0f7b28012e20b7d048470f129894 (nvme: wire up
completion batching for the IRQ path) introduced the problem and got
merged at 33c8846c814c1c27c6e33af005042d15061f948b (Merge tag
'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block).

After reverting 4f502245, the issue becomes irreproducible.


== Additional information - nr_requests and iodepth correlation ==

If /sys/block/nvme0n1/queue/nr_requests is set to values >= 256 or
<= 8, the issue is very unlikely to materialize.

When nr_requests is set to values between 16~128, one can observe a
correlation between fio's "--iodepth" argument and reproducing the
issue, as shown in the table below:

 nr_requests |     iodepth | reproducible? | Note
-------------+-------------+---------------+-----------------
        1023 | up to 16384 |            no | default nr_requests
         512 | up to 16384 |            no |
         256 | up to 16384 |            no |
         128 |       <= 64 |            no |
         128 |      >= 128 |           yes |
          64 |       <= 16 |            no |
          64 |       >= 32 |           yes |
          32 |       <=  8 |            no |
          32 |       >= 16 |           yes |
          16 |       <=  8 |            no |
          16 |       >= 16 |           yes |
           8 | up to 16384 |            no |
           4 | up to 16384 |            no |
         

== Detailed steps to reproduce the issue ==

1. In GCP console
   (https://console.cloud.google.com/compute/instancesAdd), start a VM
   with

   Region: us-west1
   Series: N2D
   Machine type: n2d-standard-2
   
   In "Boot disk", click "change"
       OS: "Fedora"
       Version: "Fedora Cloud Base 35 35-20211026.0"
       Click "select"

   In "advanced options" -> "Disks" -> "Add local SSD"
       Interface: NVMe
       Disk capacity: 375GB (1 * 375GB)
       Click "save"

   Click "Create"

2. SSH to the VM, update the kernel to 5.18.15

3. Issue the following commands

   sudo -s

   dnf -y install fio

   echo 128 > /sys/block/nvme0n1/queue/nr_requests

   fio --filename=/dev/nvme0n1 --size=12G --direct=1 --rw=randrw \
       --bs=4k --ioengine=libaio --iodepth=256 --numjobs=4 \
       --name=iops-test-job --eta-newline=1


== Additional information - tested Linux distributions and kernels ==

Kernels unaffected:

* v5.14.10 on Fedora 35 (5.14.10-300.fc35.x86_64)

* 5.15.0-1016-gcp on Ubuntu 22.04


Kernels affected:

* 5.18.15 (Fedora 35, 5.18.15-100.fc35.x86_64)

* v6.0-rc1-160-g4c2d0b039c5c (Fedora 35, and install
  6.0.0-0.rc1.20220819git4c2d0b039c5c.16.vanilla.1.fc35.x86_64 from
  kernel-vanilla-mainline repo)

* 5.18.17 (Ubuntu 22.04, and install
  https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.18.17/amd64/ )


== Additional information - tested GCP machine types ==

GCP machine types unaffected:

* n1-standard-1 (1 CPU)

GCP machine types affected:

* n1-standard-2 (2 CPU)

* n2-standard-2 (2 CPU)

* n2d-standard-2 (2 CPU)

Note: adding maxcpus=1 to kernel command line to affected GCP machine types
      will not change the result.
Comment 25 Keith Busch 2022-08-24 02:36:16 UTC
You should try adding this patch:

https://lore.kernel.org/lkml/20220803121504.212071-1-yukuai1@huaweicloud.com/

Or just use axboe's 6.1 block tree:

https://git.kernel.dk/cgit/linux-block/log/?h=for-6.1/block
Comment 26 Orange Kao 2022-08-24 07:38:37 UTC
Hello.

I have tested axboe's 6.1 block tree
  https://git.kernel.dk/cgit/linux-block/log/?h=for-6.1/block
The issue is still reproducible

I also tried 
  [PATCH] sbitmap: fix possible io hung due to lost wakeup
  https://lore.kernel.org/lkml/20220803121504.212071-1-yukuai1@huaweicloud.com/
based on
  5.18.15 (Fedora 35)
  v6.0-rc2-16-g072e51356cd5
      (072e51356cd5a4a1c12c1020bc054c99b98333df from 
       https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git )
The issue is still reproducible

Thank you.
Comment 27 Keith Busch 2022-08-24 16:40:03 UTC
Thanks for confirming. I think I know what's going on now. Your table of the requests/depth proved to be useful, and I'll send a proposal shortly.
Comment 28 Keith Busch 2022-08-24 16:53:04 UTC
Created attachment 301644 [details]
proposed fix

Try this patch. It should fix the issue.
Comment 29 Orange Kao 2022-08-25 06:32:53 UTC
Hi Keith,

Thank you. Your patch works!


I've tested the following combinations

Based on
  - 5.18.15
  - 5.18.19
then apply
  [PATCH] sbitmap: fix possible io hung due to lost wakeup
  https://lore.kernel.org/lkml/20220803121504.212071-1-yukuai1@huaweicloud.com/
then apply
  attachment 301644 [details]


I also tried
  https://git.kernel.dk/cgit/linux-block/log/?h=for-6.1/block
then apply
  attachment 301644 [details]


The issue is not reproducible


I also tried nr_requests 32, 64, 128 with fio iodepth up to 16384,
the issue is not reproducible
Comment 30 Keith Busch 2022-08-25 13:12:18 UTC
Thanks, I'll work with the community on getting something like this applied. The proposal attached has an unlikely race condition, so I have to make some adjustments. The proposal should be fine for nvme since it runs interrupt completions single threaded, but I have to make it thread safe since this is a generic library.
Comment 31 Thomas Krennwallner 2022-09-04 18:15:39 UTC
We built a kernel with https://git.kernel.dk/cgit/linux-block/patch/?id=16ede66973c84f890c03584f79158dd5b2d725f5 (which is v3 of your patch: [PATCHv3] sbitmap: fix batched wait_cnt accounting from https://lore.kernel.org/r/20220825145312.1217900-1-kbusch@fb.com), but we ended up in soft lockups when setting /sys/block/nvme0n1/queue/nr_requests to low values like 128 (maybe also with default settings, but didn't try).

Now I built a kernel with v2 of the patch ([PATCHv2] sbitmap: fix batched wait_cnt accounting from https://lore.kernel.org/all/20220824201440.127782-1-kbusch@fb.com/) and this one now does not experience the lockups for low (128) and default settings of /sys/block/nvme0n1/queue/nr_requests.

So something changed in patch v2 -> v3.

From dmesg we can see we are getting stuck in sbitmap_queue_wake_up when called from blk_mq_get_tag when running a kernel with v3 of the patch:

[Sat Sep  3 21:05:44 2022] NMI watchdog: Watchdog detected hard LOCKUP on cpu 5
[Sat Sep  3 21:05:44 2022] RIP: 0010:sbitmap_queue_clear_batch+0xbb/0x100
...
[Sat Sep  3 21:05:44 2022] Call Trace:
[Sat Sep  3 21:05:44 2022]  <IRQ>
[Sat Sep  3 21:05:44 2022]  blk_mq_end_request_batch+0x383/0x440
[Sat Sep  3 21:05:44 2022]  ? enqueue_entity+0x1a1/0x520
[Sat Sep  3 21:05:44 2022]  nvme_irq+0x5f/0x70 [nvme]
[Sat Sep  3 21:05:44 2022]  ? nvme_resume+0x70/0x70 [nvme]
...
[Sat Sep  3 21:05:53 2022] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [dmcrypt_write/2:8278]
...
[Sat Sep  3 21:05:53 2022] Call Trace:
[Sat Sep  3 21:05:53 2022]  <TASK>
[Sat Sep  3 21:05:53 2022]  sbitmap_queue_wake_up+0x11/0x20
[Sat Sep  3 21:05:53 2022]  blk_mq_get_tag+0x18e/0x2a0
[Sat Sep  3 21:05:53 2022]  ? dequeue_task_stop+0x70/0x70
...
[Sat Sep  3 21:06:21 2022] Call Trace:
[Sat Sep  3 21:06:21 2022]  <TASK>
[Sat Sep  3 21:06:21 2022]  sbitmap_queue_wake_up+0x11/0x20
[Sat Sep  3 21:06:21 2022]  blk_mq_get_tag+0x18e/0x2a0
Comment 32 Keith Busch 2022-09-04 19:38:24 UTC
Yah, Jens tells me the same thing. I'll fix it up on Tuesday.
Comment 33 Thomas Krennwallner 2022-09-13 20:11:43 UTC
v5 of your patch works now for us, thanks!

We built the kernel with the following commits from https://git.kernel.dk/cgit/linux-block/log/?h=for-6.1/block:

- https://git.kernel.dk/cgit/linux-block/patch/?id=040b83f (sbitmap: fix possible io hung due to lost wakeup)

- https://git.kernel.dk/cgit/linux-block/patch/?id=48c0333 (sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up())

- https://git.kernel.dk/cgit/linux-block/patch/?id=4acb834 (sbitmap: fix batched wait_cnt accounting)
Comment 34 Keith Busch 2022-09-13 20:42:29 UTC
Awesome, thank you for following the git tree and confirming. This isn't a clean backport, so I'll be sure to follow up with stable on the that after the 6.1 merge window opens.

Note You need to log in before you can comment on or make changes to this bug.