Created attachment 300564 [details]
dmesg log after suspend resume, io stuck

After system suspend/resume, filesystem IO stalls, producing a "kworker blocked for more than X seconds" warning in dmesg, and only recovers after a long delay. See the attached dmesg-suspend-resume-nvme-stuck.txt. I also noticed the same issue happening right after booting and after runtime suspend transitions.

The same issue also happens on multiple SKL systems in the i915 team's CI farm, see:

https://gitlab.freedesktop.org/drm/intel/-/issues/4547

I bisected the problem to

commit 4f5022453acd0f7b28012e20b7d048470f129894
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Oct 18 08:45:39 2021 -0600

    nvme: wire up completion batching for the IRQ path

By reverting it on top of 5.17.0-rc7, I can't reproduce the problem. Attached dmesg-suspend-resume-nvme-ok.txt with the revert, captured after a few suspend/resume cycles.
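For completeness, the revert test was nothing fancier than the sketch below (assumes a mainline git checkout and that the revert still applies without conflicts), followed by rebuilding and booting that kernel:

  # check out 5.17-rc7 and revert the bisected commit
  git checkout v5.17-rc7
  git revert 4f5022453acd0f7b28012e20b7d048470f129894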
Created attachment 300565 [details]
dmesg log after suspend resume, io ok
Thanks for the bisect, that helps a lot. It sounds like a completion is missed somewhere in the batch attempt, I'll take a look this week. I'm not sure if we're getting an rc8 or not, so any potential fix may have to wait for stable.
(In reply to Keith Busch from comment #2)
> Thanks for the bisect, that helps a lot. It sounds like a completion is
> missed somewhere in the batch attempt, I'll take a look this week. I'm not
> sure if we're getting an rc8 or not, so any potential fix may have to wait
> for stable.

So, we got an rc8; was any progress made on getting this fixed? Otherwise I'll add it to the regression tracking so Linus becomes aware of it.
(In reply to Keith Busch from comment #2)
> Thanks for the bisect, that helps a lot. It sounds like a completion is
> missed somewhere in the batch attempt, I'll take a look this week. I'm not
> sure if we're getting an rc8 or not, so any potential fix may have to wait
> for stable.

Keith, is this still on your to-do list? It looks like nothing has happened here in about a month.
Sorry for the delay. After not finding any issues from code inspection, I tried recreating this, but haven't had any luck. I'll post this to the mailing list since more people subscribe to that.
On Sun, Mar 13, 2022 at 08:13:53PM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215679
>
>             Bug ID: 215679
>            Summary: NVMe/writeback wb_workfn/blocked for more than 30
>                     seconds
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 5.17.0-rc7
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: NVMe
>           Assignee: io_nvme@kernel-bugs.kernel.org
>           Reporter: imre.deak@intel.com
>         Regression: No
>
> Created attachment 300564 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=300564&action=edit
> dmesg log after suspend resume, io stuck
>
> [...]

Forwarding to linux-nvme for higher visibility.
Hi, this is your Linux kernel regression tracker. Top-posting for once, to make this easily accessible to everyone.

Jens, this regression is now a month old and the culprit was already identified when the issue was reported, nevertheless it seems no real progress was made to address this, at least afaics. What's up there? Should the culprit be reverted?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of reports and sometimes miss something important when writing mails like this. If that's the case here, don't hesitate to tell me in a public reply, it's in everyone's interest to set the public record straight.

#regzbot poke

On 14.04.22 21:08, Keith Busch wrote:
> On Sun, Mar 13, 2022 at 08:13:53PM +0000, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215679
>>
>> [...]
>
> Forwarding to linux-nvme for higher visibility.
Imre, I'm pretty unhappy with how the maintainers have handled this report. Sorry, things shouldn't take this long for a bisected regression. But before escalating the problem further, could you please confirm that the regression is still present in the latest mainline, aka 5.18-rc5 or later?
Take it to the list. As far as I can tell, this report isn't real, and I am pretty unhappy with how you're handling this.

On Fri, May 6, 2022, 5:55 AM <bugzilla-daemon@kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215679
>
> --- Comment #8 from The Linux kernel's regression tracker (Thorsten
> Leemhuis) (regressions@leemhuis.info) ---
>
> [...]
(In reply to keith.busch from comment #9)

> Take it to the list.

Sorry, now I'm confused. Take it *off* the list of tracked regression reports? Or take it to the mailing list for discussion? But the latter would make sense if this is true:

> As far as I can tell, this report isn't real,

So I assume it's the former. A quick explanation of why you think this report isn't real (just a few words) would have been good (or is this mentioned somewhere in this bug report? sorry, then I missed it and can't see it now), but FWIW, if Imre hadn't replied I'd likely have dropped the ball myself.

> and I am pretty unhappy with how you're handling this.

Why? From my point of view it looks a lot like this is a regression that is not being handled appropriately. If there is a reason why this report is bogus then sorry, I just missed it.
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #10)
> (In reply to keith.busch from comment #9)
>
> > Take it to the list.
>
> Sorry, now I'm confused. Take it *off* the list of tracked regression
> reports? Or take it to the mailing list for discussion? But the latter would
> make sense if this is true:

The linux-nvme mailing list. bugzilla might be okay as a starting point, but the list is the appropriate place for nvme discussion.

> > As far as I can tell, this report isn't real,
>
> So I assume it's the former. A quick explanation of why you think this
> report isn't real (just a few words) would have been good (or is this
> mentioned somewhere in this bug report? sorry, then I missed it and can't
> see it now), but FWIW, if Imre hadn't replied I'd likely have dropped the
> ball myself.

Because no one reported it to the official mailing list. But truthfully, I apologize for coming off as impatient in my previous message.
Created attachment 300897 [details]
dmesg with the timeout on 5.18.0-rc5

Fwiw, the problem is still reproducible on 5.18.0-rc5, and without the commit I mentioned it's not reproducible. Please find the logs attached.
Created attachment 300898 [details]
dmesg without the timeout on 5.18.0-rc5
(In reply to Imre Deak from comment #12)
> Fwiw, the problem is still reproducible on 5.18.0-rc5, and without the
> commit I mentioned it's not reproducible.

As far as I can tell, it looks like you're just booting with an nvme ext4 root partition. I have been unsuccessful at reproducing the observation. Do you have any other hints on how I might be able to recreate this?
I run 'rtcwake -m mem -s 7' right after booting to reproduce the problem. This triggers the timeout in the log already during the first suspend/resume sequence, or if not then during the second one; in at least some (if not all) cases it happens during the suspend phase.

If it helps, I can work on arranging remote access for you to this machine.
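For completeness, something like the loop below is enough to exercise this (just a sketch; the cycle count and the dmesg pattern I grep for are arbitrary):

  # suspend/resume a few times and check dmesg for blocked-task warnings
  for i in 1 2 3 4 5; do
      sudo rtcwake -m mem -s 7
      sleep 5
      dmesg | grep "blocked for more than" && { echo "hit on cycle $i"; break; }
  done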
(In reply to Keith Busch from comment #11)
> The linux-nvme mailing list. bugzilla might be okay as a starting point, but
> the list is the appropriate place for nvme discussion.

My problem is: I have to deal with a lot of subsystems and developers, and some of them prefer to keep things in bugzilla. I can't keep track of who prefers what, so I just look and try to continue in the place where they interacted (to the best of my judgement, which sometimes is wrong).

> [...]
> But truthfully, I apologize for coming off as impatient in my previous
> message.

No worries!
(In reply to Imre Deak from comment #15)
> If it helps, I can work on arranging remote access for you to this machine.

I'm not having any luck reproducing this in virtual machines, and I don't have any real hardware on which I could attempt this kind of test. If you want to send me login details privately, you can reach me at kbusch@kernel.org and I'll take a look.
Keith, ok. Before doing that, could you check the stats below and tell whether it's worth continuing to debug the issue, or whether we should just replace the module (since these particular available_spare etc. metrics could themselves be causing the problem)?

testrunner@skl-6600u:~$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0x1
temperature                         : 34 C
available_spare                     : 44%
available_spare_threshold           : 50%
percentage_used                     : 39%
data_units_read                     : 153 585 846
data_units_written                  : 152 504 340
host_read_commands                  : 2 017 620 857
host_write_commands                 : 3 581 240 952
controller_busy_time                : 67 613
power_cycles                        : 179 437
power_on_hours                      : 36 652
unsafe_shutdowns                    : 4 206
media_errors                        : 0
num_err_log_entries                 : 30
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 34 C
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
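For what it's worth, the fields I'm worried about can be filtered out of that same output with a quick grep (nothing more than a convenience over the text output above; if I read the spec right, critical_warning 0x1 is the available-spare-below-threshold bit, consistent with available_spare 44% being below the 50% threshold):

  sudo nvme smart-log /dev/nvme0n1 | \
      grep -E 'critical_warning|available_spare|percentage_used|unsafe_shutdowns'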
(In reply to Imre Deak from comment #18)
> Keith, ok. Before doing that, could you check the stats below and tell
> whether it's worth continuing to debug the issue, or whether we should just
> replace the module (since these particular available_spare etc. metrics
> could themselves be causing the problem)?

That doesn't sound related to your observation. Even if the hardware is becoming flakey as it ages, that shouldn't cause a stuck task.
I'll take this hardware in for changes. The first thing to do is update the Dell BIOS. This might not be the direct cause, but it shouldn't hurt either, and it's more comfortable to work on the newest BIOS.

If the issue persists after updating the BIOS, I'll see if the NVMe can be changed easily. This NVMe seems to be towards the end of its life: over half of the spare blocks are already used. This maybe shouldn't cause delays in writes, but very often it does. The reason could also be related to the NVMe firmware (I haven't checked if Toshiba offers updates). Changing the NVMe will clear these problems.

If the BIOS update and NVMe change don't help, we'll try to arrange remote access to this host.
I've updated the BIOS and changed the NVMe: the BIOS was pretty vocal (audibly so) about the health of the old NVMe. This will probably fix the issue, as it seemed to be very NVMe related. I still have the old half-spares-gone Toshiba NVMe drive if there's interest in investigating driver behaviour with a faulty drive.
The Dell XPS 13 fi-skl-6600u is back in the farm, and looks all green:

https://intel-gfx-ci.01.org/tree/drm-tip/fi-skl-6600u.html

I tried to reproduce the issue with 'rtcwake -m mem -s 7' but had no luck. The i915-CI shard-skl hosts mentioned in the fd.o#4547 issue are Chromebooks with eMMC: the problem there is probably a different one.
Okay then, the BIOS update and NVMe module change seem to have fixed the issue; I'm fine with closing this ticket.
Hello. We had a similar observation (stuck workqueues with kcryptd_crypt [dm_crypt]), unrelated to the system suspend/resume filesystem IO problem above, which led us to bisect to the very same problematic commit:

commit 4f5022453acd0f7b28012e20b7d048470f129894
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Oct 18 08:45:39 2021 -0600

    nvme: wire up completion batching for the IRQ path

Reverting commit 4f5022453acd0f7b28012e20b7d048470f129894 on kernel versions >= v5.16-rc1 resolves the problem, and the system no longer gets stuck.

Our customer's workloads are rather complex, but the steps shown below can reproduce this problem, which appears to be prevalent on GCP n2d-standard-2 machine types with local NVMe SSDs. It results in instant high iowait and a simple fio load test getting stuck when following the steps shown below. Please let us know what you think or if you need more help reproducing this issue.

== Hardware affected ==

GCP n2d-standard-2 machine type with a local NVMe SSD attached

== Minimal steps to reproduce the problem ==

echo 128 > /sys/block/nvme0n1/queue/nr_requests
fio --filename=/dev/nvme0n1 --size=12G --direct=1 --rw=randrw \
    --bs=4k --ioengine=libaio --iodepth=256 --numjobs=4 \
    --name=iops-test-job --eta-newline=1

== Expected result ==

- The fio benchmark finishes in 2 minutes.
- While fio is running, Ctrl+C can stop fio and CPU usage drops immediately.

== Actual result ==

- The fio benchmark never finishes, with the ETA continuing to increase.
- During the test, high iowait (50% or 99%) according to "mpstat 1".
- Ctrl+C and "kill -9 fio" cannot stop the task.
- A graceful reboot may not work. The only reliable way to reboot is running "echo b > /proc/sysrq-trigger".

== Unaffected kernel version ==

- v5.15 (8bb7eca972ad531c9b149c0a51ab43a417385813)

== Affected kernel versions ==

Running the above steps reproduces the issue with the following kernel versions:

- v5.16 (df0cc57e057f18e44dac8e6c18aba47ab53202f9)
- v5.17.12
- v5.18.15
- v6.0-rc1-160-g4c2d0b039c5c

== Problematic commit ==

Commit 4f5022453acd0f7b28012e20b7d048470f129894 (nvme: wire up completion batching for the IRQ path) introduced the problem and was merged in 33c8846c814c1c27c6e33af005042d15061f948b (Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block). After reverting 4f502245, the issue is no longer reproducible.

== Additional information - nr_requests and iodepth correlation ==

If /sys/block/nvme0n1/queue/nr_requests is set to values >= 256 or <= 8, the issue is very unlikely to materialize. When nr_requests is set to values between 16 and 128, one can observe a correlation between fio's "--iodepth" argument and reproducing the issue, as shown in the table below:

 nr_requests | iodepth     | reproducible? | Note
-------------+-------------+---------------+-----------------
 1023        | up to 16384 | no            | default nr_requests
 512         | up to 16384 | no            |
 256         | up to 16384 | no            |
 128         | <= 64       | no            |
 128         | >= 128      | yes           |
 64          | <= 16       | no            |
 64          | >= 32       | yes           |
 32          | <= 8        | no            |
 32          | >= 16       | yes           |
 16          | <= 8        | no            |
 16          | >= 16       | yes           |
 8           | up to 16384 | no            |
 4           | up to 16384 | no            |

== Detailed steps to reproduce the issue ==

1. In the GCP console (https://console.cloud.google.com/compute/instancesAdd), start a VM with
   Region: us-west1
   Series: N2D
   Machine type: n2d-standard-2
   In "Boot disk", click "change"
     OS: "Fedora"
     Version: "Fedora Cloud Base 35 35-20211026.0"
     Click "select"
   In "advanced options" -> "Disks" -> "Add local SSD"
     Interface: NVMe
     Disk capacity: 375GB (1 * 375GB)
     Click "save"
   Click "Create"

2. SSH to the VM, update the kernel to 5.18.15

3. Issue the following commands

   sudo -s
   dnf -y install fio
   echo 128 > /sys/block/nvme0n1/queue/nr_requests
   fio --filename=/dev/nvme0n1 --size=12G --direct=1 --rw=randrw \
       --bs=4k --ioengine=libaio --iodepth=256 --numjobs=4 \
       --name=iops-test-job --eta-newline=1

== Additional information - tested Linux distributions and kernels ==

Kernels unaffected:
* v5.14.10 on Fedora 35 (5.14.10-300.fc35.x86_64)
* 5.15.0-1016-gcp on Ubuntu 22.04

Kernels affected:
* 5.18.15 (Fedora 35, 5.18.15-100.fc35.x86_64)
* v6.0-rc1-160-g4c2d0b039c5c (Fedora 35, installing 6.0.0-0.rc1.20220819git4c2d0b039c5c.16.vanilla.1.fc35.x86_64 from the kernel-vanilla-mainline repo)
* 5.18.17 (Ubuntu 22.04, installing https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.18.17/amd64/ )

== Additional information - tested GCP machine types ==

GCP machine types unaffected:
* n1-standard-1 (1 CPU)

GCP machine types affected:
* n1-standard-2 (2 CPU)
* n2-standard-2 (2 CPU)
* n2d-standard-2 (2 CPU)

Note: adding maxcpus=1 to the kernel command line on affected GCP machine types does not change the result.
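For completeness, the nr_requests/iodepth table above was collected by stepping through combinations roughly as in the sketch below (a hypothetical loop, not our exact script; note that once a combination reproduces the problem, fio cannot be killed, so the loop simply stalls there and the machine needs a sysrq reboot before testing the next entry):

  # step through fio iodepth values at a fixed nr_requests and watch for the hang
  echo 128 > /sys/block/nvme0n1/queue/nr_requests
  for depth in 8 16 32 64 128 256; do
      echo "=== nr_requests=128 iodepth=$depth ==="
      fio --filename=/dev/nvme0n1 --size=12G --direct=1 --rw=randrw \
          --bs=4k --ioengine=libaio --iodepth="$depth" --numjobs=4 \
          --name=iops-test-job --eta-newline=1
  done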
You should try adding this patch:

https://lore.kernel.org/lkml/20220803121504.212071-1-yukuai1@huaweicloud.com/

Or just use axboe's 6.1 block tree:

https://git.kernel.dk/cgit/linux-block/log/?h=for-6.1/block
Hello. I have tested axboe's 6.1 block tree:

https://git.kernel.dk/cgit/linux-block/log/?h=for-6.1/block

The issue is still reproducible.

I also tried "[PATCH] sbitmap: fix possible io hung due to lost wakeup"

https://lore.kernel.org/lkml/20220803121504.212071-1-yukuai1@huaweicloud.com/

based on
- 5.18.15 (Fedora 35)
- v6.0-rc2-16-g072e51356cd5 (072e51356cd5a4a1c12c1020bc054c99b98333df from https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)

The issue is still reproducible.

Thank you.
Thanks for confirming. I think I know what's going on now. Your requests/depth table proved to be useful, and I'll send a proposal shortly.
Created attachment 301644 [details]
proposed fix

Try this patch. It should fix the issue.
Hi Keith,

Thank you. Your patch works!

I've tested the following combinations:

Based on
- 5.18.15
- 5.18.19
then apply "[PATCH] sbitmap: fix possible io hung due to lost wakeup"
  https://lore.kernel.org/lkml/20220803121504.212071-1-yukuai1@huaweicloud.com/
then apply attachment 301644 [details]

I also tried
  https://git.kernel.dk/cgit/linux-block/log/?h=for-6.1/block
then apply attachment 301644 [details]

The issue is not reproducible.

I also tried nr_requests 32, 64, 128 with fio iodepth up to 16384; the issue is not reproducible.
Thanks, I'll work with the community on getting something like this applied. The proposal attached has an unlikely race condition, so I have to make some adjustments. The proposal should be fine for nvme since it runs interrupt completions single threaded, but I have to make it thread safe since this is a generic library.
We built a kernel with

https://git.kernel.dk/cgit/linux-block/patch/?id=16ede66973c84f890c03584f79158dd5b2d725f5

(which is v3 of your patch, "[PATCHv3] sbitmap: fix batched wait_cnt accounting" from https://lore.kernel.org/r/20220825145312.1217900-1-kbusch@fb.com), but we ended up with soft lockups when setting /sys/block/nvme0n1/queue/nr_requests to low values like 128 (maybe also with default settings, but we didn't try).

Now I built a kernel with v2 of the patch ("[PATCHv2] sbitmap: fix batched wait_cnt accounting" from https://lore.kernel.org/all/20220824201440.127782-1-kbusch@fb.com/), and this one does not experience the lockups for low (128) or default settings of /sys/block/nvme0n1/queue/nr_requests. So something changed between patch v2 and v3.

From dmesg we can see that, when running a kernel with v3 of the patch, we get stuck in sbitmap_queue_wake_up when called from blk_mq_get_tag:

[Sat Sep 3 21:05:44 2022] NMI watchdog: Watchdog detected hard LOCKUP on cpu 5
[Sat Sep 3 21:05:44 2022] RIP: 0010:sbitmap_queue_clear_batch+0xbb/0x100
...
[Sat Sep 3 21:05:44 2022] Call Trace:
[Sat Sep 3 21:05:44 2022]  <IRQ>
[Sat Sep 3 21:05:44 2022]  blk_mq_end_request_batch+0x383/0x440
[Sat Sep 3 21:05:44 2022]  ? enqueue_entity+0x1a1/0x520
[Sat Sep 3 21:05:44 2022]  nvme_irq+0x5f/0x70 [nvme]
[Sat Sep 3 21:05:44 2022]  ? nvme_resume+0x70/0x70 [nvme]
...
[Sat Sep 3 21:05:53 2022] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [dmcrypt_write/2:8278]
...
[Sat Sep 3 21:05:53 2022] Call Trace:
[Sat Sep 3 21:05:53 2022]  <TASK>
[Sat Sep 3 21:05:53 2022]  sbitmap_queue_wake_up+0x11/0x20
[Sat Sep 3 21:05:53 2022]  blk_mq_get_tag+0x18e/0x2a0
[Sat Sep 3 21:05:53 2022]  ? dequeue_task_stop+0x70/0x70
...
[Sat Sep 3 21:06:21 2022] Call Trace:
[Sat Sep 3 21:06:21 2022]  <TASK>
[Sat Sep 3 21:06:21 2022]  sbitmap_queue_wake_up+0x11/0x20
[Sat Sep 3 21:06:21 2022]  blk_mq_get_tag+0x18e/0x2a0
Yah, Jens tells me the same thing. I'll fix it up on Tuesday.
v5 of your patch works for us now, thanks!

We built the kernel with the following commits from https://git.kernel.dk/cgit/linux-block/log/?h=for-6.1/block:

- https://git.kernel.dk/cgit/linux-block/patch/?id=040b83f (sbitmap: fix possible io hung due to lost wakeup)
- https://git.kernel.dk/cgit/linux-block/patch/?id=48c0333 (sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up())
- https://git.kernel.dk/cgit/linux-block/patch/?id=4acb834 (sbitmap: fix batched wait_cnt accounting)
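In case it helps anyone else, the three patches can be pulled from cgit and applied onto a base tree roughly like this (a sketch; cgit's patch/?id= URLs return mbox-formatted patches that git am accepts, though depending on the base some of them may need manual conflict resolution, as noted below about the backport not being clean):

  # apply the three sbitmap fixes from the linux-block tree onto the current branch
  for id in 040b83f 48c0333 4acb834; do
      curl -sS "https://git.kernel.dk/cgit/linux-block/patch/?id=$id" | git am
  done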
Awesome, thank you for following the git tree and confirming. This isn't a clean backport, so I'll be sure to follow up with stable on that after the 6.1 merge window opens.