|Summary:||blk-mq / write back cache massive performance regression on SATA HDDs|
|Product:||IO/Storage||Reporter:||Enrico Tagliavini (enrico.tagliavini)|
|Component:||SCSI||Assignee:||Jens Axboe (axboe)|
|Severity:||high||CC:||germano.massullo, jasona99, nemesis, snitzer, s_chriscollins, tod.jackson, tom.leiming|
|Kernel Version:||5.0 - 5.9||Tree:||Mainline|
full dmesg output
htop screenshot while reproducing the problem
strace -f -vtTT of dolphin
Reduce sbitmap wait queues
Description Enrico Tagliavini 2019-07-21 12:05:08 UTC
Created attachment 283877 [details] full dmesg output

After the switch to blk-mq, the SATA HDD in my laptop, a Dell Inspiron 15 7577, became almost unusable. Throughput is fine with default settings, but request latency is unbearable, up to 90 seconds even, making the computer almost unusable. Latency spikes happen more often when there are background write requests. Switching off blk-mq for scsi_mod used to mitigate the issue, but this is no longer possible as the old block queue system has now been removed, making this laptop an expensive paper holder.

How to reproduce:
1. First some info about the config (see also attachments): the laptop has full disk encryption via dm-crypt/LUKS, on top of which there are LVM volumes. The OS is installed on an NVMe disk and has no problem. /home is on the SATA HDD.
2. Start a background write: dd if=/dev/zero of=test.10g bs=1M count=10240 conv=fdatasync
3. Try to use any desktop application. I used dolphin for my experiments as it's quite easy to see the issue with it.

Actual result: the dolphin window border shows up, but the window hangs and does not render / work until the background write is finished (minutes later).

Expected result: dolphin should work with an acceptable performance penalty, as was the case before the introduction of blk-mq.

A few interesting facts I noticed. Looking in htop I see the kcryptd kworker threads using a significant amount of CPU, which I don't see on my other systems with similar configurations. See the screenshot attached. Lowering vm.dirty_bytes partially helps: the laptop is still not smooth at all, but it becomes much better; latency goes down by about an order of magnitude for the dolphin use case, but it doesn't help all use cases (e.g. Firefox is still slow to the point of not being usable). I used the following command: echo 10000000 > /proc/sys/vm/dirty_bytes

I've tried to play around with a lot of options in sda/queue/ but nothing worked.
I tried changing the scheduler, including bfq (which is supposed to share the bandwidth fairly) and tuning its settings, with no change. I think the problem is not the scheduler itself. I also tried changing wbt_lat_usec (which is -1 by default for dm devices) and that changed nothing either. I'll attach an htop screenshot and a dolphin strace showing the very long poll() system calls. I'd appreciate any help on this issue since the computer is now unusable. Thank you. Kind regards.
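For anyone wanting to quantify the stalls rather than eyeball dolphin, the dd reproducer above can be paired with a small probe. This is a hypothetical sketch, not from the report; the target path and iteration count are placeholders. Run it against a file on the affected HDD while the background dd is going:

```shell
#!/bin/sh
# Hypothetical probe (not from the report): times N tiny synchronous writes
# to make the latency spikes described above measurable. The target path
# and count passed at the bottom are examples only.
probe_latency() {
    target=$1
    n=$2
    i=1
    while [ "$i" -le "$n" ]; do
        s=$(date +%s%N)                       # nanoseconds before the write
        dd if=/dev/zero of="$target" bs=4k count=1 oflag=sync 2>/dev/null
        e=$(date +%s%N)                       # nanoseconds after the write
        echo "write $i: $(( (e - s) / 1000000 )) ms"
        i=$(( i + 1 ))
    done
    rm -f "$target"
}

probe_latency /tmp/probe.$$ 5
```

On a healthy system each synced 4k write should complete in a few tens of milliseconds; multi-second results while dd runs would match the behaviour described above.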
Comment 1 Enrico Tagliavini 2019-07-21 12:06:17 UTC
Created attachment 283879 [details] htop screenshot while reproducing the problem
Comment 2 Enrico Tagliavini 2019-07-21 12:12:42 UTC
Created attachment 283881 [details] strace -f -vtTT of dolphin

You can use something like the following to find the very long poll() calls:

grep poll dolphin.strace | grep -o '<[0-9.]*>' | tr -d '<>' | sort -n
Comment 3 Enrico Tagliavini 2019-07-21 12:13:02 UTC
Created attachment 283883 [details] lsblk output
Comment 4 Enrico Tagliavini 2019-08-11 10:36:59 UTC
Situation is unchanged with kernel 5.2
Comment 5 S. Christian Collins 2019-08-15 13:53:43 UTC
I have also noticed a big difference in HDD performance after switching to the multiqueue scheduler, particularly when making a copy of a large file on the same drive. I tested this with a 3G video file on two different hard drives by running the following command:

cp trimmed.mkv trimmed2.mkv

Here are the results:

Kernel 5.0.0-25-generic using [mq-deadline] multi-queue scheduler:
. HDD1: 49 seconds
. HDD2: 1 minute, 21 seconds

Kernel 4.15.0-58-generic using [cfq] single-queue scheduler:
. HDD1: 38 seconds
. HDD2: 1 minute, 2 seconds

Kernel 4.15.0-58-generic using [none] multi-queue scheduler:
. HDD1: 54 seconds
. HDD2: 1 minute, 25 seconds

You can see that the cfq single-queue scheduler is roughly 30 percent faster than the mq-deadline multi-queue scheduler when copying files on the same drive. Since the first two results also came from different kernels, I ran a third test with multi-queue scheduling enabled on the 4.15 kernel, though the only scheduler option there was [none]. The resulting times still show the scheduler to be the deciding factor here.

** My System **
OS: KDE Neon 5.16 64-bit (Plasma Desktop 5.16.4, KDE Frameworks 5.61.0, Qt 5.12.3)
Motherboard: ASRock X58 Extreme3 (Intel X58 chipset)
CPU: Intel Core i7-990x Bloomfield (3.46 GHz hexa-core, Socket 1366)
RAM: 12GB DDR3
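The comparison above can be scripted so each scheduler is timed the same way. This is a hypothetical sketch, not from the comment: the device name, file names, and the root-only cache drop are assumptions. The helper below only parses the sysfs scheduler file; the actual loop is left commented since it needs root:

```shell
#!/bin/sh
# Hypothetical helper (not from the report): prints the schedulers a queue
# offers, stripping the [brackets] that mark the currently active one.
# $1 is a path like /sys/block/sda/queue/scheduler.
list_schedulers() {
    tr -d '[]' < "$1"
}

# Example comparison loop (needs root; same-drive copy as in the timings
# above -- the device and file names are placeholders):
#   for s in $(list_schedulers /sys/block/sda/queue/scheduler); do
#       echo "$s" > /sys/block/sda/queue/scheduler
#       sync; echo 3 > /proc/sys/vm/drop_caches    # start each run cold
#       echo "== $s =="; time cp trimmed.mkv trimmed2.mkv
#       rm -f trimmed2.mkv
#   done
```

Dropping caches between runs matters here, since a warm page cache would hide most of the difference between schedulers.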
Comment 6 Enrico Tagliavini 2019-08-24 19:07:09 UTC
Can reproduce on my Alienware Aurora R8 desktop computer. On this one only the backups are on the SATA HDD (luckily for me, or this would be a brick too). In this case the test can be VLC playing a 60 fps full HD video from the HDD while writing a lot of data to it. The video is encoded in H.265 and very small, 34.5 MB (it's short). If not already in cache it takes 10-20 seconds to start and hangs multiple times during normal playback. Again kcryptd behaves very weirdly. So what I said in the original description about not seeing this weird CPU usage by kcryptd on other systems is no longer valid: I can see it here too; I probably just had to push for a bit longer since this is a much more powerful system. This needs an urgent fix; using disk encryption for /home right now is simply impossible on HDDs.
Comment 7 tod.jackson 2019-11-09 10:25:13 UTC
I have a similar laptop to the Dell mentioned here, an Inspiron 15 Gaming, not sure of the 7*** model number. I have to use an older kernel with elevator=cfq and scsi_mod.use_blk_mq=0 or I get major UI and mouse unresponsiveness. I tried kyber and the other new options and they were very problematic. It's just an ext4 partition with no RAID setup or anything complicated.
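The boot parameters mentioned above can be made persistent via the bootloader. A hypothetical /etc/default/grub fragment (the quiet/splash options are placeholders, and this only works on kernels that still ship the legacy block layer, i.e. before 5.0, since later kernels removed scsi_mod.use_blk_mq and cfq entirely):

```shell
# /etc/default/grub -- hypothetical fragment matching the workaround above;
# effective only on kernels that still include the legacy block layer (< 5.0).
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash scsi_mod.use_blk_mq=0 elevator=cfq"
```

Run update-grub (Debian/Ubuntu) or grub2-mkconfig -o /boot/grub2/grub.cfg (Fedora and others) afterwards for the change to take effect.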
Comment 8 tod.jackson 2019-11-10 21:39:22 UTC
It's probably not a coincidence that only Dell computers are mentioned here (Alienware is really Dell as far as I know). I'll include lspci output below; maybe we have some hardware component in common causing problems. An easy way to make my laptop nearly unusable with the new schedulers is to do something like extracting a large .iso to the same drive with p7zip. Also note that I don't use disk encryption, unlike those above.

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 05)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 05)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 05)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:15.0 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #0 (rev 31)
00:15.1 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Serial IO I2C Controller #1 (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 RAID bus controller: Intel Corporation 82801 Mobile SATA Controller [RAID mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #6 (rev f1)
00:1f.0 ISA bridge: Intel Corporation HM175 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GP107M [GeForce GTX 1050 Mobile] (rev a1)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
03:00.0 Network controller: Intel Corporation Wireless 3165 (rev 79)
Comment 9 Jason Ashley 2019-11-18 19:57:31 UTC
I would like to chime in that I can also reproduce this issue in 5.3.10. Once again, it's a Dell machine. Latitude E7450. Especially noticeable on high bitrate video playback or extracting large files, or after continued use as RAM fills, as mentioned in previous posts. The command in the original post brings my laptop to a crawl as well. The disk is not encrypted, unlike most of the above.
Comment 10 Enrico Tagliavini 2020-02-15 11:38:38 UTC
Switched the component to SCSI since, based on the other comments, this is not related to encryption or LVM/DM as I initially thought, and changing an option for the scsi_mod module (on older versions providing such an option) would mitigate the problem. It is very concerning that nobody has cared to leave a comment on this; reproducing should be fairly trivial too. Please note: SATA SSDs are also affected. I was able to reproduce on those as well; however, being so fast, the effects are not as bad and might not be noticed by the user. Also, recent kernels (5.3 and later, possibly some 5.2 bugfix release) are a bit better in day-to-day use for me. However, as soon as Steam has a game update the machine is again slowed down terribly. Not as bad as before, but still very, very annoying.
Comment 11 tod.jackson 2020-02-15 16:10:13 UTC
It's a massive performance hit that needs addressing. I really can't use the new schedulers at all.
Comment 12 Mike Snitzer 2020-02-26 15:13:40 UTC
Check with Jens real quick, Ming: Jens is curious if your recent commit might help?

commit 01e99aeca3979600302913cef3f89076786f32c8
Author: Ming Lei <email@example.com>
Date: Tue Feb 25 09:04:32 2020 +0800

    blk-mq: insert passthrough request into hctx->dispatch directly
Comment 13 Lei Ming 2020-02-26 22:00:04 UTC
(In reply to Mike Snitzer from comment #12) > Check with Jens real quick, Ming: Jens is curious if your recent commit > might help? > > commit 01e99aeca3979600302913cef3f89076786f32c8 > Author: Ming Lei <firstname.lastname@example.org> > Date: Tue Feb 25 09:04:32 2020 +0800 > > blk-mq: insert passthrough request into hctx->dispatch directly

Hi Mike, no. This performance issue on HDD should be caused by killing ioc batching and BDI congestion; see another report:  https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.email@example.com/  https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/ Thanks, Ming
Comment 14 Jens Axboe 2020-02-27 02:38:07 UTC
Created attachment 287655 [details] Reduce sbitmap wait queues I wonder if something like this would make a difference.
Comment 15 Lei Ming 2020-02-27 07:49:56 UTC
Hi guys, please try Jens's patch in comment 14. If it still doesn't work, please collect a log with the following script while doing the slow write on the HDD. After the test is done, terminate the script via Ctrl+C and post the log here. First, you need to figure out the HDD's MAJOR and MINOR numbers via lsblk. Second, bcc has to be installed on your machine.

#!/bin/sh
MAJ=$1
MIN=$2
MAJ=$(( $MAJ << 20 ))
DEV=$(( $MAJ | $MIN ))
/usr/share/bcc/tools/trace -t -C \
    't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
    't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'
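The script filters tracepoints on a packed device number. A hypothetical helper (not part of Ming's script) showing the same (major << 20) | minor packing, taking the "MAJ:MIN" string exactly as lsblk prints it:

```shell
#!/bin/sh
# Hypothetical helper (not part of Ming's script): computes the packed value
# the bcc filter compares args->dev against, from the "MAJ:MIN" string shown
# by lsblk -o NAME,MAJ:MIN. Encoding matches the script above.
pack_dev() {
    maj=${1%%:*}       # text before the colon: major number
    min=${1##*:}       # text after the colon: minor number
    echo $(( (maj << 20) | min ))
}

pack_dev 8:0    # 8:0 is typically /dev/sda; prints 8388608
```

This makes it easy to pass the two arguments to the trace script as, e.g., ./trace.sh 8 0 for a device that lsblk lists as 8:0.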
Comment 16 S. Christian Collins 2020-03-27 16:41:20 UTC
Hi Lei, how would a regular KDE NEON user like me be able to test this patch? Does this involve compiling a kernel?
Comment 17 Ryan Underwood 2020-05-22 17:14:29 UTC
Is anyone here experiencing the regression on HDDs and _not_ using ext4? For those who are using ext4, have you tried the dioread_nolock mount option?
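For reference, dioread_nolock is set per filesystem at mount time. A hypothetical example, not from the comment (the UUID and mount point are placeholders); it can also be tried on a running system before committing it to fstab:

```shell
# Hypothetical /etc/fstab line -- UUID and mount point are placeholders:
#   UUID=xxxxxxxx-xxxx  /home  ext4  defaults,dioread_nolock  0  2
# It can also be tried live without a reboot (needs root):
#   mount -o remount,dioread_nolock /home
#   grep dioread_nolock /proc/mounts    # confirm the option took effect
```

Note the data-journaling caveat raised in the next comment before enabling this on a production system.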
Comment 18 S. Christian Collins 2020-05-29 13:53:23 UTC
I experience the regression also on my NTFS partitions (ntfs-3g). I'm a bit uncertain about trying dioread_nolock, as it appears to disable ext4 data journaling. Is this safe to do on my production system?
Comment 19 Enrico Tagliavini 2020-08-17 14:45:48 UTC
Hello there. Just want to say the bug is still fully there. My server is crawling compared to before and my laptop is almost an expensive paper holder since this change.
Comment 20 Ryan Underwood 2020-08-17 15:35:51 UTC
Enrico, which filesystem(s) are you using?
Comment 21 Enrico Tagliavini 2020-08-17 16:30:50 UTC
On the laptop ext4, on the server XFS.
Comment 22 Ryan Underwood 2020-10-24 02:10:57 UTC
I've published findings scoped to ext4 specifically: https://engineering.linkedin.com/blog/2020/fixing-linux-filesystem-performance-regressions The TL;DR is that kernels between 4.5 and 5.6 have some ext4 performance bugs which cause some nasty tail latency under the specific conditions described in the post. I hope this helps.
Comment 23 Enrico Tagliavini 2020-11-20 10:36:33 UTC
I switched one of my systems from XFS to ext4 and it's much better. I think XFS is much more aggressive in the way it pushes data down to the device, amplifying a problem that is actually in the layers underneath it in the first place. To be clear: ext4 is not immune from the issue, quite the opposite. When profiling it's quite easy to see, but the system doesn't freeze as badly and it's more acceptable to human perception. To be even more clear: this is not a file system issue; it started happening when Linux forced the switch to blk-mq. That is not to say the problem is strictly in blk-mq itself; it could also be in the surrounding changes made to accommodate it. As far as I can see this issue can happen with all kinds of SATA devices, including SSDs. However, SSDs being much faster, lower-latency devices, the issue is less likely to be perceived by the user.