Bug 117051
| Summary: | Very slow discard (trim) with mdadm raid0 array | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Park Ju Hyung (qkrwngud825) |
| Component: | MD | Assignee: | io_md |
| Status: | NEW --- | | |
| Severity: | high | CC: | aleksey.obitotskiy, holger.kiehl, matt, neilb, qkrwngud825, samm, shli, snitzer, tonyluzhigang |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.4-rc1 and all above | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | blktrace md raid0 fstrim of two NVMe PCIe SSD's; Kernel config | | |
Description
Park Ju Hyung
2016-04-24 11:43:43 UTC
Is it possible you can test the trim speed of a single disk? Say, create a fs on sdc1/sdb1 and compare fstrim speed on 4.3/4.4. If that is not possible, can you capture a blktrace log in the raid0 test?

(In reply to Shaohua Li from comment #1)
> is it possible you can test the trim speed of single disk? say create a fs
> in sdc1/sdb1 and compare fstrim speed of 4.3/4.4. if it is not possible, can
> you capture blktrace log in the raid0 test?

I believe the issue is in mdadm. It's all fine on a single-disk setup.

/dev/sdc1 on /tmp/mount type ext4 (rw,relatime,data=ordered)

$ cat /sys/block/sdc/queue/discard*
discard_granularity - 512
discard_max_bytes - 2147450880
discard_max_hw_bytes - 2147450880

4.4.0
$ fstrim -v /tmp/mount
/tmp/mount: 234.6 GiB (251844898816 bytes) trimmed
real 0m1.382s
user 0m0.000s
sys 0m0.032s

4.3.0
$ fstrim -v /tmp/mount
/tmp/mount: 234.6 GiB (251844898816 bytes) trimmed
real 0m3.412s
user 0m0.041s
sys 0m0.000s

I haven't looked at this in any detail, but the last time this sort of thing came up it was because the md array broke a very large "Discard" into chunk-sized pieces, and the component devices then didn't merge their share back together again but handled them one at a time. If that is what is happening here, then I think the correct fix would be to get the merging of discards in the member device to work properly. An interesting experiment would be to see whether doubling the chunk size halves the total time for the discard (as it would halve the total number of chunks).

I got the reason. The recent arbitrary-size bio patch breaks bio merging for stacked block devices. When an upper-layer bio comes in, blk_queue_split sets REQ_NOMERGE on the bio and splits it, so the bio can't be merged when we dispatch it to the lower-layer disk. I'm cooking patches.

I'm willing to test-drive a patch. Let me know.

Yup, that fixed it. Thanks!

$ time fstrim -v /media/androidbuild
/media/androidbuild: 341.4 GiB (366532702208 bytes) trimmed
real 0m14.908s
user 0m0.002s
sys 0m1.927s

Created attachment 214701 [details]
blktrace md raid0 fstrim of two NVMe PCIe SSD's
The patch does work for two Samsung 850 Pro SATA SSDs in a raid0, but not for two Intel P3700 1.6TB NVMe PCIe SSDs in a raid0. An fstrim here still takes more than 4 hours. For the Intel SSDs I used no partitions, so the raid0 is across the whole disks. On the two Samsung 850 Pro SATA SSDs, however, I did use partitions.
So, as Shaohua Li asked, I did a blktrace for the Intel SSD case using the whole disks.
Regards,
Holger
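In case anyone else wants to collect the same data, here is a minimal sketch of such a capture. The device name, output basename and mount point below are examples, not taken from this report:

```
# Terminal 1: trace block-layer events on one raid member (example device name)
blktrace -d /dev/nvme0n1 -o nvme0n1

# Terminal 2: trigger the discards (example mount point)
time fstrim -v /media/array

# After stopping blktrace with Ctrl-C, decode the trace and look at how many
# discard requests are issued and whether they get merged before dispatch
blkparse -i nvme0n1 | less
```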
The discard is always dispatched from a workqueue, which is weird; it seems it doesn't go through the plug path at all. Did you set /sys/block/nvme0n1/queue/nomerges?

No, I did not set it. But it is set, see:

cat /sys/block/nvme[01]n1/queue/nomerges
2
2

I googled a bit and it looks as if this is always set for all nvme devices. See https://lists.gnu.org/archive/html/qemu-block/2015-06/msg00007.html

Setting this to 1 for both SSDs, it is now much, much better! Watching with dstat, it is now sometimes writing at approx. 4000 MB/s. Before, it was always just doing a constant 128 MB/s. It now only took approx. 20 minutes. Is it safe to set nomerges to 1? Why does the kernel set this to 2 by default? I wonder whether I can now enable the discard mount option for ext4 again. Performance here was so bad with discard set that I had to disable it.

Which kernel are you using? I didn't see the latest kernel set nomerges for nvme by default.

Created attachment 214961 [details]
Kernel config
That is a plain 4.4.8 from kernel.org. I did check the /etc directory to see whether it is set somewhere, but that is not the case. The distro I am using is Scientific Linux 7.2 (Red Hat).
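For anyone checking their own setup, a minimal sketch of inspecting and relaxing this setting at runtime follows. The device names and the udev rule file are examples only; per the kernel's queue-sysfs documentation, 0 enables all merge heuristics, 1 keeps only the simple one-hit merges, and 2 disables merging entirely.

```
# Check the current merge policy on each raid member (example device names)
cat /sys/block/nvme0n1/queue/nomerges /sys/block/nvme1n1/queue/nomerges

# Re-enable request merging on the running system (not persistent across reboots)
echo 0 > /sys/block/nvme0n1/queue/nomerges
echo 0 > /sys/block/nvme1n1/queue/nomerges

# A udev rule is one way to make it persistent; the file name and match below
# are only an illustration, adjust for the actual devices:
# /etc/udev/rules.d/60-nomerges.rules
#   ACTION=="add|change", KERNEL=="nvme[0-9]n1", ATTR{queue/nomerges}="0"
```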
I encountered the same issue with raid10 on the long-term Linux 3.10.101. And I noticed that the difference between softraid devices and the SSD is discard_max_bytes.

[root~]# cat /sys/block/md0/queue/discard_max_bytes
524288  ----------------> softraid chunk size
[root~]# cat /sys/block/sda/queue/discard_max_bytes
2147450880

I wonder if there is any way to make the bio mergeable for Linux 3.10.

(In reply to Zhigang Lu from comment #12)
> I encountered the same issue with raid10 on the long-term Linux 3.10.101.
> And I noticed that the difference between softraid devices and SSD is
> discard_max_bytes.
>
> [root~]# cat /sys/block/md0/queue/discard_max_bytes
> 524288 ---------------->softraid chunk size
> [root~]# cat /sys/block/sda/queue/discard_max_bytes
> 2147450880
>
> I wonder if there is any way to make the bio mergeable for Linux 3.10.

This issue (for 4.4) was introduced recently. I don't think the 3.10 issue is the same.

(In reply to h_o_l_g_e_r from comment #11)
> Created attachment 214961 [details]
> Kernel config
>
> That is a plain 4.4.8 from kernel.org. I did check /etc direcory if some
> where it is set, but that is not the case. Distro I am using is Scientific
> Linux 7.2 (Redhat).

I think something in your distro changes the nomerges setting. The kernel should have it at 0 by default as far as I can check. My test distro (Ubuntu) doesn't change nomerges to 2.

We've noticed device dropouts from all our RAID arrays on kernel-ml-4.5.4-1 on CentOS 7. These appeared to start occurring around the time that kernel 4.4 was released.

The block layout for each disk is as such:

sda                 0   512B    4G  1
└─sda1              0   512B    4G  1
  └─md100           0   512B  512K  1
    └─one-r0        0   512B  512K  1
      └─drbd0       0   512B  512K  1

So, device -> partition -> MDADM -> LVM -> DRBD. From there we host an iSCSI LUN using LIO-t to our virtualisation hosts, which do not support or issue discards/TRIM at present.

Discard is enabled in:
- /etc/fstab
- /etc/lvm/lvm.conf

When we see the dropout it's during the standard mdadm cron checks. We have just applied an increased timeout as suggested here: https://blog.fastmail.com/2010/08/18/scsi-hbas-raid-controllers-and-timeouts/

FYI - when we see the disks dropping out we get errors such as:

```
[Sun May 15 13:48:07 2016] sd 0:0:2:0: attempting task abort! scmd(ffff8804621da180)
[Sun May 15 13:48:07 2016] sd 0:0:2:0: [sdc] tag#35 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[Sun May 15 13:48:07 2016] scsi target0:0:2: handle(0x000c), sas_address(0x500304801eccd382), phy(2)
[Sun May 15 13:48:07 2016] scsi target0:0:2: enclosure_logical_id(0x500304801eccd3bf), slot(2)
[Sun May 15 13:48:07 2016] scsi target0:0:2: enclosure level(0x0000),connector name(     )
[Sun May 15 13:48:08 2016] sd 0:0:2:0: task abort: SUCCESS scmd(ffff8804621da180)
[Sun May 15 13:48:08 2016] sd 0:0:2:0: [sdc] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[Sun May 15 13:48:08 2016] mpt3sas_cm0: sas_address(0x500304801eccd382), phy(2)
[Sun May 15 13:48:08 2016] mpt3sas_cm0: enclosure_logical_id(0x500304801eccd3bf),slot(2)
[Sun May 15 13:48:08 2016] mpt3sas_cm0: enclosure level(0x0000), connector name(     )
[Sun May 15 13:48:08 2016] mpt3sas_cm0: handle(0x000c), ioc_status(success)(0x0000), smid(107)
[Sun May 15 13:48:08 2016] mpt3sas_cm0: request_len(0), underflow(0), resid(-65536)
[Sun May 15 13:48:08 2016] mpt3sas_cm0: tag(65535), transfer_count(65536), sc->result(0x00000000)
[Sun May 15 13:48:08 2016] mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[Sun May 15 13:48:08 2016] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
[Sun May 15 13:48:08 2016] blk_update_request: I/O error, dev sdc, sector 2064
[Sun May 15 13:48:08 2016] md: super_written gets error=-5
[Sun May 15 13:48:08 2016] md/raid10:md100: Disk failure on sdc1, disabling device.
[Sun May 15 13:48:08 2016] md/raid10:md100: Operation continuing on 7 devices.
[Sun May 15 13:48:08 2016] md: md100: data-check interrupted.
```

Sam, please open a new ticket for your issue and do not report unrelated issues here.

Park Ju Hyung, can you close this bug please? The patch is in upstream already.

(In reply to Zhigang Lu from comment #12)
> I encountered the same issue with raid10 on the long-term Linux 3.10.101.
> And I noticed that the difference between softraid devices and SSD is
> discard_max_bytes.
>
> [root~]# cat /sys/block/md0/queue/discard_max_bytes
> 524288 ---------------->softraid chunk size
> [root~]# cat /sys/block/sda/queue/discard_max_bytes
> 2147450880
>
> I wonder if there is any way to make the bio mergeable for Linux 3.10.

Proposed patch enables DISCARD merging. Created and tested against 3.10.0-327.
Some of the changes are already in upstream: e548ca4ee 9c573de32 03100aada ef2d4615c

+++ linux-3.10.0-327.el7/include/linux/blkdev.h	2016-05-19 10:23:04.828316387 -0400
@@ -969,7 +969,7 @@ static inline unsigned int blk_rq_get_ma
 	if (unlikely(rq->cmd_type == REQ_TYPE_BLOCK_PC))
 		return q->limits.max_hw_sectors;
-	if (!q->limits.chunk_sectors)
+	if (!q->limits.chunk_sectors || (rq->cmd_flags & REQ_DISCARD))
 		return blk_queue_get_max_sectors(q, rq->cmd_flags);
 	return min(blk_max_size_offset(q, blk_rq_pos(rq)),
--- linux-3.10.0-327.el7-orig/block/blk-merge.c	2015-10-29 16:56:51.000000000 -0400
+++ linux-3.10.0-327.el7/block/blk-merge.c	2016-05-23 05:48:29.816414501 -0400
@@ -271,6 +271,9 @@ static inline int ll_new_hw_segment(stru
 {
 	int nr_phys_segs = bio_phys_segments(q, bio);
+	if (bio->bi_rw & REQ_DISCARD && nr_phys_segs == 0)
+		nr_phys_segs = 1;
+
 	if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q))
 		goto no_merge;
--- linux-3.10.0-327.el7-orig/drivers/md/md.c	2015-10-29 16:56:51.000000000 -0400
+++ linux-3.10.0-327.el7/drivers/md/md.c	2016-05-17 02:57:34.735055093 -0400
@@ -283,6 +283,8 @@ static void md_make_request(struct reque
 	 * go away inside make_request
 	 */
 	sectors = bio_sectors(bio);
+	/* bio could be mergeable after passing to underlayer */
+	bio->bi_rw &= ~REQ_NOMERGE;
 	mddev->pers->make_request(mddev, bio);
 	cpu = part_stat_lock();
--- linux-3.10.0-327.el7-orig/drivers/block/nvme-core.c	2015-10-29 16:56:51.000000000 -0400
+++ linux-3.10.0-327.el7/drivers/block/nvme-core.c	2016-05-23 05:37:44.780382376 -0400
@@ -2015,9 +2015,10 @@ static void nvme_alloc_ns(struct nvme_de
 	ns->queue = blk_mq_init_queue(&dev->tagset);
 	if (IS_ERR(ns->queue))
 		goto out_free_ns;
-	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
-	queue_flag_set_unlocked(QUEUE_FLAG_SG_GAPS, ns->queue);
 	ns->dev = dev;
 	ns->queue->queuedata = ns;
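As a closing aside (not from the original thread): a rough way to check whether a given kernel still has the problem is to time a full fstrim on a scratch array, along the lines of the transcripts above. All device names, the chunk size and the mount point below are examples.

```
# Build a small raid0, put ext4 on it, mount it and time a full trim
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512 /dev/sdb1 /dev/sdc1
mkfs.ext4 /dev/md0
mkdir -p /mnt/test && mount /dev/md0 /mnt/test

# The md device may still advertise a small per-bio discard limit (roughly the
# chunk size); what the fix changes is whether the split discards are merged
# again below md before being issued to the member devices.
cat /sys/block/md0/queue/discard_max_bytes
cat /sys/block/sdb/queue/discard_max_bytes

# On an affected kernel this can take minutes to hours; with the fix it should
# drop back to seconds, as in the reporter's transcripts.
time fstrim -v /mnt/test
```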