Bug 117051

Summary: Very slow discard(trim) with mdadm raid0 array
Product: IO/Storage    Reporter: Park Ju Hyung (qkrwngud825)
Component: MD          Assignee: io_md
Status: NEW
Severity: high         CC: aleksey.obitotskiy, holger.kiehl, matt, neilb, qkrwngud825, samm, shli, snitzer, tonyluzhigang
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.4-rc1 and all above
Regression: No
Attachments: blktrace md raid0 fstrim of two NVMe PCIe SSD's
             Kernel config

Description Park Ju Hyung 2016-04-24 11:43:43 UTC
I'm currently running 2 SATA SSDs (Samsung 850 PRO) in an mdadm raid0 configured with 64k chunks.

On all Linux versions since 4.4, discard (trim) is very, very slow.
'fstrim -v' takes several minutes, compared to a few seconds on 4.3 (and below) kernels.

(I'm not entirely sure if this is actually *intended*.)

Note that these Samsung SSDs have queued TRIM disabled.

$ dmesg|grep 'ata5'
[    4.961746] ata5: SATA max UDMA/133 abar m2048@0xfb426000 port 0xfb426100 irq 50
[    5.288745] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    5.293536] ata5.00: supports DRM functions and may not be fully accessible
[    5.295628] ata5.00: disabling queued TRIM support
[    5.295629] ata5.00: ATA-9: Samsung SSD 850 PRO 256GB, EXM02B6Q, max UDMA/133
[    5.296641] ata5.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    5.299963] ata5.00: supports DRM functions and may not be fully accessible
[    5.301131] ata5.00: disabling queued TRIM support
[    5.301182] ata5.00: configured for UDMA/133

$ dmesg|grep 'ata6'
[    4.962857] ata6: SATA max UDMA/133 abar m2048@0xfb426000 port 0xfb426180 irq 50
[    5.285905] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    5.289693] ata6.00: supports DRM functions and may not be fully accessible
[    5.290662] ata6.00: READ LOG DMA EXT failed, trying unqueued
[    5.291610] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
[    5.291611] ata6.00: ATA-9: Samsung SSD 850 PRO 256GB, EXM01B6Q, max UDMA/133
[    5.292563] ata6.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    5.297710] ata6.00: supports DRM functions and may not be fully accessible
[    5.298816] ata6.00: failed to get NCQ Send/Recv Log Emask 0x1
[    5.298863] ata6.00: configured for UDMA/133

$ cat /proc/mdstat
Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid0 sdc1[1] sdb1[0]
      499853312 blocks super 1.2 64k chunks
      
unused devices: <none>

$ cat /sys/block/md0/queue/discard*
discard_granularity - 512
discard_max_bytes - 65536
discard_max_hw_bytes - 65536

Linux 4.3
$ time fstrim -v /media/androidbuild
/media/androidbuild: 343.5 GiB (368801468416 bytes) trimmed

real	0m12.085s
user	0m0.000s
sys	0m2.352s

Linux 4.4
$ time fstrim -v /media/androidbuild
/media/androidbuild: 343.5 GiB (368801468416 bytes) trimmed

real	5m44.257s
user	0m0.000s
sys	0m7.820s

Linux 4.6-rc4
$ time fstrim -v /media/androidbuild
/media/androidbuild: 343.5 GiB (368801468416 bytes) trimmed

real	5m45.644s
user	0m0.000s
sys	0m11.664s
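
For scale: with discard_max_bytes on md0 capped at the 64k chunk size, the ~343.5 GiB trimmed above works out to several million chunk-sized discard requests if nothing merges them back together:

$ echo $(( 368801468416 / 65536 ))
5627463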
Comment 1 Shaohua Li 2016-04-25 15:38:27 UTC
Is it possible for you to test the trim speed of a single disk? Say, create a fs on sdc1/sdb1 and compare the fstrim speed on 4.3/4.4. If that is not possible, can you capture a blktrace log in the raid0 test?
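
Something along these lines should be enough to capture it (device and mount point names taken from your report; stop the trace before running blkparse):

$ cd /tmp
$ blktrace -d /dev/sdb -d /dev/sdc &
$ time fstrim -v /media/androidbuild
$ kill %1                      # stop tracing
$ blkparse sdb sdc > raid0_trim.txt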
Comment 2 Park Ju Hyung 2016-04-25 16:56:00 UTC
(In reply to Shaohua Li from comment #1)
> Is it possible for you to test the trim speed of a single disk? Say, create a
> fs on sdc1/sdb1 and compare the fstrim speed on 4.3/4.4. If that is not
> possible, can you capture a blktrace log in the raid0 test?

I believe the issue is in md/mdadm.
It's all fine on a single-disk setup.

/dev/sdc1 on /tmp/mount type ext4 (rw,relatime,data=ordered)

$ cat /sys/block/sdc/queue/discard*
discard_granularity - 512
discard_max_bytes - 2147450880
discard_max_hw_bytes - 2147450880

4.4.0
$ time fstrim -v /tmp/mount
/tmp/mount: 234.6 GiB (251844898816 bytes) trimmed

real    0m1.382s
user    0m0.000s
sys     0m0.032s

4.3.0
$ time fstrim -v /tmp/mount
/tmp/mount: 234.6 GiB (251844898816 bytes) trimmed

real    0m3.412s
user    0m0.041s
sys     0m0.000s
Comment 3 Neil Brown 2016-04-25 20:52:29 UTC
I haven't looked in any detail here, but the last time this sort of thing came up, it was because the md array broke a very large "Discard" into chunk-sized pieces, and then the component devices didn't merge their shares back together again but handled them one at a time.

If that is what is happening here, then I think the correct fix would be to get the merging of discards in the member devices to work properly.

An interesting experiment would be to see if doubling the chunk size halved the total time for the discard (as it would halve the total number of chunks).
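
Roughly (note this recreates the array from scratch and destroys its contents, so only on a setup you can rebuild; device names taken from the report above):

$ mdadm --stop /dev/md0
$ mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=128 /dev/sdb1 /dev/sdc1
$ mkfs.ext4 /dev/md0
$ mount /dev/md0 /media/androidbuild
$ time fstrim -v /media/androidbuild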
Comment 4 Shaohua Li 2016-04-25 22:02:24 UTC
I found the cause. The recent arbitrary-size bio patch breaks bio merging for stacked block devices: when an upper-layer bio comes in, blk_queue_split sets REQ_NOMERGE on the bio and splits it, so the resulting bios can't be merged when we dispatch them to the underlying disks. I'm cooking patches.
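
A quick way to see the effect from userspace is to count how many requests a member device completes during the fstrim (field 5 of /sys/block/<dev>/stat; I'm assuming discards are accounted in the write counters on this kernel). Without merging it should be roughly one request per 64k chunk:

$ b=$(awk '{print $5}' /sys/block/sdb/stat)
$ time fstrim -v /media/androidbuild
$ a=$(awk '{print $5}' /sys/block/sdb/stat)
$ echo $(( a - b ))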
Comment 5 Park Ju Hyung 2016-04-25 22:56:20 UTC
I'm willing to test-drive a patch.
Let me know.
Comment 6 Park Ju Hyung 2016-04-26 00:22:41 UTC
Yup that fixed it.

Thanks!

$ time fstrim -v /media/androidbuild
/media/androidbuild: 341.4 GiB (366532702208 bytes) trimmed

real	0m14.908s
user	0m0.002s
sys	0m1.927s
Comment 7 h_o_l_g_e_r 2016-04-29 09:15:06 UTC
Created attachment 214701 [details]
blktrace md raid0 fstrim of two NVMe PCIe SSD's

The patch does work for 2 Samsung 850 Pro SATA SSD's in a raid0, but not for two Intel P3700 1.6TB NVMe PCIe SSD's in a raid0. An fstrim there still takes more than 4 hours. For the Intel SSD's I used no partitions, so the raid0 is across the whole disks. On the 2 Samsung 850 Pro SATA SSD's I did, however, use partitions.

So, as Shaohua Li asked, I did a blktrace for the Intel SSD case using the whole disks.

Regards,
Holger
Comment 8 Shaohua Li 2016-04-29 20:27:56 UTC
The discards are always dispatched from a workqueue, which is weird; it seems they don't go through the plug path at all. Did you set /sys/block/nvme0n1/queue/nomerges?
Comment 9 h_o_l_g_e_r 2016-04-30 17:24:14 UTC
No, I did not set it. But it is set, see:

cat /sys/block/nvme[01]n1/queue/nomerges
2
2

I googled a bit and it looks as if this is always set for all nvme devices. See

https://lists.gnu.org/archive/html/qemu-block/2015-06/msg00007.html

After setting this to 1 for both SSD's, it is now much, much better! Watching with dstat, it is now sometimes writing at approx. 4000 MB/s. Before, it was always just doing a constant 128 MB/s. The fstrim now only took approx. 20 minutes.

Is it safe to set nomerges to 1? Why does the kernel set this to 2 by default?

I wonder if I can now enable the discard mount option for ext4 again. Performance was so bad with discard enabled that I had to disable it.
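
For reference, this is all I changed (as I understand Documentation/block/queue-sysfs.txt: 0 allows all merging, 1 allows only simple one-hit merges, 2 disables merge lookups entirely; making it persistent across reboots would need e.g. a udev rule, not shown here):

$ echo 1 > /sys/block/nvme0n1/queue/nomerges
$ echo 1 > /sys/block/nvme1n1/queue/nomerges
$ cat /sys/block/nvme[01]n1/queue/nomerges
1
1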
Comment 10 Shaohua Li 2016-05-02 03:32:25 UTC
Which kernel are you using? I don't see the latest kernel setting nomerges for nvme by default.
Comment 11 h_o_l_g_e_r 2016-05-02 08:24:50 UTC
Created attachment 214961 [details]
Kernel config

That is a plain 4.4.8 from kernel.org. I checked the /etc directory to see if it is set somewhere there, but that is not the case. The distro I am using is Scientific Linux 7.2 (Red Hat).
Comment 12 Zhigang Lu 2016-05-04 10:06:39 UTC
I encountered the same issue with raid10 on the long-term 3.10.101 kernel. I noticed that the difference between the softraid device and the SSD is discard_max_bytes.

[root~]# cat /sys/block/md0/queue/discard_max_bytes 
524288   ---------------->softraid chunk size
[root~]# cat /sys/block/sda/queue/discard_max_bytes 
2147450880

I wonder if there is any way to make the bio mergeable for Linux 3.10.
Comment 13 Shaohua Li 2016-05-09 04:10:49 UTC
(In reply to Zhigang Lu from comment #12)
> I encountered the same issue with raid10 on the long-term 3.10.101 kernel. I
> noticed that the difference between the softraid device and the SSD is
> discard_max_bytes.
> 
> [root~]# cat /sys/block/md0/queue/discard_max_bytes 
> 524288   ---------------->softraid chunk size
> [root~]# cat /sys/block/sda/queue/discard_max_bytes 
> 2147450880
> 
> I wonder if there is any way to make the bio mergeable for Linux 3.10.

This issue (for 4.4) was introduced recently; I don't think the 3.10 issue is the same.
Comment 14 Shaohua Li 2016-05-09 04:13:27 UTC
(In reply to h_o_l_g_e_r from comment #11)
> Created attachment 214961 [details]
> Kernel config
> 
> That is a plain 4.4.8 from kernel.org. I checked the /etc directory to see if
> it is set somewhere there, but that is not the case. The distro I am using is
> Scientific Linux 7.2 (Red Hat).

I think something in your distro changes the nomerges setting. The kernel should have it set to 0 by default, as far as I can tell. My test distro (Ubuntu) doesn't change nomerges to 2.
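
Worth grepping for where it gets set, e.g. (paths for a RHEL-style distro; tuned is just a guess):

$ grep -r nomerges /etc/udev/rules.d /usr/lib/udev/rules.d 2>/dev/null
$ grep -r nomerges /etc/tuned /usr/lib/tuned 2>/dev/null
$ tuned-adm active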
Comment 15 Sam McLeod 2016-05-16 02:14:18 UTC
We've noticed device dropouts from all our RAID arrays on kernel-ml-4.5.4-1 on CentOS 7.

These appeared to start occurring around the time that Kernel 4.4 was released.

The block layout for each disk is as such:

sda                  0      512B       4G         1
└─sda1               0      512B       4G         1
  └─md100            0      512B     512K         1
    └─one-r0         0      512B     512K         1
      └─drbd0        0      512B     512K         1

So, device -> partition -> MDADM -> LVM -> DRBD

From there we host an iSCSI LUN using LIO-t to our virtualisation hosts, which do not support or issue discards / TRIM at present.


Discard is enabled in:

- /etc/fstab
- /etc/lvm/lvm.conf

When we see the dropouts, it's during the standard mdadm cron checks; we have just applied an increased timeout as suggested here: https://blog.fastmail.com/2010/08/18/scsi-hbas-raid-controllers-and-timeouts/


FYI, when we see the disks dropping out we get errors like these:

```
[Sun May 15 13:48:07 2016] sd 0:0:2:0: attempting task abort! scmd(ffff8804621da180)
[Sun May 15 13:48:07 2016] sd 0:0:2:0: [sdc] tag#35 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[Sun May 15 13:48:07 2016] scsi target0:0:2: handle(0x000c), sas_address(0x500304801eccd382), phy(2)
[Sun May 15 13:48:07 2016] scsi target0:0:2: enclosure_logical_id(0x500304801eccd3bf), slot(2)
[Sun May 15 13:48:07 2016] scsi target0:0:2: enclosure level(0x0000),connector name(    )
[Sun May 15 13:48:08 2016] sd 0:0:2:0: task abort: SUCCESS scmd(ffff8804621da180)
[Sun May 15 13:48:08 2016] sd 0:0:2:0: [sdc] tag#0 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[Sun May 15 13:48:08 2016] mpt3sas_cm0:     sas_address(0x500304801eccd382), phy(2)
[Sun May 15 13:48:08 2016] mpt3sas_cm0:     enclosure_logical_id(0x500304801eccd3bf),slot(2)
[Sun May 15 13:48:08 2016] mpt3sas_cm0:     enclosure level(0x0000), connector name(     )
[Sun May 15 13:48:08 2016] mpt3sas_cm0:     handle(0x000c), ioc_status(success)(0x0000), smid(107)
[Sun May 15 13:48:08 2016] mpt3sas_cm0:     request_len(0), underflow(0), resid(-65536)
[Sun May 15 13:48:08 2016] mpt3sas_cm0:     tag(65535), transfer_count(65536), sc->result(0x00000000)
[Sun May 15 13:48:08 2016] mpt3sas_cm0:     scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[Sun May 15 13:48:08 2016] mpt3sas_cm0:     [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
[Sun May 15 13:48:08 2016] blk_update_request: I/O error, dev sdc, sector 2064
[Sun May 15 13:48:08 2016] md: super_written gets error=-5
[Sun May 15 13:48:08 2016] md/raid10:md100: Disk failure on sdc1, disabling device.
md/raid10:md100: Operation continuing on 7 devices.
[Sun May 15 13:48:08 2016] md: md100: data-check interrupted.
```
Comment 16 Shaohua Li 2016-05-16 16:49:15 UTC
Sam, please open a new bug for your issue rather than reporting unrelated issues here.

Park Ju Hyung, can you close this bug please? The patch is already upstream.
Comment 17 Aleksey Obitotskiy 2016-05-25 09:07:10 UTC
(In reply to Zhigang Lu from comment #12)
> I encountered the same issue with raid10 on the long-term 3.10.101 kernel. I
> noticed that the difference between the softraid device and the SSD is
> discard_max_bytes.
> 
> [root~]# cat /sys/block/md0/queue/discard_max_bytes 
> 524288   ---------------->softraid chunk size
> [root~]# cat /sys/block/sda/queue/discard_max_bytes 
> 2147450880
> 
> I wonder if there is any way to make the bio mergeable for Linux 3.10.

The proposed patch enables DISCARD merging. It was created and tested against 3.10.0-327.
Some of the changes are already upstream:
e548ca4ee
9c573de32
03100aada
ef2d4615c

--- linux-3.10.0-327.el7-orig/include/linux/blkdev.h 2015-10-29 16:56:51.000000000 -0400
+++ linux-3.10.0-327.el7/include/linux/blkdev.h 2016-05-19 10:23:04.828316387 -0400
@@ -969,7 +969,7 @@ static inline unsigned int blk_rq_get_ma
        if (unlikely(rq->cmd_type == REQ_TYPE_BLOCK_PC))
                return q->limits.max_hw_sectors;

-       if (!q->limits.chunk_sectors)
+       if (!q->limits.chunk_sectors || (rq->cmd_flags & REQ_DISCARD))
                return blk_queue_get_max_sectors(q, rq->cmd_flags);

        return min(blk_max_size_offset(q, blk_rq_pos(rq)),
--- linux-3.10.0-327.el7-orig/block/blk-merge.c 2015-10-29 16:56:51.000000000 -0400
+++ linux-3.10.0-327.el7/block/blk-merge.c      2016-05-23 05:48:29.816414501 -0400
@@ -271,6 +271,9 @@ static inline int ll_new_hw_segment(stru
 {
        int nr_phys_segs = bio_phys_segments(q, bio);

+       if (bio->bi_rw & REQ_DISCARD && nr_phys_segs == 0)
+               nr_phys_segs = 1;
+
        if (req->nr_phys_segments + nr_phys_segs > queue_max_segments(q))
                goto no_merge;

--- linux-3.10.0-327.el7-orig/drivers/md/md.c   2015-10-29 16:56:51.000000000 -0400
+++ linux-3.10.0-327.el7/drivers/md/md.c        2016-05-17 02:57:34.735055093 -0400
@@ -283,6 +283,8 @@ static void md_make_request(struct reque
         * go away inside make_request
         */
        sectors = bio_sectors(bio);
+       /* bio could be mergeable after passing to underlayer */
+       bio->bi_rw &= ~REQ_NOMERGE;
        mddev->pers->make_request(mddev, bio);

        cpu = part_stat_lock();
--- linux-3.10.0-327.el7-orig/drivers/block/nvme-core.c 2015-10-29 16:56:51.000000000 -0400
+++ linux-3.10.0-327.el7/drivers/block/nvme-core.c      2016-05-23 05:37:44.780382376 -0400
@@ -2015,9 +2015,10 @@ static void nvme_alloc_ns(struct nvme_de
        ns->queue = blk_mq_init_queue(&dev->tagset);
        if (IS_ERR(ns->queue))
                goto out_free_ns;
-       queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
        queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
-       queue_flag_set_unlocked(QUEUE_FLAG_SG_GAPS, ns->queue);
        ns->dev = dev;
        ns->queue->queuedata = ns;