The ext4 journaling file system provides three modes: journal, ordered, and writeback. The first is denoted as 'journal mode' in the following context. In journal mode, data is written twice: once to the journal area and once to the client file system. If the journal area and the client file system are both located on the same disk, there is at least a 50% performance degradation compared to ordered mode. But what if we put the journal area in a ramdisk? I ran the following tests. They show that ext4 with full data journaling has unreasonable performance degradation even when the journal area is located in a ramdisk.

Test environment:
CPU: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz
RAM: 8GB
Filesystem: ext4
Linux version: 3.4.6
RAID composed of 6 x 1TB HDDs
Command: time dd if=/dev/zero of=Write_File bs=1M count=51200

Volume type    ordered mode   journal mode   degradation
Single disk    173MB/s        144MB/s        17%
RAID0          937MB/s        375MB/s        60%
RAID5          732MB/s        132MB/s        82%

Does anyone know where the bottleneck may be?
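For reference, an external ext4 journal on a ramdisk can be set up along these lines. This is a sketch, not the reporter's exact setup: the device names /dev/ram0 and /dev/md0, the mount point, and the ramdisk size are assumptions.

```shell
# Sketch: put an ext4 external journal on a ramdisk (device names assumed).
modprobe brd rd_nr=1 rd_size=131072              # 128MB ramdisk appears as /dev/ram0
mke2fs -O journal_dev -b 4096 /dev/ram0          # format the ramdisk as a journal device
mkfs.ext4 -b 4096 -J device=/dev/ram0 /dev/md0   # ext4 on the RAID, journal on the ramdisk
mount -o data=journal /dev/md0 /mnt/test         # mount with full data journaling
```

Note that the journal device and the file system must use the same block size (hence -b 4096 on both).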
On Fri, Mar 07, 2014 at 11:39:40AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> JFS provides three modes, journal, ordered and writeback.
> The first mode is denoted as 'journal mode' in the following context.
> In the journal mode, data should be written twice, one for the journal area
> and the other for the client file system. If the journal area and the client
> file system are both located in the disk, it has at least 50% performance
> degradation compared to ordered mode.
> But what if we put the journal area in a ramdisk?

How big was the ramdisk? Since all of the blocks are going through the journal, even if it is in a ramdisk, it requires more commits and thus more checkpoint operations, which means more updates to the disk. A bigger journal will help minimize this issue.

Would you be willing to grab block traces for both the disk and the external journal device?

I will add that the workload of "dd if=/dev/zero of=file" is probably the worst case for data=journal, and if that's really what you are doing, it's a case of "doctor, doctor, it hurts when I do that". All file system modes have strengths and weaknesses, and your use case is one where I would simply tell people, "don't use that mode".

If you want to work on improving it, that's great. Gather data, and if we can figure out an easy way to improve things, great. But I'll warn you ahead of time that this is not necessarily something I view as "unreasonable", nor is it something that I would consider a high-priority thing to fix.

- Ted
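Capturing block traces for both devices might look roughly like this. The device names, mount point, and dd target are assumptions; blktrace requires root and a mounted debugfs.

```shell
# Sketch: trace the data disk and the external journal device during the workload.
mount -t debugfs none /sys/kernel/debug 2>/dev/null   # blktrace reads events via debugfs
blktrace -d /dev/md0  -o md0  & P1=$!                 # trace the RAID (data) device
blktrace -d /dev/ram0 -o ram0 & P2=$!                 # trace the journal device
dd if=/dev/zero of=/mnt/test/Write_File bs=1M count=5120
kill -INT "$P1" "$P2"; wait                           # stop both tracers cleanly
blkparse -i md0  -d md0.bin  && btt -i md0.bin        # per-device summary, incl. Q2Q
blkparse -i ram0 -d ram0.bin && btt -i ram0.bin
```

blkparse merges the per-CPU trace files, and btt then reports the timing statistics (Q2Q, D2C, etc.) discussed later in this thread.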
> How big was the ramdisk? Since all of the blocks are going through
> the journal, even if it is on the journal, it requires more commits
> and thus more checkpoint operations, which means more updates to the
> disk. A bigger journal will help minimize this issue.
>
> Would you be willing to grab block traces for both the disk and the
> external journal device?
[...]
> - Ted

I used two sizes of ramdisk, 128MB and 1024MB. With the 1024MB journal area, performance is slightly improved, but the degradation is still significant.

I am willing to grab block traces. Please tell me how to get the traces you want.

As you can see, the performance degradation with data=journal on RAID5 is over 80%, which makes it hard to use. If I know where the problem is, I will try to improve it. Thanks for your help.
Test command: dd if=/dev/zero of=/share/CACHEDEV2_DATA/orderno5G bs=1M count=5120

The blktrace results can be downloaded here:
https://dl.dropboxusercontent.com/u/32959539/blktrace_original_data.7z
The btt results can be downloaded here:
https://dl.dropboxusercontent.com/u/32959539/blktrace_output.7z

There are three folders:
/journal          --> ext4, data=journal             375MB/s
/order_delalloc   --> ext4, data=ordered, delalloc   937MB/s
/order_nodelalloc --> ext4, data=ordered, nodelalloc 353MB/s

The write requests with data=ordered,delalloc are mostly 64k in size, while those with data=journal and data=ordered,nodelalloc are mostly 4k. If we look at the average Q2Q in the btt results, which is the time interval between requests, we have:

/journal           0.00081
/order_nodelalloc  0.00031

This seems to be the reason for the performance degradation. Does anyone know why the time interval between requests is significantly longer with data=journal?
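As a sanity check on the btt numbers, the mean of a column of Q2Q samples (in seconds) can be recomputed with a one-liner like the following. The single-column file q2q_samples.txt is an assumption about how the per-IO samples were exported from btt, not a file from the archives above.

```shell
# Average a single column of inter-request intervals (seconds per line).
awk '{ sum += $1; n++ } END { if (n) printf "%.5f\n", sum / n }' q2q_samples.txt
```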
Could you retry your measurements using the latest kernel? At least 3.11, and preferably 3.13. We significantly optimized the write path for the nodelalloc case in 3.11. That should fix the average size being so small for the nodelalloc case:

commit 20970ba65d5a22f2e4efbfa100377722fde56935
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Jun 6 14:00:46 2013 -0400

    ext4: use ext4_da_writepages() for all modes

    Rename ext4_da_writepages() to ext4_writepages() and use it for all
    modes.  We still need to iterate over all the pages in the case of
    data=journalling, but in the case of nodelalloc/data=ordered (which is
    what file systems mounted using ext3 backwards compatibility will use)
    this will allow us to use a much more efficient I/O submission path.

    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
(In reply to Theodore Tso from comment #4)
> Could you retry your measurements using the latest kernel? At least 3.11,
> and preferably 3.13.
>
> We significantly optimized the write path for the nodelalloc case in 3.11.
> That should fix the average size being so small for the nodelalloc case.

Thanks for your advice. I changed the kernel version from 3.4 to 3.11. The write throughput with data=journal is still only about 40% of data=ordered. What do you think may be wrong in this mode?

---------------------
Test environment:
CPU: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz
RAM: 8GB
Filesystem: ext4
Linux version: ubuntu-saucy with kernel 3.11
Journal area: 128MB ramdisk
RAID0 composed of 6 x 1TB HDDs
Command: time dd if=/dev/zero of=Write_File bs=1M count=5120

data=journal            -> 398MB/s
data=ordered            -> 1.1GB/s
data=ordered,nodelalloc -> 1GB/s

blktrace results can be downloaded here:
https://dl.dropboxusercontent.com/u/32959539/blktrace_with_3_11.7z
btt results can be downloaded here:
https://dl.dropboxusercontent.com/u/32959539/BTT_with_3_11.7z

For Linux 3.4, the write throughputs were:
data=journal            -> 397MB/s
data=ordered            -> 937MB/s
data=ordered,nodelalloc -> 863MB/s*

*PS: the previous figure was mistaken; please use this version.
Could you try changing JBD2_NR_BATCH (defined in include/linux/jbd2.h) from 64 to 256? If that improves things, you can try 512, but that might not make a further difference.
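The change Ted suggests can be made with a quick sed over the header in the kernel source tree before rebuilding. This is a sketch; it assumes you are at the top of the source tree and that the define still has its 3.11-era form.

```shell
# Bump the jbd2 batch-write limit from 64 to 256 in the kernel source tree,
# then show the line to confirm the new value before rebuilding.
sed -i 's/\(JBD2_NR_BATCH[[:space:]]*\)64/\1256/' include/linux/jbd2.h
grep 'JBD2_NR_BATCH' include/linux/jbd2.h
```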
(In reply to Theodore Tso from comment #6)
> Could you try changing JBD2_NR_BATCH (defined in include/linux/jbd2.h) from
> 64 to 256? If that improves, you can try 512, but that might not make a
> further difference.

Thanks for your advice. The performance improved as follows:

data=journal, JBD2_NR_BATCH=64  -> 398MB/s
data=journal, JBD2_NR_BATCH=256 -> 412MB/s
data=ordered                    -> 1.1GB/s

Do you have any other suggestions?
It's clear this isn't going to get performance up to 1.1GB/s, but I'm curious how much setting JBD2_NR_BATCH changes things at 512 and 1024, and possibly even 2048. Once it no longer makes a difference, if you could do another blktrace and also gather lock_stat information, that would be useful.

To gather lock_stat information, enable CONFIG_LOCK_STAT, then "echo 0 > /proc/lock_stat" before you start the workload, and capture the output of /proc/lock_stat after you finish running your workload/benchmark.

If you also regather numbers with lock_stat enabled on a stock 3.11 kernel (and also get a /proc/lock_stat report from a stock 3.11 kernel, with and without data=journal), that would be useful.

If it turns out that there is some lock contention going on with some of the jbd2 spinlocks, there are some patches queued for 3.15 that I may have to ask you to try (which will mean going to something like 3.14-rc7 plus some additional patches from the ext4 git tree).

Thanks for your benchmarking!
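The lock_stat collection Ted describes boils down to a few commands. This assumes a kernel built with CONFIG_LOCK_STAT=y and the same hypothetical mount point used earlier; the interfaces are the ones documented in the kernel's lockstat documentation.

```shell
# Sketch: collect lock contention statistics around the dd workload.
echo 1 > /proc/sys/kernel/lock_stat        # make sure collection is enabled
echo 0 > /proc/lock_stat                   # writing 0 clears the counters
dd if=/dev/zero of=/mnt/test/Write_File bs=1M count=5120   # the workload
cat /proc/lock_stat > lock_stat_result     # snapshot the contention report
```

The report lists each lock class with its contention and hold-time statistics, which is what would reveal contention on the jbd2 spinlocks Ted mentions.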
(In reply to Theodore Tso from comment #8)
> It's clear this isn't going to get performance up to 1.1 GB/s, but I'm
> curious how much setting JBD2_NR_BATCH changes things at 512 and 1024 and
> possibly even 2048. Once it no longer makes a difference, if you could do
> another blktrace, and also gather lock_stat information, that would be
> useful.
[...]
> Thanks for your benchmarking!

Thanks for your advice. With the same environment settings and 'dd' command, the results for JBD2_NR_BATCH with data=journal are as follows:

JBD2_NR_BATCH = 64   -> 386MB/s
JBD2_NR_BATCH = 256  -> 400MB/s
JBD2_NR_BATCH = 512  -> 407MB/s
JBD2_NR_BATCH = 1024 (with CONFIG_LOCK_STAT enabled) -> 304MB/s
JBD2_NR_BATCH = 2048 -> 440MB/s

---------------------------
/proc/lock_stat report with data=journal:
data collected at the end of "dd":
https://dl.dropboxusercontent.com/u/32959539/lock_stat_result
data collected in the middle of "dd" execution:
https://dl.dropboxusercontent.com/u/32959539/lock_stat_result2
/proc/lock_stat report with data=ordered:
https://dl.dropboxusercontent.com/u/32959539/lock_stat_ordered
------------------------------
blktrace with JBD2_NR_BATCH = 2048 and data=journal:
https://dl.dropboxusercontent.com/u/32959539/JBD2_NR_batch_2048.7z

Please tell me if you need further information.
BTW, if we use a 1GB ramdisk as the journal area, throughput with JBD2_NR_BATCH = 2048 reaches 571MB/s. The blktrace result can be downloaded here:
https://dl.dropboxusercontent.com/u/32959539/JBD2_NR_batch_2048_1M.7z