Bug 71641 - Unreasonable performance degradation in ext4 with full data journaling
Summary: Unreasonable performance degradation in ext4 with full data journaling
Status: REOPENED
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-07 11:39 UTC by Chia-Hung Chang
Modified: 2016-03-23 18:11 UTC
CC List: 3 users

See Also:
Kernel Version: 3.11
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Chia-Hung Chang 2014-03-07 11:39:40 UTC
Ext4 provides three journaling modes: journal, ordered, and writeback.
The first mode is referred to as 'journal mode' in the following context.
In journal mode, data is written twice: once to the journal area and once to the client file system. If the journal area and the client file system are both located on the same disk, there is at least a 50% performance degradation compared to ordered mode.
But what if we put the journal area on a ramdisk?
I ran the following tests. They show that ext4 with full data journaling suffers an unreasonable performance degradation even when the journal area is located on a ramdisk.
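For reference, an external journal on a ramdisk can be set up roughly like this (the device names and the ramdisk size below are illustrative):

    # load the brd ramdisk driver; rd_size is in KiB (131072 KiB = 128MB)
    modprobe brd rd_nr=1 rd_size=131072
    # format the ramdisk as an external journal device (block size must match the fs)
    mke2fs -O journal_dev -b 4096 /dev/ram0
    # create ext4 on the data device, pointing it at the external journal
    mkfs.ext4 -b 4096 -J device=/dev/ram0 /dev/md0
    # mount with full data journaling
    mount -o data=journal /dev/md0 /mnt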

Test environment:
CPU: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz
RAM: 8GB
Filesystem: ext4
Linux version: 3.4.6
RAID arrays: composed of 6 x 1TB HDs
Command: time dd if=/dev/zero of=Write_File bs=1M count=51200

Volume type     ordered mode    journal mode    degradation
Single disk     173MB/s         144MB/s         17%
RAID0           937MB/s         375MB/s         60%
RAID5           732MB/s         132MB/s         82%

Does anyone know where the bottleneck may be?
Comment 1 Theodore Tso 2014-03-07 16:20:34 UTC
On Fri, Mar 07, 2014 at 11:39:40AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> Ext4 provides three journaling modes: journal, ordered, and writeback.
> The first mode is referred to as 'journal mode' in the following context.
> In journal mode, data is written twice: once to the journal area and once
> to the client file system. If the journal area and the client file system
> are both located on the same disk, there is at least a 50% performance
> degradation compared to ordered mode.
> But what if we put the journal area on a ramdisk?

How big was the ramdisk?  Since all of the blocks are going through
the journal, even if the journal is on a ramdisk, it requires more
commits and thus more checkpoint operations, which means more updates
to the disk.  A bigger journal will help minimize this issue.

Would you be willing to grab block traces for both the disk and the
external journal device?
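Something along these lines should do it; the device names and trace
duration below are just placeholders:

    # trace the RAID device and the external journal while dd runs
    blktrace -w 60 -d /dev/md0 -d /dev/ram0 &
    dd if=/dev/zero of=Write_File bs=1M count=5120
    wait
    # turn the per-CPU traces into readable text plus a binary dump for btt
    blkparse -i md0  -d md0.bin  > md0.txt
    blkparse -i ram0 -d ram0.bin > ram0.txt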

I will add that the workload of "dd if=/dev/zero of=file" is probably
the worst case for data=journal, and if that's really what you are
doing, it's a case of "doctor, doctor, it hurts when I do that".  All
file system modes have strengths and weaknesses, and your use case is
one where I would simply tell people, "don't use that mode".

If you want to work on improving it, that's great.  Gather data, and
if we can figure out an easy way to improve things, great.  But I'll
warn you ahead of time this is not necessarily something I view as
"unreasonable", nor is it something that I would consider a high
priority thing to fix.

						- Ted
Comment 2 Chia-Hung Chang 2014-03-07 17:57:32 UTC
> 
> How big was the ramdisk?  Since all of the blocks are going through
> the journal, even if the journal is on a ramdisk, it requires more
> commits and thus more checkpoint operations, which means more updates
> to the disk.  A bigger journal will help minimize this issue.
> 
> Would you be willing to grab block traces for both the disk and the
> external journal device?
> 
> I will add that the workload of "dd if=/dev/zero of=file" is probably
> the worst case for data=journal, and if that's really what you are
> doing, it's a case of "doctor, doctor, it hurts when I do that".  All
> file system modes have strengths and weaknesses, and your use case is
> one where I would simply tell people, "don't use that mode".
> 
> If you want to work on improving it, that's great.  Gather data, and
> if we can figure out an easy way to improve things, great.  But I'll
> warn you ahead of time this is not necessarily something I view as
> "unreasonable", nor is it something that I would consider a high
> priority thing to fix.
> 
> 						- Ted
I used two sizes of ramdisk, 128MB and 1024MB. With a 1024MB journal area, the performance is slightly improved, but the degradation is still significant.
 
I am willing to grab block traces. Please tell me how to get the traces you want.

As you can see, the performance degradation with data=journal on RAID5 is around 80%, which makes that mode hard to use. If I find out where the problem is, I will try to improve it.

Thanks for your help.
Comment 3 Chia-Hung Chang 2014-03-19 09:35:03 UTC
Test command :  dd if=/dev/zero of=/share/CACHEDEV2_DATA/orderno5G bs=1M count=5120

The block trace results can be downloaded here:
https://dl.dropboxusercontent.com/u/32959539/blktrace_original_data.7z

The btt results can be downloaded here:
https://dl.dropboxusercontent.com/u/32959539/blktrace_output.7z

There are three folders:
/journal              -->ext4,data=journal              with 375MB/s
/order_delalloc       -->ext4,data=ordered,delalloc     with 937MB/s
/order_nodelalloc     -->ext4,data=ordered,nodelalloc   with 353MB/s


The write request size with data=ordered,delalloc is mostly 64k.
The write request size with data=journal and with data=ordered,nodelalloc is mostly 4k.

If we look at the average Q2Q in the btt results (the time interval between queued requests, in seconds), we have

/journal           0.00081
/order_nodelalloc  0.00031

This seems to be the reason for the performance degradation.
Does anyone know why the time interval between requests is significantly longer with data=journal?
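
For reference, the Q2Q figures come from running btt on the binary dump written by blkparse, roughly like this (file names are illustrative):

    blkparse -i md0 -d md0.bin -O    # -O: write only the binary dump, no text output
    btt -i md0.bin                   # the summary table has a Q2Q row; times are in seconds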
Comment 4 Theodore Tso 2014-03-20 04:09:50 UTC
Could you retry your measurements using the latest kernel?  At least 3.11, and preferably 3.13.

We significantly optimized the write path for the nodelalloc case in 3.11.  That should fix the average size being so small for the nodelalloc case:


commit 20970ba65d5a22f2e4efbfa100377722fde56935
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Jun 6 14:00:46 2013 -0400

    ext4: use ext4_da_writepages() for all modes
    
    Rename ext4_da_writepages() to ext4_writepages() and use it for all
    modes.  We still need to iterate over all the pages in the case of
    data=journalling, but in the case of nodelalloc/data=ordered (which is
    what file systems mounted using ext3 backwards compatibility will use)
    this will allow us to use a much more efficient I/O submission path.
    
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Comment 5 Chia-Hung Chang 2014-03-21 08:48:34 UTC
(In reply to Theodore Tso from comment #4)
> Could you retry your measurements using the latest kernel?  At least 3.11,
> and preferably 3.13.
> 
> We significantly optimized the write path for the nodelalloc case in 3.11. 
> That should fix the average size being so small for the nodelalloc case:
> 
> 

Thanks for your advice. I changed the kernel from 3.4 to 3.11. The write throughput with data=journal is still only about 40% of data=ordered. What do you think might be wrong in this mode?
---------------------

Test environment:
CPU: Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz
RAM: 8GB
Filesystem: ext4
Linux version: ubuntu-saucy with kernel 3.11

A 128MB ramdisk is used as the journal area.

RAID0: composed of 6 x 1TB HDs
Command: time dd if=/dev/zero of=Write_File bs=1M count=5120


data=journal              -> 398MB/s
data=ordered              -> 1.1GB/s
data=ordered,nodelalloc   -> 1GB/s

blktrace results can be downloaded from:
https://dl.dropboxusercontent.com/u/32959539/blktrace_with_3_11.7z

btt results can be downloaded from:
https://dl.dropboxusercontent.com/u/32959539/BTT_with_3_11.7z


For Linux 3.4, the write throughputs are:

data=journal              -> 397MB/s
data=ordered              -> 937MB/s
data=ordered,nodelalloc   -> 863MB/s*

*P.S. The previously reported numbers were mistaken; please use this version.
Comment 6 Theodore Tso 2014-03-21 16:04:12 UTC
Could you try changing JBD2_NR_BATCH (defined in include/linux/jbd2.h) from 64 to 256?  If that improves, you can try 512, but that might not make a further difference.
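The change itself is a one-line edit in the kernel tree, roughly as follows (rebuild and reboot afterwards):

    # in include/linux/jbd2.h, change
    #     #define JBD2_NR_BATCH   64
    # to
    #     #define JBD2_NR_BATCH   256
    # then rebuild and install the kernel, e.g. (as root)
    make -j"$(nproc)" && make modules_install install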
Comment 7 Chia-Hung Chang 2014-03-26 10:30:36 UTC
(In reply to Theodore Tso from comment #6)
> Could you try changing JBD2_NR_BATCH (defined in include/linux/jbd2.h) from
> 64 to 256?  If that improves, you can try 512, but that might not make a
> further difference.

Thanks for your advice. The performance improves as follows.

data=journal  JBD2_NR_BATCH=64   -> 398MB/s
data=journal  JBD2_NR_BATCH=256  -> 412MB/s
data=ordered                     -> 1.1GB/s

Do you have any other suggestions?
Comment 8 Theodore Tso 2014-03-26 13:40:42 UTC
It's clear this isn't going to get performance up to 1.1 GB/s, but I'm curious how much setting JBD2_NR_BATCH changes things at 512 and 1024 and possibly even 2048.   Once it no longer makes a difference, if you could do another blktrace, and also gather lock_stat information, that would be useful.

To gather lock_stat information, enable CONFIG_LOCK_STAT, and then "echo 0  > /proc/lock_stat" before you start the workload, and then capture the output of /proc/lock_stat after you finish running your workload/benchmark.
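
Concretely, the sequence would look something like this (the output file name is just an example):

    echo 0 > /proc/lock_stat                       # reset the counters before the run
    time dd if=/dev/zero of=Write_File bs=1M count=5120
    cat /proc/lock_stat > lock_stat_journal.txt    # save the report for this run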

If you also regather numbers with lock_stat enabled on a stock 3.11 kernel (and also get a /proc/lock_stat report from a stock 3.11 kernel, with and without data=journal), that would be useful.

If it turns out that there is some lock contention going on with some of the jbd2 spinlocks, there are some patches queued for 3.15 that I may have to ask you to try (which will mean going to something like 3.14-rc7 plus some additional patches from the ext4 git tree).

Thanks for your benchmarking!
Comment 9 Chia-Hung Chang 2014-03-28 10:23:52 UTC
(In reply to Theodore Tso from comment #8)
> it's clear this isn't going to get performance up to 1.1 GB/s, but I'm
> curious how much setting JBD2_NR_BATCH changes things at 512 and 1024 and
> possibly even 2048.   Once it no longer makes a difference, if you could do
> another blktrace, and also gather lock_stat information, that would be
> useful.
> 
> To gather lock_stat information, enable CONFIG_LOCK_STAT, and then "echo 0 
> > /proc/lock_stat" before you start the workload, and then capture the
> output of /proc/lock_stat after you finish running your workload/benchmark.
> 
> If you also regather numbers with lock_stat enabled on a stock 3.11 kernel
> (and also get a /proc/lock_stat report from a stock 3.11 kernel, with and
> without data=journal), that would be useful.
> 
> If it turns out that there is some lock contention going on with some of the
> jbd2 spinlocks, there are some patches queued for 3.15 that I may have to
> ask you to try (which will mean going to something like 3.14-rc7 plus some
> additional patches from the ext4 git tree).
> 
> Thanks for your benchmarking!

Thanks for your advice.
With the same environment settings and 'dd' command, the benchmarks for JBD2_NR_BATCH with data=journal are as follows:

JBD2_NR_BATCH=64    -> 386MB/s
JBD2_NR_BATCH=254   -> 400MB/s
JBD2_NR_BATCH=512   -> 407MB/s
JBD2_NR_BATCH=1024 (with CONFIG_LOCK_STAT enabled) -> 304MB/s
JBD2_NR_BATCH=2048  -> 440MB/s

---------------------------
/proc/lock_stat report with data=journal

data collected at the end of "dd"
https://dl.dropboxusercontent.com/u/32959539/lock_stat_result

data collected in the middle of "dd" execution
https://dl.dropboxusercontent.com/u/32959539/lock_stat_result2

/proc/lock_stat report with data=ordered
https://dl.dropboxusercontent.com/u/32959539/lock_stat_ordered

------------------------------
blktrace of JBD2_NR_BATCH=2048 with data=journal
https://dl.dropboxusercontent.com/u/32959539/JBD2_NR_batch_2048.7z


Please tell me if you need further information.
Comment 10 Chia-Hung Chang 2014-03-28 10:55:10 UTC
BTW, if we use a 1GB ramdisk as the journal area, the throughput with JBD2_NR_BATCH=2048 reaches 571MB/s.

The blktrace results can be downloaded here:
https://dl.dropboxusercontent.com/u/32959539/JBD2_NR_batch_2048_1M.7z
