Bug 86831 - wrong count of dirty pages when using AIO
Summary: wrong count of dirty pages when using AIO
Status: RESOLVED CODE_FIX
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-24 15:33 UTC by Markus Königshaus
Modified: 2014-12-12 16:13 UTC (History)
1 user (show)

See Also:
Kernel Version: 3.14.21
Tree: Mainline
Regression: No


Attachments

Description Markus Königshaus 2014-10-24 15:33:02 UTC
Hello,

we use a ARM custom Board with mysqld. Shuting down mysqld (with IAO support, on a ext3 formatted Harddrive) leads to a negative number of dirty pages (underrun to the counter). The negative number results in a drastic reduction of the write performance because the page cache is not used, because the kernel thinks it is still 2 ^ 32 dirty pages open. I found, the problem is mm/truncate.c->cancel_dirty_page()

To reproduce,first change cancel_dirty_page()

[...]
if (mapping && mapping_cap_account_dirty (mapping)) {
++WARN_ON ((int) global_page_state(NR_FILE_DIRTY) <0);
dec_zone_page_state (page, NR_FILE_DIRTY);
[...]

And test ->

[   21.427121] ------------[ cut here ]------------
[   21.431769] WARNING: CPU: 0 PID: 309 at mm/truncate.c:78 cancel_dirty_page+0x144/0x1cc()
[   21.439899] Modules linked in: wutbox_cp sata_mv
[   21.444572] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #77
[   21.451395] Workqueue: events free_ioctx
[   21.455362] [<c0016a64>] (unwind_backtrace) from [<c0012f88>] (show_stack+0x20/0x24)
[   21.463164] [<c0012f88>] (show_stack) from [<c03f8bec>] (dump_stack+0x24/0x28)
[   21.470441] [<c03f8bec>] (dump_stack) from [<c0023ae4>] (warn_slowpath_common+0x84/0x9c)
[   21.478577] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>] (warn_slowpath_null+0x2c/0x34)
[   21.487402] [<c0023bb8>] (warn_slowpath_null) from [<c00c05f0>] (cancel_dirty_page+0x144/0x1cc)
[   21.496125] [<c00c05f0>] (cancel_dirty_page) from [<c00c073c>] (truncate_inode_page+0x8c/0x158)
[   21.504874] [<c00c073c>] (truncate_inode_page) from [<c00c09c4>] (truncate_inode_pages_range+0x11c/0x53c)
[   21.514493] [<c00c09c4>] (truncate_inode_pages_range) from [<c00c0e9c>] (truncate_pagecache+0x88/0xac)
[   21.523855] [<c00c0e9c>] (truncate_pagecache) from [<c00c0f1c>] (truncate_setsize+0x5c/0x74)
[   21.532332] [<c00c0f1c>] (truncate_setsize) from [<c013b2d8>] (put_aio_ring_file.isra.14+0x34/0x90)
[   21.541428] [<c013b2d8>] (put_aio_ring_file.isra.14) from [<c013b354>] (aio_free_ring+0x20/0xcc)
[   21.550262] [<c013b354>] (aio_free_ring) from [<c013b424>] (free_ioctx+0x24/0x44)
[   21.557791] [<c013b424>] (free_ioctx) from [<c003d8d8>] (process_one_work+0x134/0x47c)
[   21.565732] [<c003d8d8>] (process_one_work) from [<c003e988>] (worker_thread+0x130/0x414)
[   21.575225] [<c003e988>] (worker_thread) from [<c00448ac>] (kthread+0xd4/0xec)
[   21.590695] [<c00448ac>] (kthread) from [<c000ec18>] (ret_from_fork+0x14/0x20)
[   21.600721] ---[ end trace a9d8eecdd373e997 ]---

With disabled AIO the error disappear. 

Markus
Comment 1 Andrew Morton 2014-10-27 23:46:43 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 24 Oct 2014 15:33:02 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=86831
> 
>             Bug ID: 86831
>            Summary: wrong count of dirty pages when using AIO
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 3.14.21
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: akpm@linux-foundation.org
>           Reporter: m.koenigshaus@wut.de
>         Regression: No
> 
> Hello,
> 
> we use a ARM custom Board with mysqld. Shuting down mysqld (with IAO support,
> on a ext3 formatted Harddrive) leads to a negative number of dirty pages
> (underrun to the counter). The negative number results in a drastic reduction
> of the write performance because the page cache is not used, because the
> kernel
> thinks it is still 2 ^ 32 dirty pages open. I found, the problem is
> mm/truncate.c->cancel_dirty_page()
> 
> To reproduce,first change cancel_dirty_page()
> 
> [...]
> if (mapping && mapping_cap_account_dirty (mapping)) {
> ++WARN_ON ((int) global_page_state(NR_FILE_DIRTY) <0);
> dec_zone_page_state (page, NR_FILE_DIRTY);
> [...]
> 
> And test ->

hm, I wonder what AIO is doing differently - cancel_dirty_page() isn't
specific to aio - it's used by all truncations.

Just to sanity check, could you please try something like this?

--- a/include/linux/vmstat.h~a
+++ a/include/linux/vmstat.h
@@ -241,6 +241,8 @@ static inline void __inc_zone_state(stru
 static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	atomic_long_dec(&zone->vm_stat[item]);
+	WARN_ON_ONCE(item == NR_FILE_DIRTY &&
+		atomic_long_read(&zone->vm_stat[item]) < 0);
 	atomic_long_dec(&vm_stat[item]);
 }
 

That should catch the first offending decrement, although yes, it's
probably cancel_dirty_page().

(This assumes you're using an SMP kernel.  If not,
mm/vmstat.c:dec_zone_page_state() will need to be changed instead)
Comment 2 Markus Königshaus 2014-10-28 10:02:43 UTC
Am 28.10.2014 um 00:46 schrieb bugzilla-daemon@bugzilla.kernel.org:
> https://bugzilla.kernel.org/show_bug.cgi?id=86831
>
> --- Comment #1 from Andrew Morton <akpm@linux-foundation.org> ---
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Fri, 24 Oct 2014 15:33:02 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=86831
>>
>>             Bug ID: 86831
>>            Summary: wrong count of dirty pages when using AIO
>>            Product: Memory Management
>>            Version: 2.5
>>     Kernel Version: 3.14.21
>>           Hardware: All
>>                 OS: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: normal
>>           Priority: P1
>>          Component: Other
>>           Assignee: akpm@linux-foundation.org
>>           Reporter: m.koenigshaus@wut.de
>>         Regression: No
>>
>> Hello,
>>
>> we use a ARM custom Board with mysqld. Shuting down mysqld (with IAO
>> support,
>> on a ext3 formatted Harddrive) leads to a negative number of dirty pages
>> (underrun to the counter). The negative number results in a drastic
>> reduction
>> of the write performance because the page cache is not used, because the
>> kernel
>> thinks it is still 2 ^ 32 dirty pages open. I found, the problem is
>> mm/truncate.c->cancel_dirty_page()
>>
>> To reproduce,first change cancel_dirty_page()
>>
>> [...]
>> if (mapping && mapping_cap_account_dirty (mapping)) {
>> ++WARN_ON ((int) global_page_state(NR_FILE_DIRTY) <0);
>> dec_zone_page_state (page, NR_FILE_DIRTY);
>> [...]
>>
>> And test ->
> hm, I wonder what AIO is doing differently - cancel_dirty_page() isn't
> specific to aio - it's used by all truncations.
Thanks for your answer. I wonder too, but if I disable AIO - interface
in mysqld, the error disappearsreproducible.
>
> Just to sanity check, could you please try something like this?
>
> --- a/include/linux/vmstat.h~a
> +++ a/include/linux/vmstat.h
> @@ -241,6 +241,8 @@ static inline void __inc_zone_state(stru
>  static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item
> item)
>  {
>      atomic_long_dec(&zone->vm_stat[item]);
> +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
> +        atomic_long_read(&zone->vm_stat[item]) < 0);
>      atomic_long_dec(&vm_stat[item]);
>  }
We do not use an SMP kernel, since we use only a single core CPU. But
your proposed change is for the #else - branch and not for #ifdef
SMP_CONFIG. So it fits, my modification:

--- linux-3.14.1.org/include/linux/vmstat.h    2014-04-14
15:50:10.000000000 +0200
+++ linux-3.14.1/include/linux/vmstat.h    2014-10-28 10:28:40.385327906
+0100
@@ -239,6 +239,7 @@
 static inline void __dec_zone_state(struct zone *zone, enum
zone_stat_item item)
 {
     atomic_long_dec(&zone->vm_stat[item]);
+    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
atomic_long_read(&zone->vm_stat[item]) < 0);
     atomic_long_dec(&vm_stat[item]);
 }
 
Here are the results:

[   21.341632] ------------[ cut here ]------------
[   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
cancel_dirty_page+0x164/0x224()
[   21.355296] Modules linked in: wutbox_cp sata_mv
[   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[   21.366793] Workqueue: events free_ioctx
[   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
(show_stack+0x20/0x24)
[   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
(dump_stack+0x24/0x28)
[   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
(warn_slowpath_common+0x84/0x9c)
[   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
(warn_slowpath_null+0x2c/0x34)
[   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
(cancel_dirty_page+0x164/0x224)
[   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
(truncate_inode_page+0x8c/0x158)
[   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
(truncate_inode_pages_range+0x11c/0x53c)
[   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
[<c00c0f6c>] (truncate_pagecache+0x88/0xac)
[   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
(truncate_setsize+0x5c/0x74)
[   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
(put_aio_ring_file.isra.14+0x34/0x90)
[   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
[<c013b424>] (aio_free_ring+0x20/0xcc)
[   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
(free_ioctx+0x24/0x44)
[   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
(process_one_work+0x134/0x47c)
[   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
(worker_thread+0x130/0x414)
[   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
(kthread+0xd4/0xec)
[   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
(ret_from_fork+0x14/0x20)
[   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

>
>
> That should catch the first offending decrement, although yes, it's
> probably cancel_dirty_page().
>
> (This assumes you're using an SMP kernel.  If not,
> mm/vmstat.c:dec_zone_page_state() will need to be changed instead)
>

-- Unsere Aussagen koennen Irrtuemer und Missverstaendnisse enthalten.
Bitte pruefen Sie die Aussagen fuer Ihren Fall, bevor Sie Entscheidungen 
auf Grundlage dieser Aussagen treffen.
Wiesemann & Theis GmbH, Porschestr. 12, D-42279 Wuppertal
Geschaeftsfuehrer: Dipl.-Ing. Ruediger Theis
Registergericht: Amtsgericht Wuppertal, HRB 6377 
Tel. +49-202/2680-0, Fax +49-202/2680-265, http://www.wut.de
Comment 3 Gu Zheng 2014-10-29 09:01:47 UTC
+cc Ben

Hi Markus,
It seems that truncating aio ring file contributes to accounting dirty pages,
buf aio ring file is based on aiofs (an anonymous fs), which should not do
this.
I think the cause is that aio fs uses *default_backing_dev_info* as the
backing dev, which does not disable the dirty pages accounting capability.
Could you please try the following patch?
It introduces an aio private backing dev info which disabled the ACCT_DIRTY/
WRITEBACK/ACCT_WB capabilities.

Thanks,
Gu

---
diff --git a/fs/aio.c b/fs/aio.c
index 84a7510..eb6c22f 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -165,6 +165,12 @@ static struct vfsmount *aio_mnt;
 static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
+static struct backing_dev_info aio_fs_backing_dev_info = {
+       .name           = "aiofs",
+       .state          = 0,
+       .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
+};
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
        struct qstr this = QSTR_INIT("[aio]", 5);
@@ -176,6 +182,7 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 
        inode->i_mapping->a_ops = &aio_ctx_aops;
        inode->i_mapping->private_data = ctx;
+       inode->i_mapping->backing_dev_info = &aio_fs_backing_dev_info;
        inode->i_size = PAGE_SIZE * nr_pages;
 
        path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb, &this);
@@ -220,6 +227,9 @@ static int __init aio_setup(void)
        if (IS_ERR(aio_mnt))
                panic("Failed to create aio fs mount.");
 
+       if (bdi_init(&aio_fs_backing_dev_info))
+               panic("Failed to init aio fs backing dev info.");
+
        kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
        kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);


On 10/28/2014 05:57 PM, Markus Königshaus wrote:

> 
> Am 28.10.2014 um 00:46 schrieb bugzilla-daemon@bugzilla.kernel.org:
>> https://bugzilla.kernel.org/show_bug.cgi?id=86831
>>
>> --- Comment #1 from Andrew Morton <akpm@linux-foundation.org> ---
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Fri, 24 Oct 2014 15:33:02 +0000 bugzilla-daemon@bugzilla.kernel.org
>> wrote:
>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=86831
>>>
>>>             Bug ID: 86831
>>>            Summary: wrong count of dirty pages when using AIO
>>>            Product: Memory Management
>>>            Version: 2.5
>>>     Kernel Version: 3.14.21
>>>           Hardware: All
>>>                 OS: Linux
>>>               Tree: Mainline
>>>             Status: NEW
>>>           Severity: normal
>>>           Priority: P1
>>>          Component: Other
>>>           Assignee: akpm@linux-foundation.org
>>>           Reporter: m.koenigshaus@wut.de
>>>         Regression: No
>>>
>>> Hello,
>>>
>>> we use a ARM custom Board with mysqld. Shuting down mysqld (with IAO
>>> support,
>>> on a ext3 formatted Harddrive) leads to a negative number of dirty pages
>>> (underrun to the counter). The negative number results in a drastic
>>> reduction
>>> of the write performance because the page cache is not used, because the
>>> kernel
>>> thinks it is still 2 ^ 32 dirty pages open. I found, the problem is
>>> mm/truncate.c->cancel_dirty_page()
>>>
>>> To reproduce,first change cancel_dirty_page()
>>>
>>> [...]
>>> if (mapping && mapping_cap_account_dirty (mapping)) {
>>> ++WARN_ON ((int) global_page_state(NR_FILE_DIRTY) <0);
>>> dec_zone_page_state (page, NR_FILE_DIRTY);
>>> [...]
>>>
>>> And test ->
>> hm, I wonder what AIO is doing differently - cancel_dirty_page() isn't
>> specific to aio - it's used by all truncations.
> Thanks for your answer. I wonder too, but if I disable AIO - interface
> in mysqld, the error disappearsreproducible.
>>
>> Just to sanity check, could you please try something like this?
>>
>> --- a/include/linux/vmstat.h~a
>> +++ a/include/linux/vmstat.h
>> @@ -241,6 +241,8 @@ static inline void __inc_zone_state(stru
>>  static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item
>> item)
>>  {
>>      atomic_long_dec(&zone->vm_stat[item]);
>> +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
>> +        atomic_long_read(&zone->vm_stat[item]) < 0);
>>      atomic_long_dec(&vm_stat[item]);
>>  }
> We do not use an SMP kernel, since we use only a single core CPU. But
> your proposed change is for the #else - branch and not for #ifdef
> SMP_CONFIG. So it fits, my modification:
> 
> --- linux-3.14.1.org/include/linux/vmstat.h    2014-04-14
> 15:50:10.000000000 +0200
> +++ linux-3.14.1/include/linux/vmstat.h    2014-10-28 10:28:40.385327906
> +0100
> @@ -239,6 +239,7 @@
>  static inline void __dec_zone_state(struct zone *zone, enum
> zone_stat_item item)
>  {
>      atomic_long_dec(&zone->vm_stat[item]);
> +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
> atomic_long_read(&zone->vm_stat[item]) < 0);
>      atomic_long_dec(&vm_stat[item]);
>  }
>  
> Here are the results:
> 
> [   21.341632] ------------[ cut here ]------------
> [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
> cancel_dirty_page+0x164/0x224()
> [   21.355296] Modules linked in: wutbox_cp sata_mv
> [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
> [   21.366793] Workqueue: events free_ioctx
> [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
> (show_stack+0x20/0x24)
> [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
> (dump_stack+0x24/0x28)
> [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
> (warn_slowpath_common+0x84/0x9c)
> [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
> (warn_slowpath_null+0x2c/0x34)
> [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
> (cancel_dirty_page+0x164/0x224)
> [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
> (truncate_inode_page+0x8c/0x158)
> [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
> (truncate_inode_pages_range+0x11c/0x53c)
> [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
> [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
> [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
> (truncate_setsize+0x5c/0x74)
> [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
> (put_aio_ring_file.isra.14+0x34/0x90)
> [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
> [<c013b424>] (aio_free_ring+0x20/0xcc)
> [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
> (free_ioctx+0x24/0x44)
> [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
> (process_one_work+0x134/0x47c)
> [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
> (worker_thread+0x130/0x414)
> [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
> (kthread+0xd4/0xec)
> [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
> (ret_from_fork+0x14/0x20)
> [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
> 
>>
>>
>> That should catch the first offending decrement, although yes, it's
>> probably cancel_dirty_page().
>>
>> (This assumes you're using an SMP kernel.  If not,
>> mm/vmstat.c:dec_zone_page_state() will need to be changed instead)
>>
> 
> -- Unsere Aussagen koennen Irrtuemer und Missverstaendnisse enthalten.
> Bitte pruefen Sie die Aussagen fuer Ihren Fall, bevor Sie Entscheidungen 
> auf Grundlage dieser Aussagen treffen.
> Wiesemann & Theis GmbH, Porschestr. 12, D-42279 Wuppertal
> Geschaeftsfuehrer: Dipl.-Ing. Ruediger Theis
> Registergericht: Amtsgericht Wuppertal, HRB 6377 
> Tel. +49-202/2680-0, Fax +49-202/2680-265, http://www.wut.de
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majordomo@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
> .
>
Comment 4 Markus Königshaus 2014-10-29 15:27:04 UTC
Hi Gu,

Many thanks for your patch. It works! Crossceckingwithout your patch
leads back to the error.
Is your patch only preliminary or can I apply this change permanently
for my system?

Best regards, Markus
Am 29.10.2014 um 09:46 schrieb Gu Zheng:
> +cc Ben
>
> Hi Markus,
> It seems that truncating aio ring file contributes to accounting dirty pages,
> buf aio ring file is based on aiofs (an anonymous fs), which should not do
> this.
> I think the cause is that aio fs uses *default_backing_dev_info* as the
> backing dev, which does not disable the dirty pages accounting capability.
> Could you please try the following patch?
> It introduces an aio private backing dev info which disabled the ACCT_DIRTY/
> WRITEBACK/ACCT_WB capabilities.
>
> Thanks,
> Gu
>
> ---
> diff --git a/fs/aio.c b/fs/aio.c
> index 84a7510..eb6c22f 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -165,6 +165,12 @@ static struct vfsmount *aio_mnt;
>  static const struct file_operations aio_ring_fops;
>  static const struct address_space_operations aio_ctx_aops;
>  
> +static struct backing_dev_info aio_fs_backing_dev_info = {
> +       .name           = "aiofs",
> +       .state          = 0,
> +       .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
> +};
> +
>  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
>  {
>         struct qstr this = QSTR_INIT("[aio]", 5);
> @@ -176,6 +182,7 @@ static struct file *aio_private_file(struct kioctx *ctx,
> loff_t nr_pages)
>  
>         inode->i_mapping->a_ops = &aio_ctx_aops;
>         inode->i_mapping->private_data = ctx;
> +       inode->i_mapping->backing_dev_info = &aio_fs_backing_dev_info;
>         inode->i_size = PAGE_SIZE * nr_pages;
>  
>         path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb, &this);
> @@ -220,6 +227,9 @@ static int __init aio_setup(void)
>         if (IS_ERR(aio_mnt))
>                 panic("Failed to create aio fs mount.");
>  
> +       if (bdi_init(&aio_fs_backing_dev_info))
> +               panic("Failed to init aio fs backing dev info.");
> +
>         kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
>         kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
>
>
> On 10/28/2014 05:57 PM, Markus Königshaus wrote:
>
>> Am 28.10.2014 um 00:46 schrieb bugzilla-daemon@bugzilla.kernel.org:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=86831
>>>
>>> --- Comment #1 from Andrew Morton <akpm@linux-foundation.org> ---
>>> (switched to email.  Please respond via emailed reply-to-all, not via the
>>> bugzilla web interface).
>>>
>>> On Fri, 24 Oct 2014 15:33:02 +0000 bugzilla-daemon@bugzilla.kernel.org
>>> wrote:
>>>
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=86831
>>>>
>>>>             Bug ID: 86831
>>>>            Summary: wrong count of dirty pages when using AIO
>>>>            Product: Memory Management
>>>>            Version: 2.5
>>>>     Kernel Version: 3.14.21
>>>>           Hardware: All
>>>>                 OS: Linux
>>>>               Tree: Mainline
>>>>             Status: NEW
>>>>           Severity: normal
>>>>           Priority: P1
>>>>          Component: Other
>>>>           Assignee: akpm@linux-foundation.org
>>>>           Reporter: m.koenigshaus@wut.de
>>>>         Regression: No
>>>>
>>>> Hello,
>>>>
>>>> we use a ARM custom Board with mysqld. Shuting down mysqld (with IAO
>>>> support,
>>>> on a ext3 formatted Harddrive) leads to a negative number of dirty pages
>>>> (underrun to the counter). The negative number results in a drastic
>>>> reduction
>>>> of the write performance because the page cache is not used, because the
>>>> kernel
>>>> thinks it is still 2 ^ 32 dirty pages open. I found, the problem is
>>>> mm/truncate.c->cancel_dirty_page()
>>>>
>>>> To reproduce,first change cancel_dirty_page()
>>>>
>>>> [...]
>>>> if (mapping && mapping_cap_account_dirty (mapping)) {
>>>> ++WARN_ON ((int) global_page_state(NR_FILE_DIRTY) <0);
>>>> dec_zone_page_state (page, NR_FILE_DIRTY);
>>>> [...]
>>>>
>>>> And test ->
>>> hm, I wonder what AIO is doing differently - cancel_dirty_page() isn't
>>> specific to aio - it's used by all truncations.
>> Thanks for your answer. I wonder too, but if I disable AIO - interface
>> in mysqld, the error disappearsreproducible.
>>> Just to sanity check, could you please try something like this?
>>>
>>> --- a/include/linux/vmstat.h~a
>>> +++ a/include/linux/vmstat.h
>>> @@ -241,6 +241,8 @@ static inline void __inc_zone_state(stru
>>>  static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item
>>> item)
>>>  {
>>>      atomic_long_dec(&zone->vm_stat[item]);
>>> +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
>>> +        atomic_long_read(&zone->vm_stat[item]) < 0);
>>>      atomic_long_dec(&vm_stat[item]);
>>>  }
>> We do not use an SMP kernel, since we use only a single core CPU. But
>> your proposed change is for the #else - branch and not for #ifdef
>> SMP_CONFIG. So it fits, my modification:
>>
>> --- linux-3.14.1.org/include/linux/vmstat.h    2014-04-14
>> 15:50:10.000000000 +0200
>> +++ linux-3.14.1/include/linux/vmstat.h    2014-10-28 10:28:40.385327906
>> +0100
>> @@ -239,6 +239,7 @@
>>  static inline void __dec_zone_state(struct zone *zone, enum
>> zone_stat_item item)
>>  {
>>      atomic_long_dec(&zone->vm_stat[item]);
>> +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
>> atomic_long_read(&zone->vm_stat[item]) < 0);
>>      atomic_long_dec(&vm_stat[item]);
>>  }
>>  
>> Here are the results:
>>
>> [   21.341632] ------------[ cut here ]------------
>> [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
>> cancel_dirty_page+0x164/0x224()
>> [   21.355296] Modules linked in: wutbox_cp sata_mv
>> [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
>> [   21.366793] Workqueue: events free_ioctx
>> [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
>> (show_stack+0x20/0x24)
>> [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
>> (dump_stack+0x24/0x28)
>> [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
>> (warn_slowpath_common+0x84/0x9c)
>> [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
>> (warn_slowpath_null+0x2c/0x34)
>> [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
>> (cancel_dirty_page+0x164/0x224)
>> [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
>> (truncate_inode_page+0x8c/0x158)
>> [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
>> (truncate_inode_pages_range+0x11c/0x53c)
>> [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
>> [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
>> [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
>> (truncate_setsize+0x5c/0x74)
>> [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
>> (put_aio_ring_file.isra.14+0x34/0x90)
>> [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
>> [<c013b424>] (aio_free_ring+0x20/0xcc)
>> [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
>> (free_ioctx+0x24/0x44)
>> [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
>> (process_one_work+0x134/0x47c)
>> [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
>> (worker_thread+0x130/0x414)
>> [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
>> (kthread+0xd4/0xec)
>> [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
>> (ret_from_fork+0x14/0x20)
>> [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
>>
>>>
>>> That should catch the first offending decrement, although yes, it's
>>> probably cancel_dirty_page().
>>>
>>> (This assumes you're using an SMP kernel.  If not,
>>> mm/vmstat.c:dec_zone_page_state() will need to be changed instead)
>>>
>> -- Unsere Aussagen koennen Irrtuemer und Missverstaendnisse enthalten.
>> Bitte pruefen Sie die Aussagen fuer Ihren Fall, bevor Sie Entscheidungen 
>> auf Grundlage dieser Aussagen treffen.
>> Wiesemann & Theis GmbH, Porschestr. 12, D-42279 Wuppertal
>> Geschaeftsfuehrer: Dipl.-Ing. Ruediger Theis
>> Registergericht: Amtsgericht Wuppertal, HRB 6377 
>> Tel. +49-202/2680-0, Fax +49-202/2680-265, http://www.wut.de
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-aio' in
>> the body to majordomo@kvack.org.  For more info on Linux AIO,
>> see: http://www.kvack.org/aio/
>> Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
>> .
>>
>

-- Unsere Aussagen koennen Irrtuemer und Missverstaendnisse enthalten.
Bitte pruefen Sie die Aussagen fuer Ihren Fall, bevor Sie Entscheidungen
auf Grundlage dieser Aussagen treffen.
Wiesemann & Theis GmbH, Porschestr. 12, D-42279 Wuppertal
Geschaeftsfuehrer: Dipl.-Ing. Ruediger Theis
Registergericht: Amtsgericht Wuppertal, HRB 6377
Tel. +49-202/2680-0, Fax +49-202/2680-265, http://www.wut.de
Comment 5 Andrew Morton 2014-10-29 18:08:10 UTC
On Wed, 29 Oct 2014 16:46:55 +0800 Gu Zheng <guz.fnst@cn.fujitsu.com> wrote:

> +cc Ben
> 
> Hi Markus,
> It seems that truncating aio ring file contributes to accounting dirty pages,
> buf aio ring file is based on aiofs (an anonymous fs), which should not do
> this.
> I think the cause is that aio fs uses *default_backing_dev_info* as the
> backing dev, which does not disable the dirty pages accounting capability.
> Could you please try the following patch?
> It introduces an aio private backing dev info which disabled the ACCT_DIRTY/
> WRITEBACK/ACCT_WB capabilities.
> 

Why does aio_ctx_aops implement a null .set_page_dirty?

This is not explained in the code or in the 36bc08cc changelog.  Please
add a code comment to aio_set_page_dirty() explaining the reason?
Comment 6 Gu Zheng 2014-10-30 06:20:19 UTC
Hi Markus,
On 10/29/2014 11:26 PM, Markus Königshaus wrote:

> Hi Gu,
> 
> Many thanks for your patch. It works! Crossceckingwithout your patch
> leads back to the error.
> Is your patch only preliminary or can I apply this change permanently
> for my system?

You can use this patch temporarily. I'll send a formal one to the community
soon, if on objection, it should be applied to stable too. And you can
switch to the new stable version then.

Thanks,
Gu

> 
> Best regards, Markus
> Am 29.10.2014 um 09:46 schrieb Gu Zheng:
>> +cc Ben
>>
>> Hi Markus,
>> It seems that truncating aio ring file contributes to accounting dirty
>> pages,
>> buf aio ring file is based on aiofs (an anonymous fs), which should not do
>> this.
>> I think the cause is that aio fs uses *default_backing_dev_info* as the
>> backing dev, which does not disable the dirty pages accounting capability.
>> Could you please try the following patch?
>> It introduces an aio private backing dev info which disabled the ACCT_DIRTY/
>> WRITEBACK/ACCT_WB capabilities.
>>
>> Thanks,
>> Gu
>>
>> ---
>> diff --git a/fs/aio.c b/fs/aio.c
>> index 84a7510..eb6c22f 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -165,6 +165,12 @@ static struct vfsmount *aio_mnt;
>>  static const struct file_operations aio_ring_fops;
>>  static const struct address_space_operations aio_ctx_aops;
>>  
>> +static struct backing_dev_info aio_fs_backing_dev_info = {
>> +       .name           = "aiofs",
>> +       .state          = 0,
>> +       .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
>> +};
>> +
>>  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
>>  {
>>         struct qstr this = QSTR_INIT("[aio]", 5);
>> @@ -176,6 +182,7 @@ static struct file *aio_private_file(struct kioctx *ctx,
>> loff_t nr_pages)
>>  
>>         inode->i_mapping->a_ops = &aio_ctx_aops;
>>         inode->i_mapping->private_data = ctx;
>> +       inode->i_mapping->backing_dev_info = &aio_fs_backing_dev_info;
>>         inode->i_size = PAGE_SIZE * nr_pages;
>>  
>>         path.dentry = d_alloc_pseudo(aio_mnt->mnt_sb, &this);
>> @@ -220,6 +227,9 @@ static int __init aio_setup(void)
>>         if (IS_ERR(aio_mnt))
>>                 panic("Failed to create aio fs mount.");
>>  
>> +       if (bdi_init(&aio_fs_backing_dev_info))
>> +               panic("Failed to init aio fs backing dev info.");
>> +
>>         kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
>>         kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);
>>
>>
>> On 10/28/2014 05:57 PM, Markus Königshaus wrote:
>>
>>> Am 28.10.2014 um 00:46 schrieb bugzilla-daemon@bugzilla.kernel.org:
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=86831
>>>>
>>>> --- Comment #1 from Andrew Morton <akpm@linux-foundation.org> ---
>>>> (switched to email.  Please respond via emailed reply-to-all, not via the
>>>> bugzilla web interface).
>>>>
>>>> On Fri, 24 Oct 2014 15:33:02 +0000 bugzilla-daemon@bugzilla.kernel.org
>>>> wrote:
>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=86831
>>>>>
>>>>>             Bug ID: 86831
>>>>>            Summary: wrong count of dirty pages when using AIO
>>>>>            Product: Memory Management
>>>>>            Version: 2.5
>>>>>     Kernel Version: 3.14.21
>>>>>           Hardware: All
>>>>>                 OS: Linux
>>>>>               Tree: Mainline
>>>>>             Status: NEW
>>>>>           Severity: normal
>>>>>           Priority: P1
>>>>>          Component: Other
>>>>>           Assignee: akpm@linux-foundation.org
>>>>>           Reporter: m.koenigshaus@wut.de
>>>>>         Regression: No
>>>>>
>>>>> Hello,
>>>>>
>>>>> we use a ARM custom Board with mysqld. Shuting down mysqld (with IAO
>>>>> support,
>>>>> on a ext3 formatted Harddrive) leads to a negative number of dirty pages
>>>>> (underrun to the counter). The negative number results in a drastic
>>>>> reduction
>>>>> of the write performance because the page cache is not used, because the
>>>>> kernel
>>>>> thinks it is still 2 ^ 32 dirty pages open. I found, the problem is
>>>>> mm/truncate.c->cancel_dirty_page()
>>>>>
>>>>> To reproduce,first change cancel_dirty_page()
>>>>>
>>>>> [...]
>>>>> if (mapping && mapping_cap_account_dirty (mapping)) {
>>>>> ++WARN_ON ((int) global_page_state(NR_FILE_DIRTY) <0);
>>>>> dec_zone_page_state (page, NR_FILE_DIRTY);
>>>>> [...]
>>>>>
>>>>> And test ->
>>>> hm, I wonder what AIO is doing differently - cancel_dirty_page() isn't
>>>> specific to aio - it's used by all truncations.
>>> Thanks for your answer. I wonder too, but if I disable AIO - interface
>>> in mysqld, the error disappearsreproducible.
>>>> Just to sanity check, could you please try something like this?
>>>>
>>>> --- a/include/linux/vmstat.h~a
>>>> +++ a/include/linux/vmstat.h
>>>> @@ -241,6 +241,8 @@ static inline void __inc_zone_state(stru
>>>>  static inline void __dec_zone_state(struct zone *zone, enum
>>>>  zone_stat_item
>>>> item)
>>>>  {
>>>>      atomic_long_dec(&zone->vm_stat[item]);
>>>> +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
>>>> +        atomic_long_read(&zone->vm_stat[item]) < 0);
>>>>      atomic_long_dec(&vm_stat[item]);
>>>>  }
>>> We do not use an SMP kernel, since we use only a single core CPU. But
>>> your proposed change is for the #else - branch and not for #ifdef
>>> SMP_CONFIG. So it fits, my modification:
>>>
>>> --- linux-3.14.1.org/include/linux/vmstat.h    2014-04-14
>>> 15:50:10.000000000 +0200
>>> +++ linux-3.14.1/include/linux/vmstat.h    2014-10-28 10:28:40.385327906
>>> +0100
>>> @@ -239,6 +239,7 @@
>>>  static inline void __dec_zone_state(struct zone *zone, enum
>>> zone_stat_item item)
>>>  {
>>>      atomic_long_dec(&zone->vm_stat[item]);
>>> +    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
>>> atomic_long_read(&zone->vm_stat[item]) < 0);
>>>      atomic_long_dec(&vm_stat[item]);
>>>  }
>>>  
>>> Here are the results:
>>>
>>> [   21.341632] ------------[ cut here ]------------
>>> [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
>>> cancel_dirty_page+0x164/0x224()
>>> [   21.355296] Modules linked in: wutbox_cp sata_mv
>>> [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT
>>> #80
>>> [   21.366793] Workqueue: events free_ioctx
>>> [   21.370760] [<c0016a64>] (unwind_backtrace) from [<c0012f88>]
>>> (show_stack+0x20/0x24)
>>> [   21.378562] [<c0012f88>] (show_stack) from [<c03f8ccc>]
>>> (dump_stack+0x24/0x28)
>>> [   21.385840] [<c03f8ccc>] (dump_stack) from [<c0023ae4>]
>>> (warn_slowpath_common+0x84/0x9c)
>>> [   21.393976] [<c0023ae4>] (warn_slowpath_common) from [<c0023bb8>]
>>> (warn_slowpath_null+0x2c/0x34)
>>> [   21.402800] [<c0023bb8>] (warn_slowpath_null) from [<c00c0688>]
>>> (cancel_dirty_page+0x164/0x224)
>>> [   21.411524] [<c00c0688>] (cancel_dirty_page) from [<c00c080c>]
>>> (truncate_inode_page+0x8c/0x158)
>>> [   21.420272] [<c00c080c>] (truncate_inode_page) from [<c00c0a94>]
>>> (truncate_inode_pages_range+0x11c/0x53c)
>>> [   21.429890] [<c00c0a94>] (truncate_inode_pages_range) from
>>> [<c00c0f6c>] (truncate_pagecache+0x88/0xac)
>>> [   21.439252] [<c00c0f6c>] (truncate_pagecache) from [<c00c0fec>]
>>> (truncate_setsize+0x5c/0x74)
>>> [   21.447731] [<c00c0fec>] (truncate_setsize) from [<c013b3a8>]
>>> (put_aio_ring_file.isra.14+0x34/0x90)
>>> [   21.456826] [<c013b3a8>] (put_aio_ring_file.isra.14) from
>>> [<c013b424>] (aio_free_ring+0x20/0xcc)
>>> [   21.465660] [<c013b424>] (aio_free_ring) from [<c013b4f4>]
>>> (free_ioctx+0x24/0x44)
>>> [   21.473190] [<c013b4f4>] (free_ioctx) from [<c003d8d8>]
>>> (process_one_work+0x134/0x47c)
>>> [   21.481132] [<c003d8d8>] (process_one_work) from [<c003e988>]
>>> (worker_thread+0x130/0x414)
>>> [   21.489350] [<c003e988>] (worker_thread) from [<c00448ac>]
>>> (kthread+0xd4/0xec)
>>> [   21.496621] [<c00448ac>] (kthread) from [<c000ec18>]
>>> (ret_from_fork+0x14/0x20)
>>> [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
>>>
>>>>
>>>> That should catch the first offending decrement, although yes, it's
>>>> probably cancel_dirty_page().
>>>>
>>>> (This assumes you're using an SMP kernel.  If not,
>>>> mm/vmstat.c:dec_zone_page_state() will need to be changed instead)
>>>>
>>> -- Unsere Aussagen koennen Irrtuemer und Missverstaendnisse enthalten.
>>> Bitte pruefen Sie die Aussagen fuer Ihren Fall, bevor Sie Entscheidungen 
>>> auf Grundlage dieser Aussagen treffen.
>>> Wiesemann & Theis GmbH, Porschestr. 12, D-42279 Wuppertal
>>> Geschaeftsfuehrer: Dipl.-Ing. Ruediger Theis
>>> Registergericht: Amtsgericht Wuppertal, HRB 6377 
>>> Tel. +49-202/2680-0, Fax +49-202/2680-265, http://www.wut.de
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-aio' in
>>> the body to majordomo@kvack.org.  For more info on Linux AIO,
>>> see: http://www.kvack.org/aio/
>>> Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
>>> .
>>>
>>
> 
> -- Unsere Aussagen koennen Irrtuemer und Missverstaendnisse enthalten.
> Bitte pruefen Sie die Aussagen fuer Ihren Fall, bevor Sie Entscheidungen 
> auf Grundlage dieser Aussagen treffen.
> Wiesemann & Theis GmbH, Porschestr. 12, D-42279 Wuppertal
> Geschaeftsfuehrer: Dipl.-Ing. Ruediger Theis
> Registergericht: Amtsgericht Wuppertal, HRB 6377 
> Tel. +49-202/2680-0, Fax +49-202/2680-265, http://www.wut.de
> .
>
Comment 7 Gu Zheng 2014-10-30 06:32:02 UTC
On 10/30/2014 02:08 AM, Andrew Morton wrote:

> On Wed, 29 Oct 2014 16:46:55 +0800 Gu Zheng <guz.fnst@cn.fujitsu.com> wrote:
> 
>> +cc Ben
>>
>> Hi Markus,
>> It seems that truncating aio ring file contributes to accounting dirty
>> pages,
>> buf aio ring file is based on aiofs (an anonymous fs), which should not do
>> this.
>> I think the cause is that aio fs uses *default_backing_dev_info* as the
>> backing dev, which does not disable the dirty pages accounting capability.
>> Could you please try the following patch?
>> It introduces an aio private backing dev info which disabled the ACCT_DIRTY/
>> WRITEBACK/ACCT_WB capabilities.
>>
> 
> Why does aio_ctx_aops implement a null .set_page_dirty?

It's just an anon nop method for people use .page_mkwrite on aio ring file.
Because we marked aio ringbuffer pages as dirty when setting up aio context,
so nothing is needed here.   

> 
> This is not explained in the code or in the 36bc08cc changelog.  Please
> add a code comment to aio_set_page_dirty() explaining the reason?

Yes, comment is necessary here.

> 
> .
>
Comment 8 Gu Zheng 2014-10-30 06:48:10 UTC
Hi Andrew,
On 10/30/2014 02:08 AM, Andrew Morton wrote:

> On Wed, 29 Oct 2014 16:46:55 +0800 Gu Zheng <guz.fnst@cn.fujitsu.com> wrote:
> 
>> +cc Ben
>>
>> Hi Markus,
>> It seems that truncating aio ring file contributes to accounting dirty
>> pages,
>> buf aio ring file is based on aiofs (an anonymous fs), which should not do
>> this.
>> I think the cause is that aio fs uses *default_backing_dev_info* as the
>> backing dev, which does not disable the dirty pages accounting capability.
>> Could you please try the following patch?
>> It introduces an aio private backing dev info which disabled the ACCT_DIRTY/
>> WRITEBACK/ACCT_WB capabilities.
>>
> 
> Why does aio_ctx_aops implement a null .set_page_dirty?

It's just an anon nop method for people use .page_mkwrite on aio ring file.
Because we marked aio ringbuffer pages as dirty when setting up aio context,
so nothing is needed here.   

> 
> This is not explained in the code or in the 36bc08cc changelog.  Please
> add a code comment to aio_set_page_dirty() explaining the reason?

Yes, comment is necessary here.

Regards,
Gu

> 
> .
>
Comment 9 Markus Königshaus 2014-10-30 11:45:01 UTC
Am 30.10.2014 um 07:05 schrieb Gu Zheng:
> Hi Markus,
> On 10/29/2014 11:26 PM, Markus Königshaus wrote:
>
>> Hi Gu,
>>
>> Many thanks for your patch. It works! Crossceckingwithout your patch
>> leads back to the error.
>> Is your patch only preliminary or can I apply this change permanently
>> for my system?
> You can use this patch temporarily. I'll send a formal one to the community
> soon, if on objection, it should be applied to stable too. And you can
> switch to the new stable version then.
>
> Thanks,
> Gu
Fine, I will do so, thanks again.
I'm not sure, should I close the bug now?

Markus

-- Unsere Aussagen koennen Irrtuemer und Missverstaendnisse enthalten.
Bitte pruefen Sie die Aussagen fuer Ihren Fall, bevor Sie Entscheidungen
auf Grundlage dieser Aussagen treffen.
Wiesemann & Theis GmbH, Porschestr. 12, D-42279 Wuppertal
Geschaeftsfuehrer: Dipl.-Ing. Ruediger Theis
Registergericht: Amtsgericht Wuppertal, HRB 6377
Tel. +49-202/2680-0, Fax +49-202/2680-265, http://www.wut.de
Comment 10 Andrew Morton 2014-10-30 15:49:22 UTC
On Thu, 30 Oct 2014 14:16:36 +0800 Gu Zheng <guz.fnst@cn.fujitsu.com> wrote:

> On 10/30/2014 02:08 AM, Andrew Morton wrote:
> 
> > On Wed, 29 Oct 2014 16:46:55 +0800 Gu Zheng <guz.fnst@cn.fujitsu.com>
> wrote:
> > 
> >> +cc Ben
> >>
> >> Hi Markus,
> >> It seems that truncating aio ring file contributes to accounting dirty
> pages,
> >> buf aio ring file is based on aiofs (an anonymous fs), which should not do
> >> this.
> >> I think the cause is that aio fs uses *default_backing_dev_info* as the
> >> backing dev, which does not disable the dirty pages accounting capability.
> >> Could you please try the following patch?
> >> It introduces an aio private backing dev info which disabled the
> ACCT_DIRTY/
> >> WRITEBACK/ACCT_WB capabilities.
> >>
> > 
> > Why does aio_ctx_aops implement a null .set_page_dirty?
> 
> It's just an anon nop method for people use .page_mkwrite on aio ring file.
> Because we marked aio ringbuffer pages as dirty when setting up aio context,
> so nothing is needed here.   

Well yes, but there's no comment there either :( Why are these pages
marked dirty at init time?
Comment 11 Andrew Morton 2014-10-30 16:10:16 UTC
On Thu, 30 Oct 2014 11:50:43 -0400 Benjamin LaHaise <bcrl@kvack.org> wrote:

> On Thu, Oct 30, 2014 at 08:49:40AM -0700, Andrew Morton wrote:
> > On Thu, 30 Oct 2014 14:16:36 +0800 Gu Zheng <guz.fnst@cn.fujitsu.com>
> wrote:
> > 
> > > On 10/30/2014 02:08 AM, Andrew Morton wrote:
> > > 
> > > > On Wed, 29 Oct 2014 16:46:55 +0800 Gu Zheng <guz.fnst@cn.fujitsu.com>
> wrote:
> > > > 
> > > >> +cc Ben
> > > >>
> > > >> Hi Markus,
> > > >> It seems that truncating aio ring file contributes to accounting dirty
> pages,
> > > >> buf aio ring file is based on aiofs (an anonymous fs), which should
> not do
> > > >> this.
> > > >> I think the cause is that aio fs uses *default_backing_dev_info* as
> the
> > > >> backing dev, which does not disable the dirty pages accounting
> capability.
> > > >> Could you please try the following patch?
> > > >> It introduces an aio private backing dev info which disabled the
> ACCT_DIRTY/
> > > >> WRITEBACK/ACCT_WB capabilities.
> > > >>
> > > > 
> > > > Why does aio_ctx_aops implement a null .set_page_dirty?
> > > 
> > > It's just an anon nop method for people use .page_mkwrite on aio ring
> file.
> > > Because we marked aio ringbuffer pages as dirty when setting up aio
> context,
> > > so nothing is needed here.   
> > 
> > Well yes, but there's no comment there either :( Why are these pages
> > marked dirty at init time?
> 
> Because they're always dirty.  They will never be clean -- they're pinned 
> pages.

Well, running an open-coded SetPageDirty() against an
address_space-backed page is what caused this bug, because
SetPageDirty() bypasses VFS dirty page accounting.  It seems a peculiar
thing to do and it's still unclear to me why it's done.

What goes wrong if we just leave this alone?  Don't manually set the
dirty flags, don't disable set_page_dirty(), rely on default behaviour?
Comment 12 Gu Zheng 2014-10-31 02:12:56 UTC
On 10/30/2014 07:44 PM, Markus Königshaus wrote:

> 
> Am 30.10.2014 um 07:05 schrieb Gu Zheng:
>> Hi Markus,
>> On 10/29/2014 11:26 PM, Markus Königshaus wrote:
>>
>>> Hi Gu,
>>>
>>> Many thanks for your patch. It works! Crossceckingwithout your patch
>>> leads back to the error.
>>> Is your patch only preliminary or can I apply this change permanently
>>> for my system?
>> You can use this patch temporarily. I'll send a formal one to the community
>> soon, if on objection, it should be applied to stable too. And you can
>> switch to the new stable version then.
>>
>> Thanks,
>> Gu
> Fine, I will do so, thanks again.
> I'm not sure, should I close the bug now?

Hold on for a moment, something is still unclear.
Please focus on this thread.

Thanks,
Gu

> 
> Markus
> 
> -- Unsere Aussagen koennen Irrtuemer und Missverstaendnisse enthalten.
> Bitte pruefen Sie die Aussagen fuer Ihren Fall, bevor Sie Entscheidungen 
> auf Grundlage dieser Aussagen treffen.
> Wiesemann & Theis GmbH, Porschestr. 12, D-42279 Wuppertal
> Geschaeftsfuehrer: Dipl.-Ing. Ruediger Theis
> Registergericht: Amtsgericht Wuppertal, HRB 6377 
> Tel. +49-202/2680-0, Fax +49-202/2680-265, http://www.wut.de
> .
>
Comment 13 Gu Zheng 2014-10-31 03:20:54 UTC
Hi Andrew,
On 10/31/2014 12:10 AM, Andrew Morton wrote:

> On Thu, 30 Oct 2014 11:50:43 -0400 Benjamin LaHaise <bcrl@kvack.org> wrote:
> 
>> On Thu, Oct 30, 2014 at 08:49:40AM -0700, Andrew Morton wrote:
>>> On Thu, 30 Oct 2014 14:16:36 +0800 Gu Zheng <guz.fnst@cn.fujitsu.com>
>>> wrote:
>>>
>>>> On 10/30/2014 02:08 AM, Andrew Morton wrote:
>>>>
>>>>> On Wed, 29 Oct 2014 16:46:55 +0800 Gu Zheng <guz.fnst@cn.fujitsu.com>
>>>>> wrote:
>>>>>
>>>>>> +cc Ben
>>>>>>
>>>>>> Hi Markus,
>>>>>> It seems that truncating aio ring file contributes to accounting dirty
>>>>>> pages,
>>>>>> buf aio ring file is based on aiofs (an anonymous fs), which should not
>>>>>> do
>>>>>> this.
>>>>>> I think the cause is that aio fs uses *default_backing_dev_info* as the
>>>>>> backing dev, which does not disable the dirty pages accounting
>>>>>> capability.
>>>>>> Could you please try the following patch?
>>>>>> It introduces an aio private backing dev info which disabled the
>>>>>> ACCT_DIRTY/
>>>>>> WRITEBACK/ACCT_WB capabilities.
>>>>>>
>>>>>
>>>>> Why does aio_ctx_aops implement a null .set_page_dirty?
>>>>
>>>> It's just an anon nop method for people use .page_mkwrite on aio ring
>>>> file.
>>>> Because we marked aio ringbuffer pages as dirty when setting up aio
>>>> context,
>>>> so nothing is needed here.   
>>>
>>> Well yes, but there's no comment there either :( Why are these pages
>>> marked dirty at init time?
>>
>> Because they're always dirty.  They will never be clean -- they're pinned 
>> pages.
> 
> Well, running an open-coded SetPageDirty() against an
> address_space-backed page is what caused this bug, because
> SetPageDirty() bypasses VFS dirty page accounting.  It seems a peculiar
> thing to do and it's still unclear to me why it's done.

The original goal is keeping these pages in memory (can not be reclaimed
or swapped) in life-time via marking it dirty.
But as you said it bypasses VFS dirty page accounting and leads to the bug.
When thinking more, it seems that pinned pages can already achieve the goal,
so the SetPageDirty is unnecessary.

> 
> What goes wrong if we just leave this alone?  Don't manually set the
> dirty flags, don't disable set_page_dirty(), rely on default behaviour?

Seems unaffected. But we still need to disable the ACCT_DIRTY/WRITEBACK/ACCT_WB
capabilities, because it is an anon fs, write-back makes no sense to it.
And use the __set_page_dirty_no_writeback instead of the nop one.

If I missed something, please correct me.

Regards,
Gu

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majordomo@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
>
Comment 14 JPT 2014-11-13 16:07:06 UTC
Please don't be angry if a total noob suggests: 
What about adding a check for the dirty page counter being >= 0?
Whenever again a bug like this appears, you can write an error-message instead of slowing down the whole system to 0.1% of it's speed. 

I applied the patch to my kernel, built and it seems to work now. 
top shows usage or idle instead of 99% waiting.

btw: the problem occured while doing a debian downgrade from testing to stable. 
a few tasks took almost forever (waited over night until I killed it)

it was always apt-show-version which hung in balance_dirty_pages_ratelimited

iostat showed something like: 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             152.11        16.53       633.37    1397910   53549872
Comment 15 Markus Königshaus 2014-11-13 16:29:03 UTC
That would be a workaround. Gu's patch is working properly. An examination of whether the counter is greater than 0 would only mask the problem. Without such testing, the error will affect directly, it will be found quickly.

Markus
Comment 16 Markus Königshaus 2014-12-12 16:13:48 UTC
Bug fixes with commit 835f252c6debd204fcd607c79975089b1ecd3472
Thanks all, Markus

Note You need to log in before you can comment on or make changes to this bug.