Bug 31142

Summary: Large write to USB stick freezes unrelated tasks for a long time
Product: IO/Storage Reporter: Alex Villacis Lasso (avillaci)
Component: Block Layer    Assignee: Jens Axboe (axboe)
Status: CLOSED CODE_FIX    
Severity: normal CC: alan, de.techno, florian, maciej.rutecki, makovick
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38-rc8 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel backtraces from hung Eclipse task while writing to usb stick
sysrq-w trace with hung thunderbird and gedit
sysrq-w trace with patch applied
sysrq-w trace with patch from comment #21 applied
sysrq-w trace for aa.git kernel
second sysrq-w trace for aa.git kernel
sysrq-w trace for 2.6.39-rc1 kernel

Description Alex Villacis Lasso 2011-03-15 15:55:49 UTC
Created attachment 50902 [details]
kernel backtraces from hung Eclipse task while writing to usb stick

System is Fedora 14 x86_64 with 4 GB RAM, running vanilla kernel 2.6.38-rc8.

I have a USB 2.0 high-speed memory stick with around 7.5 GB of space. Whenever I write a large amount of data (several GB of files) through any means (cp, the Nautilus GUI, etc.), I notice that some large applications I consider unrelated to the I/O operation (the Firefox web browser, the Thunderbird email client, the Eclipse IDE) may randomly freeze whenever I try to interact with them. I use Compiz, and I notice the apps getting grayed out, but I have also seen the freeze happen with Metacity and GNOME Shell, so I believe the window manager is irrelevant. Sometimes other smaller tasks (gnome-terminal, gedit) also freeze. For Eclipse, the hang also causes a series of kernel backtraces, attached to this report. The hang usually lasts for several tens of seconds, and an application may freeze and unfreeze several times while the file copy to USB takes place. All of the hung applications unfreeze after write activity (as seen from the LED on the memory stick) ceases.

Reproducibility: always (with sufficiently large bulk write)
To reproduce: 
1) have a USB stick with several GB of free space, with any filesystem (tried vfat and udf)
2) prepare several GB of files to copy from hard disk to USB stick
3) start a large application (firefox, eclipse, or thunderbird)
4) check that the application is responsive before the file copy starts
5) insert the USB stick and (auto)mount it. Previously started app is still responsive.
6) start the file copy to the USB stick with any command
7) attempt to interact with the chosen application during the entirety of the file write
Expected result: I/O to the USB stick takes place in the background; unrelated apps continue to be responsive in the foreground.
Actual result: some large tasks freeze for tens of seconds while the write takes place.

Feel free to reassign this bug to a different category. It involves I/O, block, USB, and mmap.
Comment 1 Alex Villacis Lasso 2011-03-15 16:45:02 UTC
I must note that no such freeze happens when I *read* files from USB to the hard disk, no matter how large the transfer.
Comment 2 Andrew Morton 2011-03-15 20:54:09 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Tue, 15 Mar 2011 15:55:52 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=31142
> 
>            Summary: Large write to USB stick freezes unrelated tasks for a
>                     long time
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 2.6.38-rc8
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Block Layer
>         AssignedTo: axboe@kernel.dk
>         ReportedBy: avillaci@ceibo.fiec.espol.edu.ec
>         Regression: No
> 
> 
> [...]

rofl, will we ever fix this.

Please enable sysrq and do a sysrq-w when the tasks are blocked so we
can find where things are getting stuck.  Please avoid email client
wordwrapping when sending us the sysrq output.
Comment 3 Anonymous Emailer 2011-03-15 22:54:38 UTC
Reply-To: avillaci@fiec.espol.edu.ec

El 15/03/11 15:53, Andrew Morton escribió:
>
> rofl, will we ever fix this.
Does this mean there is already a duplicate of this issue? If so, which one?
> Please enable sysrq and do a sysrq-w when the tasks are blocked so we
> can find where things are getting stuck.  Please avoid email client
> wordwrapping when sending us the sysrq output.
>
Comment 4 Andrew Morton 2011-03-15 23:20:25 UTC
On Tue, 15 Mar 2011 17:53:16 -0500
Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:

> El 15/03/11 15:53, Andrew Morton escribió:
> >
> > rofl, will we ever fix this.
> Does this mean there is already a duplicate of this issue? If so, which one?

Nothing specific.  Nonsense like this has been happening for at least a
decade and it never seems to get a lot better.

> > Please enable sysrq and do a sysrq-w when the tasks are blocked so we
> > can find where things are getting stuck.  Please avoid email client
> > wordwrapping when sending us the sysrq output.
> >
Comment 5 Alex Villacis Lasso 2011-03-16 15:24:40 UTC
Created attachment 50952 [details]
sysrq-w trace with hung thunderbird and gedit

This report was generated with 2.6.38. I started copying about 3 GB of data to my usb stick (with nautilus), and after a while, thunderbird froze. I did the sysrq-w action, but when pasting the report into gedit and trying to save, gedit also froze. This report shows the second sysrq report. Does this help?
Comment 6 Anonymous Emailer 2011-03-16 15:26:36 UTC
Reply-To: avillaci@fiec.espol.edu.ec

El 15/03/11 18:19, Andrew Morton escribió:
> [...]
Posted sysrq-w report into original bug report to avoid email word-wrap.
Comment 7 Andrew Morton 2011-03-16 22:03:13 UTC
On Wed, 16 Mar 2011 10:25:16 -0500
Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:

> El 15/03/11 18:19, Andrew Morton escribió:
> > [...]
> Posted sysrq-w report into original bug report to avoid email word-wrap.

https://bugzilla.kernel.org/attachment.cgi?id=50952

Interesting bits:

[70874.969550] thunderbird-bin D 000000010434e04e     0 32283  32279 0x00000080
[70874.969553]  ffff88011ba91838 0000000000000086 ffff880100000000 0000000000013880
[70874.969557]  0000000000013880 ffff88010a231780 ffff88011ba91fd8 0000000000013880
[70874.969560]  0000000000013880 0000000000013880 0000000000013880 ffff88011ba91fd8
[70874.969564] Call Trace:
[70874.969567]  [<ffffffff810d7ed3>] ? sync_page+0x0/0x4d
[70874.969569]  [<ffffffff810d7ed3>] ? sync_page+0x0/0x4d
[70874.969572]  [<ffffffff8147cedb>] io_schedule+0x47/0x62
[70874.969575]  [<ffffffff810d7f1c>] sync_page+0x49/0x4d
[70874.969577]  [<ffffffff8147d36a>] __wait_on_bit+0x48/0x7b
[70874.969580]  [<ffffffff810d8100>] wait_on_page_bit+0x72/0x79
[70874.969583]  [<ffffffff8106cdb4>] ? wake_bit_function+0x0/0x31
[70874.969586]  [<ffffffff8111833c>] migrate_pages+0x1ac/0x38d
[70874.969589]  [<ffffffff8110d6d7>] ? compaction_alloc+0x0/0x2a4
[70874.969592]  [<ffffffff8110ddfd>] compact_zone+0x3f4/0x60e
[70874.969595]  [<ffffffff8110e1dc>] compact_zone_order+0xc2/0xd1
[70874.969599]  [<ffffffff8110e27f>] try_to_compact_pages+0x94/0xea
[70874.969602]  [<ffffffff810de916>] __alloc_pages_direct_compact+0xa9/0x1a5
[70874.969605]  [<ffffffff810ddbb8>] ? drain_local_pages+0x0/0x17
[70874.969607]  [<ffffffff810df0b0>] __alloc_pages_nodemask+0x69e/0x766
[70874.969610]  [<ffffffff81113701>] ? __slab_free+0x6d/0xf6
[70874.969614]  [<ffffffff8110c0de>] alloc_pages_vma+0xec/0xf1
[70874.969617]  [<ffffffff8111be1c>] do_huge_pmd_anonymous_page+0xbf/0x267
[70874.969620]  [<ffffffff810f2497>] ? pmd_offset+0x19/0x40
[70874.969623]  [<ffffffff810f5c70>] handle_mm_fault+0x15d/0x20f
[70874.969626]  [<ffffffff8100f26a>] ? arch_get_unmapped_area_topdown+0x195/0x28f
[70874.969629]  [<ffffffff8148178c>] do_page_fault+0x33b/0x35d
[70874.969632]  [<ffffffff810fb07d>] ? do_mmap_pgoff+0x29a/0x2f4
[70874.969635]  [<ffffffff8112dcee>] ? path_put+0x22/0x27
[70874.969638]  [<ffffffff8147f145>] page_fault+0x25/0x30


[70874.969731] gedit           D 000000010434dfb0     0 32356      1 0x00000080
[70874.969734]  ffff8800982ab558 0000000000000082 ffff880102400001 0000000000013880
[70874.969737]  0000000000013880 ffff880117408000 ffff8800982abfd8 0000000000013880
[70874.969741]  0000000000013880 0000000000013880 0000000000013880 ffff8800982abfd8
[70874.969744] Call Trace:
[70874.969747]  [<ffffffff8147cedb>] io_schedule+0x47/0x62
[70874.969750]  [<ffffffff8121c403>] get_request_wait+0x10a/0x197
[70874.969753]  [<ffffffff8106cd77>] ? autoremove_wake_function+0x0/0x3d
[70874.969756]  [<ffffffff8121ccc4>] __make_request+0x2c8/0x3e0
[70874.969759]  [<ffffffff8111487d>] ? kmem_cache_alloc+0x73/0xeb
[70874.969762]  [<ffffffff8121bb67>] generic_make_request+0x2bc/0x336
[70874.969765]  [<ffffffff811213a5>] ? lookup_page_cgroup+0x36/0x4c
[70874.969768]  [<ffffffff8121bcc1>] submit_bio+0xe0/0xff
[70874.969770]  [<ffffffff8114d72d>] ? bio_alloc_bioset+0x4d/0xc4
[70874.969773]  [<ffffffff810edf1f>] ? inc_zone_page_state+0x2d/0x2f
[70874.969776]  [<ffffffff81149274>] submit_bh+0xe8/0x10e
[70874.969779]  [<ffffffff8114b9fa>] __block_write_full_page+0x1ea/0x2da
[70874.969782]  [<ffffffffa0689202>] ? udf_get_block+0x0/0x115 [udf]
[70874.969785]  [<ffffffff8114a640>] ? end_buffer_async_write+0x0/0x12d
[70874.969788]  [<ffffffff8114a640>] ? end_buffer_async_write+0x0/0x12d
[70874.969791]  [<ffffffffa0689202>] ? udf_get_block+0x0/0x115 [udf]
[70874.969794]  [<ffffffff8114bb76>] block_write_full_page_endio+0x8c/0x98
[70874.969796]  [<ffffffff8114bb97>] block_write_full_page+0x15/0x17
[70874.969800]  [<ffffffffa0686027>] udf_writepage+0x18/0x1a [udf]
[70874.969803]  [<ffffffff81117fed>] move_to_new_page+0x106/0x195
[70874.969806]  [<ffffffff811183de>] migrate_pages+0x24e/0x38d
[70874.969809]  [<ffffffff8110d6d7>] ? compaction_alloc+0x0/0x2a4
[70874.969812]  [<ffffffff8110ddfd>] compact_zone+0x3f4/0x60e
[70874.969815]  [<ffffffff81049c78>] ? load_balance+0xcb/0x6b0
[70874.969818]  [<ffffffff8110e1dc>] compact_zone_order+0xc2/0xd1
[70874.969821]  [<ffffffff8110e27f>] try_to_compact_pages+0x94/0xea
[70874.969824]  [<ffffffff810de916>] __alloc_pages_direct_compact+0xa9/0x1a5
[70874.969827]  [<ffffffff810dee79>] __alloc_pages_nodemask+0x467/0x766
[70874.969830]  [<ffffffff810fcfd3>] ? anon_vma_alloc+0x1a/0x1c
[70874.969833]  [<ffffffff81048a30>] ? get_parent_ip+0x11/0x41
[70874.969833]  [<ffffffff8110c0de>] alloc_pages_vma+0xec/0xf1
[70874.969833]  [<ffffffff8123430e>] ? rb_insert_color+0x66/0xe1
[70874.969833]  [<ffffffff8111be1c>] do_huge_pmd_anonymous_page+0xbf/0x267
[70874.969833]  [<ffffffff810f2497>] ? pmd_offset+0x19/0x40
[70874.969833]  [<ffffffff810f5c70>] handle_mm_fault+0x15d/0x20f
[70874.969833]  [<ffffffff8100f298>] ? arch_get_unmapped_area_topdown+0x1c3/0x28f
[70874.969833]  [<ffffffff8148178c>] do_page_fault+0x33b/0x35d
[70874.969833]  [<ffffffff810fb07d>] ? do_mmap_pgoff+0x29a/0x2f4
[70874.969833]  [<ffffffff8112dcee>] ? path_put+0x22/0x27
[70874.969833]  [<ffffffff8147f145>] page_fault+0x25/0x30

So it appears that the system is full of dirty pages against a slow
device and your foreground processes have got stuck in direct reclaim
-> compaction -> migration.   That's Mel ;)

What happened to the plans to eliminate direct reclaim?
Comment 8 Anonymous Emailer 2011-03-17 21:29:00 UTC
Reply-To: avillaci@fiec.espol.edu.ec

El 16/03/11 17:02, Andrew Morton escribió:
> [...]
>
> So it appears that the system is full of dirty pages against a slow
> device and your foreground processes have got stuck in direct reclaim
> ->  compaction ->  migration.   That's Mel ;)
>
> What happened to the plans to eliminate direct reclaim?
>
Browsing around bugzilla, I believe that bug 12309 looks very similar to the issue I am experiencing, especially from comment #525 onwards. Am I correct in this?
Comment 9 Andrew Morton 2011-03-17 21:47:58 UTC
On Thu, 17 Mar 2011 16:27:29 -0500
Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:

> > [...]
> Browsing around bugzilla, I believe that bug 12309 looks very similar to the
> issue I am experiencing, especially from comment #525 onwards. Am I correct
> in this?

ah, the epic 12309.  https://bugzilla.kernel.org/show_bug.cgi?id=12309.
If you're ever wondering how much we suck, go read that one.

I think what we're seeing in 31142 is a large amount of dirty data
buffered against a slow device.  Innocent processes enter page reclaim
and end up getting stuck trying to write to that heavily-queued and
slow device.

If so, that's probably what some of the 12309 participants are seeing. 
But there are lots of other things in that report too.


Now, the problem you're seeing in 31142 isn't really supposed to
happen.  In the direct-reclaim case the code will try to avoid
initiation of blocking I/O against a congested device, via the
bdi_write_congested() test in may_write_to_queue().  Although that code
now looks a bit busted for the order>PAGE_ALLOC_COSTLY_ORDER case,
whodidthat.

However in the case of the new(ish) compaction/migration code I don't
think we're performing that test.  migrate_pages()->unmap_and_move()
will get stuck behind that large&slow IO queue if page reclaim decided
to pass it down sync==true, as it apparently has done.

IOW, Mel broke it ;)
Comment 10 Anonymous Emailer 2011-03-17 22:12:31 UTC
Reply-To: avillaci@fiec.espol.edu.ec

El 17/03/11 16:47, Andrew Morton escribió:
> [...]
> IOW, Mel broke it ;)
>
I don't quite follow. In my case, the congested device is the USB stick, but the affected processes should be reading/writing on the hard disk. What kind of queue implementation results in pending writes to the USB stick interfering with I/O to the hard disk? Or am I misunderstanding? I had the (possibly incorrect) impression that each block device had its own I/O queue.
Comment 11 Andrew Morton 2011-03-17 22:27:06 UTC
On Thu, 17 Mar 2011 17:11:05 -0500
Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:

> El 17/03/11 16:47, Andrew Morton escribió:
> > [...]
> I don't quite follow. In my case, the congested device is the USB stick, but
> the affected processes should be reading/writing on the hard disk. What kind
> of queue(s) implementation results in pending writes to the USB stick
> interfering with I/O to the hard 
> disk? Or am I misunderstanding? I had the (possibly incorrect) impression
> that each block device had its own I/O queue.

Your web browser is just trying to allocate some memory.  As part of
that operation it entered the kernel's page reclaim and while scanning
for memory to free, page reclaim encountered a page which was queued
for IO.  Then page reclaim waited for the IO to complete against that
page.  So the browser got stuck.

Page reclaim normally tries to avoid this situation by not waiting on
such pages, unless the calling process was itself involved in writing
to the page's device (stored in current->backing_dev_info).  But afaict
the new compaction/migration code forgot to do this.
Comment 12 Anonymous Emailer 2011-03-18 11:13:35 UTC
Reply-To: mel@csn.ul.ie

On Thu, Mar 17, 2011 at 02:47:27PM -0700, Andrew Morton wrote:
> On Thu, 17 Mar 2011 16:27:29 -0500
> Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:
> 
> > [...]
> 
> ah, the epic 12309.  https://bugzilla.kernel.org/show_bug.cgi?id=12309.
> If you're ever wondering how much we suck, go read that one.
> 

I'm reasonably sure that over the last few series we've taken a number of
steps to mitigate the problems described in #12309, although it's been a while
since I double-checked. When I last stopped looking at it, we had reached
the stage where the number of dirty pages encountered by writeback was greatly
reduced, which should have affected the stalls reported in that bug. I stopped
working on it further to see how the IO-less dirty balancing being worked on
by Wu and Jan worked out, because the next reasonable step was making sure the
flusher threads were behaving as expected. That is still a work in progress.

> [...]
> IOW, Mel broke it ;)
> 

\o/ ... no wait, it's the other one - :(

If you look at the stack traces, though, all of them had called
do_huge_pmd_anonymous_page(), so while it looks similar to 12309, the trigger
is new: it's THP triggering compaction that is causing the stalls,
rather than page reclaim doing direct writeback, which was the culprit in
the past.

To confirm whether this is the case, I'd be very interested in hearing whether this
problem persists in the following cases:

1. 2.6.38-rc8 with defrag disabled by
   echo never >/sys/kernel/mm/transparent_hugepage/defrag
   (this will stop THP allocations calling into compaction)
2. 2.6.38-rc8 with THP disabled by
   echo never > /sys/kernel/mm/transparent_hugepage/enabled
   (if the problem still persists, then page reclaim is still a problem
    but we should still stop THP doing sync writes)
3. 2.6.37 vanilla
   (in case this is a new regression introduced since then)

Migration can do sync writes on dirty pages, which is why it looks so similar
to page reclaim, but this can be controlled by the value of sync_migration
passed into try_to_compact_pages(). If we find that option 1 above makes
the regression go away, or at least helps a lot, then a reasonable fix may
be to never set sync_migration if __GFP_NO_KSWAPD, which is always set for
THP allocations. I've added Andrea to the cc to see what he thinks.

Thanks for the report.
Comment 13 Anonymous Emailer 2011-03-18 12:07:42 UTC
Reply-To: mel@csn.ul.ie

On Thu, Mar 17, 2011 at 02:47:27PM -0700, Andrew Morton wrote:
> On Thu, 17 Mar 2011 16:27:29 -0500
> Alex Villac____s Lasso <avillaci@fiec.espol.edu.ec> wrote:
> 
> > > So it appears that the system is full of dirty pages against a slow
> > > device and your foreground processes have got stuck in direct reclaim
> > > ->  compaction ->  migration.   That's Mel ;)
> > >
> > > What happened to the plans to eliminate direct reclaim?
> > >
> > >
> > Browsing around bugzilla, I believe that bug 12309 looks very similar to
> the issue I am experiencing, especially from comment #525 onwards. Am I
> correct in this?
> 
> ah, the epic 12309.  https://bugzilla.kernel.org/show_bug.cgi?id=12309.
> If you're ever wondering how much we suck, go read that one.
> 

I'm reasonably sure over the last few series that we've taken a number of
steps to mitigate the problems described in #12309 although it's been a while
since I double-checked. When I last stopped looking at it, we had reached
the stage where the number of dirty pages encountered by writeback was
greatly reduced, which should have affected the stalls reported in that
bug. I stopped working on it further to see how the IO-less dirty
balancing being worked on by
Wu and Jan worked out because the next reasonable step was making sure the
flusher threads were behaving as expected. That is still a work in progress.

> I think what we're seeing in 31142 is a large amount of dirty data
> buffered against a slow device.  Innocent processes enter page reclaim
> and end up getting stuck trying to write to that heavily-queued and
> slow device.
> 
> If so, that's probably what some of the 12309 participants are seeing. 
> But there are lots of other things in that report too.
> 
> 
> Now, the problem you're seeing in 31142 isn't really supposed to
> happen.  In the direct-reclaim case the code will try to avoid
> initiation of blocking I/O against a congested device, via the
> bdi_write_congested() test in may_write_to_queue().  Although that code
> now looks a bit busted for the order>PAGE_ALLOC_COSTLY_ORDER case,
> whodidthat.
> 
> However in the case of the new(ish) compaction/migration code I don't
> think we're performing that test.  migrate_pages()->unmap_and_move()
> will get stuck behind that large&slow IO queue if page reclaim decided
> to pass it down sync==true, as it apparently has done.
> 
> IOW, Mel broke it ;)
> 

\o/ ... no wait, it's the other one - :(

If you look at the stack traces though, all of them had called
do_huge_pmd_anonymous_page() so while it looks similar to 12309, the trigger
is new because it's THP triggering compaction that is causing the stalls
rather than page reclaim doing direct writeback which was the culprit in
the past.

To confirm if this is the case, I'd be very interested in hearing if this
problem persists in the following cases

1. 2.6.38-rc8 with defrag disabled by
   echo never >/sys/kernel/mm/transparent_hugepage/defrag
   (this will stop THP allocations calling into compaction)
2. 2.6.38-rc8 with THP disabled by
   echo never > /sys/kernel/mm/transparent_hugepage/enabled
   (if the problem still persists, then page reclaim is still a problem
    but we should still stop THP doing sync writes)
3. 2.6.37 vanilla
   (in case this is a new regression introduced since then)

Migration can do sync writes on dirty pages which is why it looks so similar
to page reclaim but this can be controlled by the value of sync_migration
passed into try_to_compact_pages(). If we find that option 1 above makes
the regression go away or at least helps a lot, then a reasonable fix may
be to never set sync_migration if __GFP_NO_KSWAPD which is always set for
THP allocations. I've added Andrea to the cc to see what he thinks.

Thanks for the report.
Comment 14 Anonymous Emailer 2011-03-18 13:16:39 UTC
Reply-To: aarcange@redhat.com

On Fri, Mar 18, 2011 at 11:13:00AM +0000, Mel Gorman wrote:
> To confirm if this is the case, I'd be very interested in hearing if this
> problem persists in the following cases
> 
> 1. 2.6.38-rc8 with defrag disabled by
>    echo never >/sys/kernel/mm/transparent_hugepage/defrag
>    (this will stop THP allocations calling into compaction)
> 2. 2.6.38-rc8 with THP disabled by
>    echo never > /sys/kernel/mm/transparent_hugepage/enabled
>    (if the problem still persists, then page reclaim is still a problem
>     but we should still stop THP doing sync writes)
> 3. 2.6.37 vanilla
>    (in case this is a new regression introduced since then)
> 
> Migration can do sync writes on dirty pages which is why it looks so similar
> to page reclaim but this can be controlled by the value of sync_migration
> passed into try_to_compact_pages(). If we find that option 1 above makes
> the regression go away or at least helps a lot, then a reasonable fix may
> be to never set sync_migration if __GFP_NO_KSWAPD which is always set for
> THP allocations. I've added Andrea to the cc to see what he thinks.

I agree. Forcing sync=0 when __GFP_NO_KSWAPD is set sounds good to
me, if it is proven to resolve these I/O waits.

Also note that 2.6.38 upstream still misses a couple of important
compaction fixes that are in aa.git (everything relevant is already
queued in -mm but it was a bit late for 2.6.38), so I'd also be
interested to know if you can reproduce in current aa.git
origin/master branch.

If it's a __GFP_NO_KSWAPD allocation (do_huge_pmd_anonymous_page())
that is present in the hanging stack traces, I strongly doubt any of
the changes in aa.git are going to help at all, but it's worth a try
to be sure.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=48ad57f498835621d8bad83b972ee6e6c395523a
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=8f6854f7cbf71bc61758bcd92497378e1f677552
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=8ff6d16eb15d2b328bbe715fcaf453b6fedb2cf9
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=e31adb46cd8c4f331cfb02c938e88586d5846bf8

This is the implementation of Mel's idea that you can apply to
upstream or aa.git to see what happens...

===
Subject: compaction: use async migrate for __GFP_NO_KSWAPD

From: Andrea Arcangeli <aarcange@redhat.com>

__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory to
succeed (they have graceful fallback). Waiting for I/O in those, tends to be
overkill in terms of latencies, so we can reduce their latency by disabling
sync migrate.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bd76256..36d1c79 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2085,7 +2085,7 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
Comment 15 Anonymous Emailer 2011-03-18 18:27:50 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 18/03/11 06:13, Mel Gorman wrote:
>
> \o/ ... no wait, it's the other one - :(
>
> If you look at the stack traces though, all of them had called
> do_huge_pmd_anonymous_page() so while it looks similar to 12309, the trigger
> is new because it's THP triggering compaction that is causing the stalls
> rather than page reclaim doing direct writeback which was the culprit in
> the past.
>
> To confirm if this is the case, I'd be very interested in hearing if this
> problem persists in the following cases
>
> 1. 2.6.38-rc8 with defrag disabled by
>     echo never>/sys/kernel/mm/transparent_hugepage/defrag
>     (this will stop THP allocations calling into compaction)
> 2. 2.6.38-rc8 with THP disabled by
>     echo never>
> /sys/kernel/mm/transparent_hugepage/enabled
>     (if the problem still persists, then page reclaim is still a problem
>      but we should still stop THP doing sync writes)
> 3. 2.6.37 vanilla
>     (in case this is a new regression introduced since then)
>
> Migration can do sync writes on dirty pages which is why it looks so similar
> to page reclaim but this can be controlled by the value of sync_migration
> passed into try_to_compact_pages(). If we find that option 1 above makes
> the regression go away or at least helps a lot, then a reasonable fix may
> be to never set sync_migration if __GFP_NO_KSWAPD which is always set for
> THP allocations. I've added Andrea to the cc to see what he thinks.
>
> Thanks for the report.
>
I have just done tests 1 and 2 on 2.6.38 (final, not -rc8), and I have verified that echoing "never" on either /sys/kernel/mm/transparent_hugepage/defrag or /sys/kernel/mm/transparent_hugepage/enabled does allow the file copy to USB to proceed smoothly 
(copying 4GB of data). Just to verify, I later wrote "always" to both files, and sure enough, some applications stalled when I repeated the same file copy. So I have at least a workaround for the issue. Given this evidence, will the patch at comment #14 
fix the issue for good?
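The workaround Alex confirms here can be toggled at runtime without rebooting (root required; sysfs paths exactly as given in the thread):

```shell
# Option 1: keep THP enabled but stop page faults from invoking
# synchronous compaction (this alone avoided the stalls here).
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Option 2: disable THP entirely.
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Restore the defaults (which reproduces the stalls again):
echo always > /sys/kernel/mm/transparent_hugepage/defrag
echo always > /sys/kernel/mm/transparent_hugepage/enabled
```

Note these settings do not persist across reboots.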
Comment 16 Anonymous Emailer 2011-03-19 13:47:05 UTC
Reply-To: mel@csn.ul.ie

On Fri, Mar 18, 2011 at 01:05:15PM -0500, Alex Villacís Lasso wrote:
> On 18/03/11 06:13, Mel Gorman wrote:
> >
> >\o/ ... no wait, it's the other one - :(
> >
> >If you look at the stack traces though, all of them had called
> >do_huge_pmd_anonymous_page() so while it looks similar to 12309, the trigger
> >is new because it's THP triggering compaction that is causing the stalls
> >rather than page reclaim doing direct writeback which was the culprit in
> >the past.
> >
> >To confirm if this is the case, I'd be very interested in hearing if this
> >problem persists in the following cases
> >
> >1. 2.6.38-rc8 with defrag disabled by
> >    echo never>/sys/kernel/mm/transparent_hugepage/defrag
> >    (this will stop THP allocations calling into compaction)
> >2. 2.6.38-rc8 with THP disabled by
> >    echo never>
> >/sys/kernel/mm/transparent_hugepage/enabled
> >    (if the problem still persists, then page reclaim is still a problem
> >     but we should still stop THP doing sync writes)
> >3. 2.6.37 vanilla
> >    (in case this is a new regression introduced since then)
> >
> >Migration can do sync writes on dirty pages which is why it looks so similar
> >to page reclaim but this can be controlled by the value of sync_migration
> >passed into try_to_compact_pages(). If we find that option 1 above makes
> >the regression go away or at least helps a lot, then a reasonable fix may
> >be to never set sync_migration if __GFP_NO_KSWAPD which is always set for
> >THP allocations. I've added Andrea to the cc to see what he thinks.
> >
> >Thanks for the report.
> >
>
> I have just done tests 1 and 2 on 2.6.38 (final, not -rc8), and I
> have verified that echoing "never" on either
> /sys/kernel/mm/transparent_hugepage/defrag or
> /sys/kernel/mm/transparent_hugepage/enabled does allow the file copy
> to USB to proceed smoothly (copying 4GB of data). Just to verify, I
> later wrote "always" to both files, and sure enough, some
> applications stalled when I repeated the same file copy. So I have
> at least a workaround for the issue. Given this evidence, will the
> patch at comment #14 fix the issue for good?
> 

Thanks for testing and reporting, it's very helpful. Based on that
report, the patch should help. Can you test it to be absolutely sure, please?
Comment 17 Alex Villacis Lasso 2011-03-19 15:59:06 UTC
Created attachment 51262 [details]
sysrq-w trace with patch applied

The patch did not help; it just delayed the appearance of the symptom. Now it took longer (about 4 minutes) before several applications started freezing. The sysrq-w trace shows the situation.
Comment 18 Anonymous Emailer 2011-03-19 16:05:32 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 19/03/11 08:46, Mel Gorman wrote:
> On Fri, Mar 18, 2011 at 01:05:15PM -0500, Alex Villac??s Lasso wrote:
>> On 18/03/11 06:13, Mel Gorman wrote:
>>> \o/ ... no wait, it's the other one - :(
>>>
>>> If you look at the stack traces though, all of them had called
>>> do_huge_pmd_anonymous_page() so while it looks similar to 12309, the
>>> trigger
>>> is new because it's THP triggering compaction that is causing the stalls
>>> rather than page reclaim doing direct writeback which was the culprit in
>>> the past.
>>>
>>> To confirm if this is the case, I'd be very interested in hearing if this
>>> problem persists in the following cases
>>>
>>> 1. 2.6.38-rc8 with defrag disabled by
>>>     echo never>/sys/kernel/mm/transparent_hugepage/defrag
>>>     (this will stop THP allocations calling into compaction)
>>> 2. 2.6.38-rc8 with THP disabled by
>>>     echo never>
>>> /sys/kernel/mm/transparent_hugepage/enabled
>>>     (if the problem still persists, then page reclaim is still a problem
>>>      but we should still stop THP doing sync writes)
>>> 3. 2.6.37 vanilla
>>>     (in case this is a new regression introduced since then)
>>>
>>> Migration can do sync writes on dirty pages which is why it looks so
>>> similar
>>> to page reclaim but this can be controlled by the value of sync_migration
>>> passed into try_to_compact_pages(). If we find that option 1 above makes
>>> the regression go away or at least helps a lot, then a reasonable fix may
>>> be to never set sync_migration if __GFP_NO_KSWAPD which is always set for
>>> THP allocations. I've added Andrea to the cc to see what he thinks.
>>>
>>> Thanks for the report.
>>>
>> I have just done tests 1 and 2 on 2.6.38 (final, not -rc8), and I
>> have verified that echoing "never" on either
>> /sys/kernel/mm/transparent_hugepage/defrag or
>> /sys/kernel/mm/transparent_hugepage/enabled does allow the file copy
>> to USB to proceed smoothly (copying 4GB of data). Just to verify, I
>> later wrote "always" to both files, and sure enough, some
>> applications stalled when I repeated the same file copy. So I have
>> at least a workaround for the issue. Given this evidence, will the
>> patch at comment #14 fix the issue for good?
>>
> Thanks for testing and reporting, it's very helpful. Based on that that
> report the patch should help. Can you test it to be absolutly sure please?
>
>
The patch did not help. I have attached a sysrq-w trace with the patch applied in the bug report.
Comment 19 Anonymous Emailer 2011-03-19 23:52:18 UTC
Reply-To: aarcange@redhat.com

On Sat, Mar 19, 2011 at 11:04:02AM -0500, Alex Villacís Lasso wrote:
> The patch did not help. I have attached a sysrq-w trace with the patch
> applied in the bug report.

Most processes are stuck in udf_writepage. That's because migrate is
calling ->writepage on dirty pages even when sync=0.

This may do better, can you test it in replacement of the previous
patch?

Thanks,
Andrea

===
Subject: compaction: use async migrate for __GFP_NO_KSWAPD

From: Andrea Arcangeli <aarcange@redhat.com>

__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory to
succeed (they have graceful fallback). Waiting for I/O in those, tends to be
overkill in terms of latencies, so we can reduce their latency by disabling
sync migrate.

Stop calling ->writepage on dirty cache when migrate sync mode is not set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/migrate.c    |   35 ++++++++++++++++++++++++++---------
 mm/page_alloc.c |    2 +-
 2 files changed, 27 insertions(+), 10 deletions(-)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2085,7 +2085,7 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -536,10 +536,15 @@ static int writeout(struct address_space
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page)
+				 struct page *newpage, struct page *page,
+				 int sync)
 {
-	if (PageDirty(page))
-		return writeout(mapping, page);
+	if (PageDirty(page)) {
+		if (sync)
+			return writeout(mapping, page);
+		else
+			return -EBUSY;
+	}
 
 	/*
 	 * Buffers may be managed in a filesystem specific way.
@@ -564,7 +569,7 @@ static int fallback_migrate_page(struct 
  *  == 0 - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
-						int remap_swapcache)
+			    int remap_swapcache, int sync)
 {
 	struct address_space *mapping;
 	int rc;
@@ -597,7 +602,7 @@ static int move_to_new_page(struct page 
 		rc = mapping->a_ops->migratepage(mapping,
 						newpage, page);
 	else
-		rc = fallback_migrate_page(mapping, newpage, page);
+		rc = fallback_migrate_page(mapping, newpage, page, sync);
 
 	if (rc) {
 		newpage->mapping = NULL;
@@ -641,6 +646,10 @@ static int unmap_and_move(new_page_t get
 	rc = -EAGAIN;
 
 	if (!trylock_page(page)) {
+		if (!sync) {
+			rc = -EBUSY;
+			goto move_newpage;
+		}
 		if (!force)
 			goto move_newpage;
 
@@ -686,7 +695,11 @@ static int unmap_and_move(new_page_t get
 	BUG_ON(charge);
 
 	if (PageWriteback(page)) {
-		if (!force || !sync)
+		if (!sync) {
+			rc = -EBUSY;
+			goto uncharge;
+		}
+		if (!force)
 			goto uncharge;
 		wait_on_page_writeback(page);
 	}
@@ -757,7 +770,7 @@ static int unmap_and_move(new_page_t get
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, remap_swapcache);
+		rc = move_to_new_page(newpage, page, remap_swapcache, sync);
 
 	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
@@ -834,7 +847,11 @@ static int unmap_and_move_huge_page(new_
 	rc = -EAGAIN;
 
 	if (!trylock_page(hpage)) {
-		if (!force || !sync)
+		if (!sync) {
+			rc = -EBUSY;
+			goto out;
+		}
+		if (!force)
 			goto out;
 		lock_page(hpage);
 	}
@@ -850,7 +867,7 @@ static int unmap_and_move_huge_page(new_
 	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, 1);
+		rc = move_to_new_page(new_hpage, hpage, 1, sync);
 
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
Comment 20 Anonymous Emailer 2011-03-21 09:42:23 UTC
Reply-To: mel@csn.ul.ie

On Sun, Mar 20, 2011 at 12:51:44AM +0100, Andrea Arcangeli wrote:
> On Sat, Mar 19, 2011 at 11:04:02AM -0500, Alex Villacís Lasso wrote:
> > The patch did not help. I have attached a sysrq-w trace with the patch
> applied in the bug report.
> 
> Most processes are stuck in udf_writepage. That's because migrate is
> calling ->writepage on dirty pages even when sync=0.
> 
> This may do better, can you test it in replacement of the previous
> patch?
> 
> ===
> Subject: compaction: use async migrate for __GFP_NO_KSWAPD
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory to
> succeed (they have graceful fallback). Waiting for I/O in those, tends to be
> overkill in terms of latencies, so we can reduce their latency by disabling
> sync migrate.
> 
> Stop calling ->writepage on dirty cache when migrate sync mode is not set.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/migrate.c    |   35 ++++++++++++++++++++++++++---------
>  mm/page_alloc.c |    2 +-
>  2 files changed, 27 insertions(+), 10 deletions(-)
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2085,7 +2085,7 @@ rebalance:
>                                       sync_migration);
>       if (page)
>               goto got_pg;
> -     sync_migration = true;
> +     sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>  
>       /* Try direct reclaim and then allocating */
>       page = __alloc_pages_direct_reclaim(gfp_mask, order,
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -536,10 +536,15 @@ static int writeout(struct address_space
>   * Default handling if a filesystem does not provide a migration function.
>   */
>  static int fallback_migrate_page(struct address_space *mapping,
> -     struct page *newpage, struct page *page)
> +                              struct page *newpage, struct page *page,
> +                              int sync)
>  {
> -     if (PageDirty(page))
> -             return writeout(mapping, page);
> +     if (PageDirty(page)) {
> +             if (sync)
> +                     return writeout(mapping, page);
> +             else
> +                     return -EBUSY;
> +     }
>  
>       /*
>        * Buffers may be managed in a filesystem specific way.

The check is at the wrong level, I believe, because it misses NFS pages,
which will still get queued for IO and can block waiting on a request to
complete.

> @@ -564,7 +569,7 @@ static int fallback_migrate_page(struct 
>   *  == 0 - success
>   */
>  static int move_to_new_page(struct page *newpage, struct page *page,
> -                                             int remap_swapcache)
> +                         int remap_swapcache, int sync)

sync should be bool.

>  {
>       struct address_space *mapping;
>       int rc;
> @@ -597,7 +602,7 @@ static int move_to_new_page(struct page 
>               rc = mapping->a_ops->migratepage(mapping,
>                                               newpage, page);
>       else
> -             rc = fallback_migrate_page(mapping, newpage, page);
> +             rc = fallback_migrate_page(mapping, newpage, page, sync);
>  
>       if (rc) {
>               newpage->mapping = NULL;
> @@ -641,6 +646,10 @@ static int unmap_and_move(new_page_t get
>       rc = -EAGAIN;
>  
>       if (!trylock_page(page)) {
> +             if (!sync) {
> +                     rc = -EBUSY;
> +                     goto move_newpage;
> +             }

It's overkill to return -EBUSY just because we failed to get a lock that
could be released very quickly. If we left rc as -EAGAIN it would retry
again. The worst-case scenario is that the current process is the holder
of the lock and the loop is pointless, but this is a relatively rare
situation (Hugh's loopback test aside, which seems to be particularly
good at triggering that situation).
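The distinction between -EAGAIN and -EBUSY matters because of how the caller treats the two codes. Below is a rough userspace model of the migrate_pages() retry policy (the up-to-10-pass outer loop reflects the real function; the page callbacks are invented purely for illustration):

```c
#include <assert.h>
#include <errno.h>

/* Rough model of migrate_pages(): a page that fails with -EAGAIN is
 * retried on the next pass (up to 10 passes), while any other error,
 * such as -EBUSY, abandons the page immediately.  Hence leaving rc as
 * -EAGAIN on a failed trylock gives the lock holder a chance to drop
 * the lock before the next pass. */
static int migrate_one_page(int (*try_move)(int pass))
{
	int rc = -EAGAIN;
	int pass;

	for (pass = 0; pass < 10 && rc == -EAGAIN; pass++)
		rc = try_move(pass);
	return rc;
}

/* Invented callbacks standing in for unmap_and_move(). */
static int lock_freed_on_pass_3(int pass)
{
	return pass < 3 ? -EAGAIN : 0; /* lock dropped, move succeeds */
}

static int page_under_writeback(int pass)
{
	(void)pass;
	return -EBUSY; /* writeback won't clear soon; give up at once */
}
```

A briefly-held page lock is eventually acquired within the retry loop, while a writeback page is skipped on the first pass.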

>               if (!force)
>                       goto move_newpage;
>  
> @@ -686,7 +695,11 @@ static int unmap_and_move(new_page_t get
>       BUG_ON(charge);
>  
>       if (PageWriteback(page)) {
> -             if (!force || !sync)
> +             if (!sync) {
> +                     rc = -EBUSY;
> +                     goto uncharge;
> +             }
> +             if (!force)
>                       goto uncharge;

Whereas this is ok, because if the page is being written back it's fairly
unlikely it'll get cleared quickly enough for the retry loop to make sense.

>               wait_on_page_writeback(page);
>       }
> @@ -757,7 +770,7 @@ static int unmap_and_move(new_page_t get
>  
>  skip_unmap:
>       if (!page_mapped(page))
> -             rc = move_to_new_page(newpage, page, remap_swapcache);
> +             rc = move_to_new_page(newpage, page, remap_swapcache, sync);
>  
>       if (rc && remap_swapcache)
>               remove_migration_ptes(page, page);
> @@ -834,7 +847,11 @@ static int unmap_and_move_huge_page(new_
>       rc = -EAGAIN;
>  
>       if (!trylock_page(hpage)) {
> -             if (!force || !sync)
> +             if (!sync) {
> +                     rc = -EBUSY;
> +                     goto out;
> +             }
> +             if (!force)
>                       goto out;
>               lock_page(hpage);
>       }

As before, it's worth retrying to get the lock as it could be released
very shortly.

> @@ -850,7 +867,7 @@ static int unmap_and_move_huge_page(new_
>       try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>  
>       if (!page_mapped(hpage))
> -             rc = move_to_new_page(new_hpage, hpage, 1);
> +             rc = move_to_new_page(new_hpage, hpage, 1, sync);
>  
>       if (rc)
>               remove_migration_ptes(hpage, hpage);
> 

Because of the NFS pages and being a bit aggressive about using -EBUSY,
how about the following instead? (build-tested only, unfortunately)

==== CUT HERE ====
mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback

From: Andrea Arcangeli <aarcange@redhat.com>

__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed as they have graceful fallback. Waiting for I/O in those, tends
to be overkill in terms of latencies, so we can reduce their latency by
disabling sync migrate.

Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g. get_request_wait
in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD
blocking, this patch prevents ->writepage being called on dirty page cache
for asynchronous migration.

[mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c    |   47 ++++++++++++++++++++++++++++++-----------------
 mm/page_alloc.c |    2 +-
 2 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 352de555..1b45508 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -564,7 +564,7 @@ static int fallback_migrate_page(struct address_space *mapping,
  *  == 0 - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
-						int remap_swapcache)
+					int remap_swapcache, bool sync)
 {
 	struct address_space *mapping;
 	int rc;
@@ -586,18 +586,23 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 	mapping = page_mapping(page);
 	if (!mapping)
 		rc = migrate_page(mapping, newpage, page);
-	else if (mapping->a_ops->migratepage)
-		/*
-		 * Most pages have a mapping and most filesystems
-		 * should provide a migration function. Anonymous
-		 * pages are part of swap space which also has its
-		 * own migration function. This is the most common
-		 * path for page migration.
-		 */
-		rc = mapping->a_ops->migratepage(mapping,
-						newpage, page);
-	else
-		rc = fallback_migrate_page(mapping, newpage, page);
+	else {
+		/* Do not writeback pages if !sync */
+		if (PageDirty(page) && !sync)
+			rc = -EBUSY;
+		else if (mapping->a_ops->migratepage)
+			/*
+		 	* Most pages have a mapping and most filesystems
+		 	* should provide a migration function. Anonymous
+		 	* pages are part of swap space which also has its
+		 	* own migration function. This is the most common
+		 	* path for page migration.
+		 	*/
+			rc = mapping->a_ops->migratepage(mapping,
+							newpage, page);
+		else
+			rc = fallback_migrate_page(mapping, newpage, page);
+	}
 
 	if (rc) {
 		newpage->mapping = NULL;
@@ -641,7 +646,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	rc = -EAGAIN;
 
 	if (!trylock_page(page)) {
-		if (!force)
+		if (!force || !sync)
 			goto move_newpage;
 
 		/*
@@ -686,7 +691,15 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	BUG_ON(charge);
 
 	if (PageWriteback(page)) {
-		if (!force || !sync)
+		/*
+		 * For !sync, there is no point retrying as the retry loop
+		 * is expected to be too short for PageWriteback to be cleared
+		 */
+		if (!sync) {
+			rc = -EBUSY;
+			goto uncharge;
+		}
+		if (!force)
 			goto uncharge;
 		wait_on_page_writeback(page);
 	}
@@ -757,7 +770,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, remap_swapcache);
+		rc = move_to_new_page(newpage, page, remap_swapcache, sync);
 
 	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
@@ -850,7 +863,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, 1);
+		rc = move_to_new_page(new_hpage, hpage, 1, sync);
 
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cdef1d4..ce6d601 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2085,7 +2085,7 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
Comment 21 Anonymous Emailer 2011-03-21 13:49:12 UTC
Reply-To: aarcange@redhat.com

On Mon, Mar 21, 2011 at 09:41:49AM +0000, Mel Gorman wrote:
> The check is at the wrong level I believe because it misses NFS pages which
> will still get queued for IO which can block waiting on a request to
> complete.

But, for example, ->migratepage won't block at all for swapcache... it's
just a pointer to migrate_page... so I didn't want to skip what could
be nonblocking; that would just make migration less reliable for no good
reason in some cases. The fallback case is very likely to block instead,
so I only returned -EBUSY there.

Best would be to pass a sync/nonblock param to migratepage(nonblock)
so that nfs_migrate_page can pass "nonblock" instead of "false" to
nfs_find_and_lock_request.

> sync should be bool.

That's better thanks.

> It's overkill to return EBUSY just because we failed to get a lock which
> could
> be released very quickly. If we left rc as -EAGAIN it would retry again.
> The worst case scenario is that the current process is the holder of the
> lock and the loop is pointless but this is a relatively rare situation
> (other than Hugh's loopback test aside which seems to be particularly good
> at triggering that situation).

This change was only meant to avoid some CPU waste in the tight loop;
it's not really "blocking" related, so I'm fine with dropping it for
now. The page lock holder had better be quick, because with sync=0 the
tight loop will retry very fast. If the holder blocks, we're not so
smart about retrying in a tight loop, but for now it's ok.

> >                     goto move_newpage;
> >  
> > @@ -686,7 +695,11 @@ static int unmap_and_move(new_page_t get
> >     BUG_ON(charge);
> >  
> >     if (PageWriteback(page)) {
> > -           if (!force || !sync)
> > +           if (!sync) {
> > +                   rc = -EBUSY;
> > +                   goto uncharge;
> > +           }
> > +           if (!force)
> >                     goto uncharge;
> 
> Whereas this is ok because if the page is being written back, it's fairly
> unlikely it'll get cleared quickly enough for the retry loop to make sense.

Agreed.

> Because of the NFS pages and being a bit aggressive about using -EBUSY,
> how about the following instead? (build tested only unfortunately)

I tested my version below, but I think one needs UDF with lots of dirty
pages plus the USB device to trigger this, which I don't have set up
immediately.

> @@ -586,18 +586,23 @@ static int move_to_new_page(struct page *newpage,
> struct page *page,
>       mapping = page_mapping(page);
>       if (!mapping)
>               rc = migrate_page(mapping, newpage, page);
> -     else if (mapping->a_ops->migratepage)
> -             /*
> -              * Most pages have a mapping and most filesystems
> -              * should provide a migration function. Anonymous
> -              * pages are part of swap space which also has its
> -              * own migration function. This is the most common
> -              * path for page migration.
> -              */
> -             rc = mapping->a_ops->migratepage(mapping,
> -                                             newpage, page);
> -     else
> -             rc = fallback_migrate_page(mapping, newpage, page);
> +     else {
> +             /* Do not writeback pages if !sync */
> +             if (PageDirty(page) && !sync)
> +                     rc = -EBUSY;

I think it's better to at least change it to:

if (PageDirty(page) && !sync && mapping->a_ops->migratepage != migrate_page)

I wasn't sure how to handle a nonblocking ->migratepage for swapcache and
tmpfs, but probably the above check is a good enough approximation.

Before sending my patch I thought of adding a "sync" parameter to
->migratepage(..., sync/nonblock), but then the patch became
bigger... and I just wanted to know if this was the problem or not, so
I deferred it.

If we're sure that all migratepage implementations block except for
things like swapcache/tmpfs or other non-file-backed things that define
it to migrate_page, we're pretty well covered by adding a check like the
above (migratepage == migrate_page), and maybe we don't need to add a
"sync/nonblock" parameter to ->migratepage(). For example,
buffer_migrate_page can block too, in lock_buffer.

This is the patch I'm trying with the addition of the above check and
some comment space/tab issue cleanup.

===
Subject: mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback

From: Andrea Arcangeli <aarcange@redhat.com>

__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed as they have graceful fallback. Waiting for I/O in those, tends
to be overkill in terms of latencies, so we can reduce their latency by
disabling sync migrate.

Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g. get_request_wait
in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD
blocking, this patch prevents ->writepage being called on dirty page cache
for asynchronous migration.

[mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c    |   48 +++++++++++++++++++++++++++++++++---------------
 mm/page_alloc.c |    2 +-
 2 files changed, 34 insertions(+), 16 deletions(-)

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -564,7 +564,7 @@ static int fallback_migrate_page(struct 
  *  == 0 - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
-						int remap_swapcache)
+					int remap_swapcache, bool sync)
 {
 	struct address_space *mapping;
 	int rc;
@@ -586,18 +586,28 @@ static int move_to_new_page(struct page 
 	mapping = page_mapping(page);
 	if (!mapping)
 		rc = migrate_page(mapping, newpage, page);
-	else if (mapping->a_ops->migratepage)
+	else {
 		/*
-		 * Most pages have a mapping and most filesystems
-		 * should provide a migration function. Anonymous
-		 * pages are part of swap space which also has its
-		 * own migration function. This is the most common
-		 * path for page migration.
+		 * Do not writeback pages if !sync and migratepage is
+		 * not pointing to migrate_page() which is nonblocking
+		 * (swapcache/tmpfs uses migratepage = migrate_page).
 		 */
-		rc = mapping->a_ops->migratepage(mapping,
-						newpage, page);
-	else
-		rc = fallback_migrate_page(mapping, newpage, page);
+		if (PageDirty(page) && !sync &&
+		    mapping->a_ops->migratepage != migrate_page)
+			rc = -EBUSY;
+		else if (mapping->a_ops->migratepage)
+			/*
+			 * Most pages have a mapping and most filesystems
+			 * should provide a migration function. Anonymous
+			 * pages are part of swap space which also has its
+			 * own migration function. This is the most common
+			 * path for page migration.
+			 */
+			rc = mapping->a_ops->migratepage(mapping,
+							newpage, page);
+		else
+			rc = fallback_migrate_page(mapping, newpage, page);
+	}
 
 	if (rc) {
 		newpage->mapping = NULL;
@@ -641,7 +651,7 @@ static int unmap_and_move(new_page_t get
 	rc = -EAGAIN;
 
 	if (!trylock_page(page)) {
-		if (!force)
+		if (!force || !sync)
 			goto move_newpage;
 
 		/*
@@ -686,7 +696,15 @@ static int unmap_and_move(new_page_t get
 	BUG_ON(charge);
 
 	if (PageWriteback(page)) {
-		if (!force || !sync)
+		/*
+		 * For !sync, there is no point retrying as the retry loop
+		 * is expected to be too short for PageWriteback to be cleared
+		 */
+		if (!sync) {
+			rc = -EBUSY;
+			goto uncharge;
+		}
+		if (!force)
 			goto uncharge;
 		wait_on_page_writeback(page);
 	}
@@ -757,7 +775,7 @@ static int unmap_and_move(new_page_t get
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, remap_swapcache);
+		rc = move_to_new_page(newpage, page, remap_swapcache, sync);
 
 	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
@@ -850,7 +868,7 @@ static int unmap_and_move_huge_page(new_
 	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, 1);
+		rc = move_to_new_page(new_hpage, hpage, 1, sync);
 
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2085,7 +2085,7 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
Comment 22 Anonymous Emailer 2011-03-21 15:24:11 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 21/03/11 08:48, Andrea Arcangeli wrote:
>
> ===
> Subject: mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce
> no writeback
>
> From: Andrea Arcangeli<aarcange@redhat.com>
>
> __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
> to succeed as they have graceful fallback. Waiting for I/O in those, tends
> to be overkill in terms of latencies, so we can reduce their latency by
> disabling sync migrate.
>
> Unfortunately, even with async migration it's still possible for the
> process to be blocked waiting for a request slot (e.g. get_request_wait
> in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD
> blocking, this patch prevents ->writepage being called on dirty page cache
> for asynchronous migration.
>
> [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
> ---
>   mm/migrate.c    |   48 +++++++++++++++++++++++++++++++++---------------
>   mm/page_alloc.c |    2 +-
>   2 files changed, 34 insertions(+), 16 deletions(-)
The latest patch fails to apply in vanilla 2.6.38:

[alex@srv64 linux-2.6.38]$ patch -p1 --dry-run < ../\[Bug\ 31142\]\ Large\ write\ to\ USB\ stick\ freezes\ unrelated\ tasks\ for\ a\ long\ time.eml
(Stripping trailing CRs from patch.)
patching file mm/migrate.c
Hunk #1 FAILED at 564.
Hunk #2 FAILED at 586.
Hunk #3 FAILED at 641.
Hunk #4 FAILED at 686.
Hunk #5 FAILED at 757.
Hunk #6 FAILED at 850.
6 out of 6 hunks FAILED -- saving rejects to file mm/migrate.c.rej
(Stripping trailing CRs from patch.)
patching file mm/page_alloc.c
Hunk #1 FAILED at 2085.
1 out of 1 hunk FAILED -- saving rejects to file mm/page_alloc.c.rej

I will try to apply the patch manually.
Comment 23 Anonymous Emailer 2011-03-21 15:37:45 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 21/03/11 10:22, Alex Villacís Lasso wrote:
> On 21/03/11 08:48, Andrea Arcangeli wrote:
>>
>> ===
>> Subject: mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce
>> no writeback
>>
>> From: Andrea Arcangeli<aarcange@redhat.com>
>>
>> __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
>> to succeed as they have graceful fallback. Waiting for I/O in those, tends
>> to be overkill in terms of latencies, so we can reduce their latency by
>> disabling sync migrate.
>>
>> Unfortunately, even with async migration it's still possible for the
>> process to be blocked waiting for a request slot (e.g. get_request_wait
>> in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD
>> blocking, this patch prevents ->writepage being called on dirty page cache
>> for asynchronous migration.
>>
>> [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
>> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
>> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
>> ---
>>   mm/migrate.c    |   48 +++++++++++++++++++++++++++++++++---------------
>>   mm/page_alloc.c |    2 +-
>>   2 files changed, 34 insertions(+), 16 deletions(-)
> The latest patch fails to apply in vanilla 2.6.38:
>
> [alex@srv64 linux-2.6.38]$ patch -p1 --dry-run < ../\[Bug\ 31142\]\ Large\
> write\ to\ USB\ stick\ freezes\ unrelated\ tasks\ for\ a\ long\ time.eml
> (Stripping trailing CRs from patch.)
> patching file mm/migrate.c
> Hunk #1 FAILED at 564.
> Hunk #2 FAILED at 586.
> Hunk #3 FAILED at 641.
> Hunk #4 FAILED at 686.
> Hunk #5 FAILED at 757.
> Hunk #6 FAILED at 850.
> 6 out of 6 hunks FAILED -- saving rejects to file mm/migrate.c.rej
> (Stripping trailing CRs from patch.)
> patching file mm/page_alloc.c
> Hunk #1 FAILED at 2085.
> 1 out of 1 hunk FAILED -- saving rejects to file mm/page_alloc.c.rej
>
> I will try to apply the patch manually.
>
After some massaging, I succeeded in applying the patch. Compiling now.
Comment 24 Anonymous Emailer 2011-03-21 15:40:45 UTC
Reply-To: aarcange@redhat.com

On Mon, Mar 21, 2011 at 10:22:43AM -0500, Alex Villacís Lasso wrote:
> I will try to apply the patch manually.

Hmm, I checked that this latest version below applies cleanly both
against vanilla 2.6.38 and current git. It should apply cleanly
if you run:

     "git fetch; git checkout -f origin/master"

or "git checkout -f v2.6.38"

I have this patch and another migrate patch from Hugh both applied in
aa.git, so you may use that tree too if you want:

     "git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git"

(with --reference it should clone very fast, linux-2.6 must be a clone
of the upstream linux-2.6.git tree)

Thanks,
Andrea

===
Subject: mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback

From: Andrea Arcangeli <aarcange@redhat.com>

__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed as they have graceful fallback. Waiting for I/O in those tends
to be overkill in terms of latencies, so we can reduce their latency by
disabling sync migrate.

Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g. get_request_wait
in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD
blocking, this patch prevents ->writepage being called on dirty page cache
for asynchronous migration.

[mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c    |   48 +++++++++++++++++++++++++++++++++---------------
 mm/page_alloc.c |    2 +-
 2 files changed, 34 insertions(+), 16 deletions(-)

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -564,7 +564,7 @@ static int fallback_migrate_page(struct 
  *  == 0 - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
-						int remap_swapcache)
+					int remap_swapcache, bool sync)
 {
 	struct address_space *mapping;
 	int rc;
@@ -586,18 +586,28 @@ static int move_to_new_page(struct page 
 	mapping = page_mapping(page);
 	if (!mapping)
 		rc = migrate_page(mapping, newpage, page);
-	else if (mapping->a_ops->migratepage)
+	else {
 		/*
-		 * Most pages have a mapping and most filesystems
-		 * should provide a migration function. Anonymous
-		 * pages are part of swap space which also has its
-		 * own migration function. This is the most common
-		 * path for page migration.
+		 * Do not writeback pages if !sync and migratepage is
+		 * not pointing to migrate_page() which is nonblocking
+		 * (swapcache/tmpfs uses migratepage = migrate_page).
 		 */
-		rc = mapping->a_ops->migratepage(mapping,
-						newpage, page);
-	else
-		rc = fallback_migrate_page(mapping, newpage, page);
+		if (PageDirty(page) && !sync &&
+		    mapping->a_ops->migratepage != migrate_page)
+			rc = -EBUSY;
+		else if (mapping->a_ops->migratepage)
+			/*
+			 * Most pages have a mapping and most filesystems
+			 * should provide a migration function. Anonymous
+			 * pages are part of swap space which also has its
+			 * own migration function. This is the most common
+			 * path for page migration.
+			 */
+			rc = mapping->a_ops->migratepage(mapping,
+							newpage, page);
+		else
+			rc = fallback_migrate_page(mapping, newpage, page);
+	}
 
 	if (rc) {
 		newpage->mapping = NULL;
@@ -641,7 +651,7 @@ static int unmap_and_move(new_page_t get
 	rc = -EAGAIN;
 
 	if (!trylock_page(page)) {
-		if (!force)
+		if (!force || !sync)
 			goto move_newpage;
 
 		/*
@@ -686,7 +696,15 @@ static int unmap_and_move(new_page_t get
 	BUG_ON(charge);
 
 	if (PageWriteback(page)) {
-		if (!force || !sync)
+		/*
+		 * For !sync, there is no point retrying as the retry loop
+		 * is expected to be too short for PageWriteback to be cleared
+		 */
+		if (!sync) {
+			rc = -EBUSY;
+			goto uncharge;
+		}
+		if (!force)
 			goto uncharge;
 		wait_on_page_writeback(page);
 	}
@@ -757,7 +775,7 @@ static int unmap_and_move(new_page_t get
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, remap_swapcache);
+		rc = move_to_new_page(newpage, page, remap_swapcache, sync);
 
 	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
@@ -850,7 +868,7 @@ static int unmap_and_move_huge_page(new_
 	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, 1);
+		rc = move_to_new_page(new_hpage, hpage, 1, sync);
 
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2085,7 +2085,7 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
Comment 25 Anonymous Emailer 2011-03-21 16:38:27 UTC
Reply-To: mel@csn.ul.ie

On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote:
> On Mon, Mar 21, 2011 at 09:41:49AM +0000, Mel Gorman wrote:
> > The check is at the wrong level I believe because it misses NFS pages which
> > will still get queued for IO which can block waiting on a request to
> > complete.
> 
> But for example ->migratepage won't block at all for swapcache... it's
> just a pointer to migrate_page... so I didn't want to skip what could
> be nonblocking; it just makes migration less reliable for no good
> reason in some cases. The fallback case is very likely blocking
> instead, so I only returned -EBUSY there.
> 

Fair point.

> Best would be to pass a sync/nonblock param to migratepage(nonblock)
> so that nfs_migrate_page can pass "nonblock" instead of "false" to
> nfs_find_and_lock_request.
> 

I had considered this but thought passing in sync to things like
migrate_page() that ignored it looked a little ugly.

> > sync should be bool.
> 
> That's better thanks.
> 
> > It's overkill to return EBUSY just because we failed to get a lock which
> > could
> > be released very quickly. If we left rc as -EAGAIN it would retry again.
> > The worst case scenario is that the current process is the holder of the
> > lock and the loop is pointless but this is a relatively rare situation
> > (other than Hugh's loopback test aside which seems to be particularly good
> > at triggering that situation).
> 
> This change was only meant to possibly avoid some cpu waste in the
> tight loop, not really "blocking" related, so I'm sure it's ok to drop
> it for now. The page lock holder had better be quick, because with
> sync=0 the tight loop will retry really fast. If the holder blocks
> we're not so smart at retrying in a tight loop, but for now it's ok.
> 

Ok.

> > >                   goto move_newpage;
> > >  
> > > @@ -686,7 +695,11 @@ static int unmap_and_move(new_page_t get
> > >   BUG_ON(charge);
> > >  
> > >   if (PageWriteback(page)) {
> > > -         if (!force || !sync)
> > > +         if (!sync) {
> > > +                 rc = -EBUSY;
> > > +                 goto uncharge;
> > > +         }
> > > +         if (!force)
> > >                   goto uncharge;
> > 
> > Where as this is ok because if the page is being written back, it's fairly
> > unlikely it'll get cleared quickly enough for the retry loop to make sense.
> 
> Agreed.
> 
> > Because of the NFS pages and being a bit aggressive about using -EBUSY,
> > how about the following instead? (build tested only unfortunately)
> 
> I tested my version below, but I think one needs udf with lots of
> dirty pages plus the USB stick to trigger this, which I don't have
> set up immediately.
> 
> > @@ -586,18 +586,23 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> >     mapping = page_mapping(page);
> >     if (!mapping)
> >             rc = migrate_page(mapping, newpage, page);
> > -   else if (mapping->a_ops->migratepage)
> > -           /*
> > -            * Most pages have a mapping and most filesystems
> > -            * should provide a migration function. Anonymous
> > -            * pages are part of swap space which also has its
> > -            * own migration function. This is the most common
> > -            * path for page migration.
> > -            */
> > -           rc = mapping->a_ops->migratepage(mapping,
> > -                                           newpage, page);
> > -   else
> > -           rc = fallback_migrate_page(mapping, newpage, page);
> > +   else {
> > +           /* Do not writeback pages if !sync */
> > +           if (PageDirty(page) && !sync)
> > +                   rc = -EBUSY;
> 
> I think it's better to at least change it to:
> 
> if (PageDirty(page) && !sync && mapping->a_ops->migratepage != migrate_page))
> 
> I wasn't sure how to handle noblocking ->migratepage for swapcache and
> tmpfs but probably the above check is a good enough approximation.
> 

It's a good enough approximation. It's a little ugly but I don't think
it's much uglier than passing in unused parameters to migrate_page().

> Before sending my patch I thought of adding a "sync" parameter to
> ->migratepage(..., sync/nonblock) but then the patch become
> bigger... and I just wanted to know if this was the problem or not so
> I deferred it.
> 

I deferred it for similar reasons. It was becoming a much larger change
than should be necessary for the fix.

> If we're sure that all migratepage blocks except for things like
> swapcache/tmpfs or other not-filebacked things that defines it to
> migrate_page, we're pretty well covered by adding a check like above
> migratepage == migrate_page and maybe we don't need to add a
> "sync/nonblock" parameter to ->migratepage(). For example the
> buffer_migrate_page can block too in lock_buffer.
> 

Agreed.

> This is the patch I'm trying with the addition of the above check and
> some comment space/tab issue cleanup.
> 

Nothing bad jumped out at me. Let's see how it gets on with testing.

Thanks
Comment 26 Alex Villacis Lasso 2011-03-21 17:04:43 UTC
Created attachment 51532 [details]
sysrq-w trace with patch from comment #21 applied

As with the previous patch, this one did not completely solve the freezing tasks issue. However, as with the previous patch, the freezes took longer to appear, and now lasted less (10 to 12 seconds instead of freezing until the end of the usb copy).
Comment 27 Anonymous Emailer 2011-03-21 17:07:09 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 21/03/11 11:37, Mel Gorman wrote:
> On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote:
>
> Nothing bad jumped out at me. Lets see how it gets on with testing.
>
> Thanks
>
As with the previous patch, this one did not completely solve the freezing tasks issue. However, as with the previous patch, the freezes took longer to appear, and now lasted less (10 to 12 seconds instead of freezing until the end of the usb copy).

I have attached the new sysrq-w trace to the bug report.
Comment 28 Anonymous Emailer 2011-03-21 20:17:17 UTC
Reply-To: aarcange@redhat.com

On Mon, Mar 21, 2011 at 12:05:40PM -0500, Alex Villacís Lasso wrote:
> On 21/03/11 11:37, Mel Gorman wrote:
> > On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote:
> >
> > Nothing bad jumped out at me. Lets see how it gets on with testing.
> >
> > Thanks
> >
> As with the previous patch, this one did not completely solve the freezing
> tasks issue. However, as with the previous patch, the freezes took longer to
> appear, and now lasted less (10 to 12 seconds instead of freezing until the
> end of the usb copy).
> 
> I have attached the new sysrq-w trace to the bug report.

migrate and compaction disappeared from the traces as we hoped
for. The THP allocations that are left throttle on writeback during
reclaim like any 4k allocation would:

[ 2629.256809]  [<ffffffff810e43c3>] wait_on_page_writeback+0x1b/0x1d
[ 2629.256812]  [<ffffffff810e5992>] shrink_page_list+0x134/0x478
[ 2629.256815]  [<ffffffff810e614f>] shrink_inactive_list+0x29f/0x39a
[ 2629.256818]  [<ffffffff810dbd55>] ? zone_watermark_ok+0x1f/0x21
[ 2629.256820]  [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27
[ 2629.256823]  [<ffffffff810e6849>] shrink_zone+0x362/0x464
[ 2629.256827]  [<ffffffff810e6c87>] do_try_to_free_pages+0xdd/0x2e3
[ 2629.256830]  [<ffffffff810e70eb>] try_to_free_pages+0xaa/0xef
[ 2629.256833]  [<ffffffff810deede>] __alloc_pages_nodemask+0x4cc/0x772
[ 2629.256837]  [<ffffffff8110c0ea>] alloc_pages_vma+0xec/0xf1
[ 2629.256840]  [<ffffffff8111be94>] do_huge_pmd_anonymous_page+0xbf/0x267
[ 2629.256844]  [<ffffffff810f24a3>] ? pmd_offset+0x19/0x40
[ 2629.256846]  [<ffffffff810f5c7c>] handle_mm_fault+0x15d/0x20f
[ 2629.256850]  [<ffffffff8100f298>] ? arch_get_unmapped_area_topdown+0x1c3/0x28f
[ 2629.256853]  [<ffffffff814818cc>] do_page_fault+0x33b/0x35d
[ 2629.256856]  [<ffffffff810fb089>] ? do_mmap_pgoff+0x29a/0x2f4
[ 2629.256859]  [<ffffffff8112dd66>] ? path_put+0x22/0x27
[ 2629.256861]  [<ffffffff8147f285>] page_fault+0x25/0x30

They throttle on writeback I/O completion like kswapd too:

[ 2849.098751]  [<ffffffff8147d00b>] io_schedule+0x47/0x62
[ 2849.098756]  [<ffffffff8121c47b>] get_request_wait+0x10a/0x197
[ 2849.098760]  [<ffffffff8106cd77>] ? autoremove_wake_function+0x0/0x3d
[ 2849.098763]  [<ffffffff8121cd3c>] __make_request+0x2c8/0x3e0
[ 2849.098767]  [<ffffffff81114889>] ? kmem_cache_alloc+0x73/0xeb
[ 2849.098771]  [<ffffffff8121bbdf>] generic_make_request+0x2bc/0x336
[ 2849.098774]  [<ffffffff8121bd39>] submit_bio+0xe0/0xff
[ 2849.098777]  [<ffffffff8114d7a5>] ? bio_alloc_bioset+0x4d/0xc4
[ 2849.098781]  [<ffffffff810edf2b>] ? inc_zone_page_state+0x2d/0x2f
[ 2849.098785]  [<ffffffff811492ec>] submit_bh+0xe8/0x10e
[ 2849.098788]  [<ffffffff8114ba72>] __block_write_full_page+0x1ea/0x2da
[ 2849.098793]  [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf]
[ 2849.098796]  [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d
[ 2849.098799]  [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d
[ 2849.098802]  [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf]
[ 2849.098805]  [<ffffffff8114bbee>] block_write_full_page_endio+0x8c/0x98
[ 2849.098808]  [<ffffffff8114bc0f>] block_write_full_page+0x15/0x17
[ 2849.098811]  [<ffffffffa06e2027>] udf_writepage+0x18/0x1a [udf]
[ 2849.098814]  [<ffffffff810e44fd>] pageout+0x138/0x255
[ 2849.098817]  [<ffffffff810e5ad7>] shrink_page_list+0x279/0x478
[ 2849.098820]  [<ffffffff810e60ec>] shrink_inactive_list+0x23c/0x39a
[ 2849.098824]  [<ffffffff81481a46>] ? add_preempt_count+0xae/0xb2
[ 2849.098828]  [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27
[ 2849.098831]  [<ffffffff810e6849>] shrink_zone+0x362/0x464
[ 2849.098834]  [<ffffffff810dbdf8>] ? zone_watermark_ok_safe+0xa1/0xae
[ 2849.098837]  [<ffffffff810e773f>] kswapd+0x51c/0x89f

I'm unsure if there's any other problem left that can be attributed to
compaction/migrate (especially considering the THP allocations have no
__GFP_REPEAT set and should_continue_reclaim should break the loop if
nr_reclaim is zero, plus compaction_suitable requires not much more
memory to be reclaimed compared to no-compaction).

I'd suggest to try a few more times with "echo never >
/sys/kernel/mm/transparent_hugepage/enabled" and see if it still makes
a difference.

I doubt udf is as optimized as other filesystems in its locking and
writeback behaviour, but that may not be relevant to this issue (or at
least it most certainly wasn't until now, and what we saw so far
probably applied to all filesystems; from this point on things aren't
as clear as before, if responsiveness is still worse than without THP).
Another thing that I find confusing is udf on a USB stick: I'd expect
udf on a USB DVD, not a USB flash drive; I'd use vfat or ext4 on a
flash drive.
Comment 29 Anonymous Emailer 2011-03-21 23:36:35 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 21/03/11 15:16, Andrea Arcangeli wrote:
> On Mon, Mar 21, 2011 at 12:05:40PM -0500, Alex Villacís Lasso wrote:
>> On 21/03/11 11:37, Mel Gorman wrote:
>>> On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote:
>>>
>>> Nothing bad jumped out at me. Lets see how it gets on with testing.
>>>
>>> Thanks
>>>
>> As with the previous patch, this one did not completely solve the freezing
>> tasks issue. However, as with the previous patch, the freezes took longer to
>> appear, and now lasted less (10 to 12 seconds instead of freezing until the
>> end of the usb copy).
>>
>> I have attached the new sysrq-w trace to the bug report.
> migrate and compaction disappeared from the traces as we hoped
> for. The THP allocations left throttles on writeback during reclaim
> like any 4k allocation would do:
>
> [ 2629.256809]  [<ffffffff810e43c3>] wait_on_page_writeback+0x1b/0x1d
> [ 2629.256812]  [<ffffffff810e5992>] shrink_page_list+0x134/0x478
> [ 2629.256815]  [<ffffffff810e614f>] shrink_inactive_list+0x29f/0x39a
> [ 2629.256818]  [<ffffffff810dbd55>] ? zone_watermark_ok+0x1f/0x21
> [ 2629.256820]  [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27
> [ 2629.256823]  [<ffffffff810e6849>] shrink_zone+0x362/0x464
> [ 2629.256827]  [<ffffffff810e6c87>] do_try_to_free_pages+0xdd/0x2e3
> [ 2629.256830]  [<ffffffff810e70eb>] try_to_free_pages+0xaa/0xef
> [ 2629.256833]  [<ffffffff810deede>] __alloc_pages_nodemask+0x4cc/0x772
> [ 2629.256837]  [<ffffffff8110c0ea>] alloc_pages_vma+0xec/0xf1
> [ 2629.256840]  [<ffffffff8111be94>] do_huge_pmd_anonymous_page+0xbf/0x267
> [ 2629.256844]  [<ffffffff810f24a3>] ? pmd_offset+0x19/0x40
> [ 2629.256846]  [<ffffffff810f5c7c>] handle_mm_fault+0x15d/0x20f
> [ 2629.256850]  [<ffffffff8100f298>] ? arch_get_unmapped_area_topdown+0x1c3/0x28f
> [ 2629.256853]  [<ffffffff814818cc>] do_page_fault+0x33b/0x35d
> [ 2629.256856]  [<ffffffff810fb089>] ? do_mmap_pgoff+0x29a/0x2f4
> [ 2629.256859]  [<ffffffff8112dd66>] ? path_put+0x22/0x27
> [ 2629.256861]  [<ffffffff8147f285>] page_fault+0x25/0x30
>
> They throttle on writeback I/O completion like kswapd too:
>
> [ 2849.098751]  [<ffffffff8147d00b>] io_schedule+0x47/0x62
> [ 2849.098756]  [<ffffffff8121c47b>] get_request_wait+0x10a/0x197
> [ 2849.098760]  [<ffffffff8106cd77>] ? autoremove_wake_function+0x0/0x3d
> [ 2849.098763]  [<ffffffff8121cd3c>] __make_request+0x2c8/0x3e0
> [ 2849.098767]  [<ffffffff81114889>] ? kmem_cache_alloc+0x73/0xeb
> [ 2849.098771]  [<ffffffff8121bbdf>] generic_make_request+0x2bc/0x336
> [ 2849.098774]  [<ffffffff8121bd39>] submit_bio+0xe0/0xff
> [ 2849.098777]  [<ffffffff8114d7a5>] ? bio_alloc_bioset+0x4d/0xc4
> [ 2849.098781]  [<ffffffff810edf2b>] ? inc_zone_page_state+0x2d/0x2f
> [ 2849.098785]  [<ffffffff811492ec>] submit_bh+0xe8/0x10e
> [ 2849.098788]  [<ffffffff8114ba72>] __block_write_full_page+0x1ea/0x2da
> [ 2849.098793]  [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf]
> [ 2849.098796]  [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d
> [ 2849.098799]  [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d
> [ 2849.098802]  [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf]
> [ 2849.098805]  [<ffffffff8114bbee>] block_write_full_page_endio+0x8c/0x98
> [ 2849.098808]  [<ffffffff8114bc0f>] block_write_full_page+0x15/0x17
> [ 2849.098811]  [<ffffffffa06e2027>] udf_writepage+0x18/0x1a [udf]
> [ 2849.098814]  [<ffffffff810e44fd>] pageout+0x138/0x255
> [ 2849.098817]  [<ffffffff810e5ad7>] shrink_page_list+0x279/0x478
> [ 2849.098820]  [<ffffffff810e60ec>] shrink_inactive_list+0x23c/0x39a
> [ 2849.098824]  [<ffffffff81481a46>] ? add_preempt_count+0xae/0xb2
> [ 2849.098828]  [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27
> [ 2849.098831]  [<ffffffff810e6849>] shrink_zone+0x362/0x464
> [ 2849.098834]  [<ffffffff810dbdf8>] ? zone_watermark_ok_safe+0xa1/0xae
> [ 2849.098837]  [<ffffffff810e773f>] kswapd+0x51c/0x89f
>
> I'm unsure if there's any other problem left that can be attributed to
> compaction/migrate (especially considering the THP allocations have no
> __GFP_REPEAT set and should_continue_reclaim should break the loop if
> nr_reclaim is zero, plus compaction_suitable requires not much more
> memory to be reclaimed if compared to no-compaction).
>
> I'd suggest to try a few more times with "echo never >
> /sys/kernel/mm/transparent_hugepage/enabled" and see if it still makes
> a difference.
>
> I doubt udf is as optimal as other fs in optimizing locks and
> writebacks but that may not be relevant to this issue (or at least it
> most certainly wasn't until now and the same we had so far probably
> applied to all fs, from this point things aren't as clear as before if
> responsiveness is still worse than without THP). Anther thing that I
> find confusing is the udf vs USB stick, I'd expect udf on a usb dvd
> not a usbstick flash drive, I'd use vfat or ext4 on a flash drive.
>
By echoing "never" to /sys/kernel/mm/transparent_hugepage/enabled, the copy still goes smoothly without any application freezes, even with the recent patch applied.
Comment 30 Anonymous Emailer 2011-03-22 11:21:06 UTC
Reply-To: mel@csn.ul.ie

On Mon, Mar 21, 2011 at 09:16:41PM +0100, Andrea Arcangeli wrote:
> On Mon, Mar 21, 2011 at 12:05:40PM -0500, Alex Villacís Lasso wrote:
> > On 21/03/11 11:37, Mel Gorman wrote:
> > > On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote:
> > >
> > > Nothing bad jumped out at me. Lets see how it gets on with testing.
> > >
> > > Thanks
> > >
> > As with the previous patch, this one did not completely solve the freezing
> > tasks issue. However, as with the previous patch, the freezes took longer to
> > appear, and now lasted less (10 to 12 seconds instead of freezing until the
> > end of the usb copy).
> > 
> > I have attached the new sysrq-w trace to the bug report.
> 
> migrate and compaction disappeared from the traces as we hoped
> for. The THP allocations left throttles on writeback during reclaim
> like any 4k allocation would do:
> 
> [ 2629.256809]  [<ffffffff810e43c3>] wait_on_page_writeback+0x1b/0x1d
> [ 2629.256812]  [<ffffffff810e5992>] shrink_page_list+0x134/0x478
> [ 2629.256815]  [<ffffffff810e614f>] shrink_inactive_list+0x29f/0x39a
> [ 2629.256818]  [<ffffffff810dbd55>] ? zone_watermark_ok+0x1f/0x21
> [ 2629.256820]  [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27
> [ 2629.256823]  [<ffffffff810e6849>] shrink_zone+0x362/0x464
> [ 2629.256827]  [<ffffffff810e6c87>] do_try_to_free_pages+0xdd/0x2e3
> [ 2629.256830]  [<ffffffff810e70eb>] try_to_free_pages+0xaa/0xef
> [ 2629.256833]  [<ffffffff810deede>] __alloc_pages_nodemask+0x4cc/0x772
> [ 2629.256837]  [<ffffffff8110c0ea>] alloc_pages_vma+0xec/0xf1
> [ 2629.256840]  [<ffffffff8111be94>] do_huge_pmd_anonymous_page+0xbf/0x267
> [ 2629.256844]  [<ffffffff810f24a3>] ? pmd_offset+0x19/0x40
> [ 2629.256846]  [<ffffffff810f5c7c>] handle_mm_fault+0x15d/0x20f
> [ 2629.256850]  [<ffffffff8100f298>] ? arch_get_unmapped_area_topdown+0x1c3/0x28f
> [ 2629.256853]  [<ffffffff814818cc>] do_page_fault+0x33b/0x35d
> [ 2629.256856]  [<ffffffff810fb089>] ? do_mmap_pgoff+0x29a/0x2f4
> [ 2629.256859]  [<ffffffff8112dd66>] ? path_put+0x22/0x27
> [ 2629.256861]  [<ffffffff8147f285>] page_fault+0x25/0x30
> 

There is an important difference between THP and generic order-0 reclaim
though. Once defrag is enabled in THP, it can enter direct reclaim for
reclaim/compaction, where more pages may be reclaimed than for a base page
fault, thereby encountering more dirty pages and stalling.

> They throttle on writeback I/O completion like kswapd too:
> 
> [ 2849.098751]  [<ffffffff8147d00b>] io_schedule+0x47/0x62
> [ 2849.098756]  [<ffffffff8121c47b>] get_request_wait+0x10a/0x197
> [ 2849.098760]  [<ffffffff8106cd77>] ? autoremove_wake_function+0x0/0x3d
> [ 2849.098763]  [<ffffffff8121cd3c>] __make_request+0x2c8/0x3e0
> [ 2849.098767]  [<ffffffff81114889>] ? kmem_cache_alloc+0x73/0xeb
> [ 2849.098771]  [<ffffffff8121bbdf>] generic_make_request+0x2bc/0x336
> [ 2849.098774]  [<ffffffff8121bd39>] submit_bio+0xe0/0xff
> [ 2849.098777]  [<ffffffff8114d7a5>] ? bio_alloc_bioset+0x4d/0xc4
> [ 2849.098781]  [<ffffffff810edf2b>] ? inc_zone_page_state+0x2d/0x2f
> [ 2849.098785]  [<ffffffff811492ec>] submit_bh+0xe8/0x10e
> [ 2849.098788]  [<ffffffff8114ba72>] __block_write_full_page+0x1ea/0x2da
> [ 2849.098793]  [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf]
> [ 2849.098796]  [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d
> [ 2849.098799]  [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d
> [ 2849.098802]  [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf]
> [ 2849.098805]  [<ffffffff8114bbee>] block_write_full_page_endio+0x8c/0x98
> [ 2849.098808]  [<ffffffff8114bc0f>] block_write_full_page+0x15/0x17
> [ 2849.098811]  [<ffffffffa06e2027>] udf_writepage+0x18/0x1a [udf]
> [ 2849.098814]  [<ffffffff810e44fd>] pageout+0x138/0x255
> [ 2849.098817]  [<ffffffff810e5ad7>] shrink_page_list+0x279/0x478
> [ 2849.098820]  [<ffffffff810e60ec>] shrink_inactive_list+0x23c/0x39a
> [ 2849.098824]  [<ffffffff81481a46>] ? add_preempt_count+0xae/0xb2
> [ 2849.098828]  [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27
> [ 2849.098831]  [<ffffffff810e6849>] shrink_zone+0x362/0x464
> [ 2849.098834]  [<ffffffff810dbdf8>] ? zone_watermark_ok_safe+0xa1/0xae
> [ 2849.098837]  [<ffffffff810e773f>] kswapd+0x51c/0x89f
> 
> I'm unsure if there's any other problem left that can be attributed to
> compaction/migrate (especially considering the THP allocations have no
> __GFP_REPEAT set and should_continue_reclaim should break the loop if
> nr_reclaimed is zero, plus compaction_suitable requires not much more
> memory to be reclaimed if compared to no-compaction).
> 

I think we are breaking out (the report says the stalls aren't as bad),
but not before we have waited on writeback of a few dirty pages. This
could be addressed in a number of ways, but all of them impact THP in some way.

1. We could disable defrag by default. This will avoid the stalling at
   the cost of fewer pages being promoted even when plenty of clean pages
   were available.

2. We could redefine __GFP_NO_KSWAPD as __GFP_ASYNC to mean a) do not
   wake up kswapd, which generates IO and can cause syncs later, b) do
   not queue any pages for IO, and c) never wait on page writeback.
   This would also avoid stalls, but it would disrupt LRU ordering by
   reclaiming younger pages than would otherwise have been reclaimed.

3. Again redefine __GFP_NO_KSWAPD, but abort the allocation if any dirty
   or writeback page is encountered during reclaim. This assumes that
   dirty pages at the end of the LRU imply memory is under enough
   pressure that we need not care about promotion. This will also result
   in THP promoting fewer pages but has less impact on LRU ordering.

Which would you prefer? Other suggestions?
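As an illustration of option 3, here is a rough userspace sketch (plain C, with hypothetical flag values and a hypothetical function name; this is not the kernel's reclaim code): the scan reclaims clean pages as reclaim would, but aborts the whole attempt as soon as a dirty or writeback page turns up, leaving the fault to fall back to base pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page flags for the sketch; not the kernel's enum. */
#define PG_DIRTY     (1u << 0)
#define PG_WRITEBACK (1u << 1)

/* Option 3 sketched: scan a page list as reclaim would, but abort the
 * whole attempt the moment a dirty or writeback page is encountered,
 * on the assumption that such pages at the LRU tail mean the system is
 * too loaded to be worth stalling a THP fault for. Returns the number
 * of clean pages "reclaimed" before aborting (or scanning everything). */
static size_t reclaim_until_dirty(const unsigned int *pages, size_t n,
                                  bool *aborted)
{
    size_t reclaimed = 0;
    *aborted = false;
    for (size_t i = 0; i < n; i++) {
        if (pages[i] & (PG_DIRTY | PG_WRITEBACK)) {
            *aborted = true;   /* option 3: give up, fall back to 4k pages */
            break;
        }
        reclaimed++;           /* clean page: reclaim it without any IO */
    }
    return reclaimed;
}
```

The point of the sketch is that the allocation never issues or waits on IO: it either finds enough clean pages or gives up immediately.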
Comment 31 Anonymous Emailer 2011-03-22 15:03:49 UTC
Reply-To: aarcange@redhat.com

On Tue, Mar 22, 2011 at 11:20:32AM +0000, Mel Gorman wrote:
> On Mon, Mar 21, 2011 at 09:16:41PM +0100, Andrea Arcangeli wrote:
> > On Mon, Mar 21, 2011 at 12:05:40PM -0500, Alex Villacís Lasso wrote:
> > > On 21/03/11 11:37, Mel Gorman wrote:
> > > > On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote:
> > > >
> > > > Nothing bad jumped out at me. Lets see how it gets on with testing.
> > > >
> > > > Thanks
> > > >
> > > As with the previous patch, this one did not completely solve the
> > > freezing tasks issue. However, as with the previous patch, the freezes
> > > took longer to appear and now lasted less time (10 to 12 seconds instead
> > > of freezing until the end of the usb copy).
> > > 
> > > I have attached the new sysrq-w trace to the bug report.
> > 
> > migrate and compaction disappeared from the traces as we hoped
> > for. The THP allocations left throttles on writeback during reclaim
> > like any 4k allocation would do:
> > 
> > [ 2629.256809]  [<ffffffff810e43c3>] wait_on_page_writeback+0x1b/0x1d
> > [ 2629.256812]  [<ffffffff810e5992>] shrink_page_list+0x134/0x478
> > [ 2629.256815]  [<ffffffff810e614f>] shrink_inactive_list+0x29f/0x39a
> > [ 2629.256818]  [<ffffffff810dbd55>] ? zone_watermark_ok+0x1f/0x21
> > [ 2629.256820]  [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27
> > [ 2629.256823]  [<ffffffff810e6849>] shrink_zone+0x362/0x464
> > [ 2629.256827]  [<ffffffff810e6c87>] do_try_to_free_pages+0xdd/0x2e3
> > [ 2629.256830]  [<ffffffff810e70eb>] try_to_free_pages+0xaa/0xef
> > [ 2629.256833]  [<ffffffff810deede>] __alloc_pages_nodemask+0x4cc/0x772
> > [ 2629.256837]  [<ffffffff8110c0ea>] alloc_pages_vma+0xec/0xf1
> > [ 2629.256840]  [<ffffffff8111be94>] do_huge_pmd_anonymous_page+0xbf/0x267
> > [ 2629.256844]  [<ffffffff810f24a3>] ? pmd_offset+0x19/0x40
> > [ 2629.256846]  [<ffffffff810f5c7c>] handle_mm_fault+0x15d/0x20f
> > [ 2629.256850]  [<ffffffff8100f298>] ? arch_get_unmapped_area_topdown+0x1c3/0x28f
> > [ 2629.256853]  [<ffffffff814818cc>] do_page_fault+0x33b/0x35d
> > [ 2629.256856]  [<ffffffff810fb089>] ? do_mmap_pgoff+0x29a/0x2f4
> > [ 2629.256859]  [<ffffffff8112dd66>] ? path_put+0x22/0x27
> > [ 2629.256861]  [<ffffffff8147f285>] page_fault+0x25/0x30
> > 
> 
> There is an important difference between THP and generic order-0 reclaim
> though. Once defrag is enabled in THP, it can enter direct reclaim for
> reclaim/compaction, where more pages may be reclaimed than for a base page
> fault, thereby encountering more dirty pages and stalling.

Ok, but then 2M gets allocated. With 4k pages it's true there's less
work done for each 4k page, but reclaim runs many more times to
allocate 512 4k pages instead of a single 2M page. So I'm unsure
whether that should create a noticeable difference.
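To put numbers on the amortization argument, a single huge-page fault covers the same virtual range as 512 base-page faults (assuming the usual 4k base page and order-9 huge page; the constant names below are illustrative):

```c
#include <assert.h>

/* With 4k base pages (PAGE_SHIFT 12) and order-9 huge pages, a single
 * THP fault instantiates as much memory as 512 separate base-page
 * faults, so the reclaim work per fault is larger but runs far fewer
 * times over the same virtual range. */
enum { PAGE_SHIFT = 12, HPAGE_PMD_ORDER = 9 };

static unsigned long base_faults_per_huge_fault(void)
{
    return 1ul << HPAGE_PMD_ORDER;                /* 512 faults */
}

static unsigned long huge_page_bytes(void)
{
    return 1ul << (PAGE_SHIFT + HPAGE_PMD_ORDER); /* 2 MB */
}
```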

> > They throttle on writeback I/O completion like kswapd too:
> > 
> > [ 2849.098751]  [<ffffffff8147d00b>] io_schedule+0x47/0x62
> > [ 2849.098756]  [<ffffffff8121c47b>] get_request_wait+0x10a/0x197
> > [ 2849.098760]  [<ffffffff8106cd77>] ? autoremove_wake_function+0x0/0x3d
> > [ 2849.098763]  [<ffffffff8121cd3c>] __make_request+0x2c8/0x3e0
> > [ 2849.098767]  [<ffffffff81114889>] ? kmem_cache_alloc+0x73/0xeb
> > [ 2849.098771]  [<ffffffff8121bbdf>] generic_make_request+0x2bc/0x336
> > [ 2849.098774]  [<ffffffff8121bd39>] submit_bio+0xe0/0xff
> > [ 2849.098777]  [<ffffffff8114d7a5>] ? bio_alloc_bioset+0x4d/0xc4
> > [ 2849.098781]  [<ffffffff810edf2b>] ? inc_zone_page_state+0x2d/0x2f
> > [ 2849.098785]  [<ffffffff811492ec>] submit_bh+0xe8/0x10e
> > [ 2849.098788]  [<ffffffff8114ba72>] __block_write_full_page+0x1ea/0x2da
> > [ 2849.098793]  [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf]
> > [ 2849.098796]  [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d
> > [ 2849.098799]  [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d
> > [ 2849.098802]  [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf]
> > [ 2849.098805]  [<ffffffff8114bbee>] block_write_full_page_endio+0x8c/0x98
> > [ 2849.098808]  [<ffffffff8114bc0f>] block_write_full_page+0x15/0x17
> > [ 2849.098811]  [<ffffffffa06e2027>] udf_writepage+0x18/0x1a [udf]
> > [ 2849.098814]  [<ffffffff810e44fd>] pageout+0x138/0x255
> > [ 2849.098817]  [<ffffffff810e5ad7>] shrink_page_list+0x279/0x478
> > [ 2849.098820]  [<ffffffff810e60ec>] shrink_inactive_list+0x23c/0x39a
> > [ 2849.098824]  [<ffffffff81481a46>] ? add_preempt_count+0xae/0xb2
> > [ 2849.098828]  [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27
> > [ 2849.098831]  [<ffffffff810e6849>] shrink_zone+0x362/0x464
> > [ 2849.098834]  [<ffffffff810dbdf8>] ? zone_watermark_ok_safe+0xa1/0xae
> > [ 2849.098837]  [<ffffffff810e773f>] kswapd+0x51c/0x89f
> > 
> > I'm unsure if there's any other problem left that can be attributed to
> > compaction/migrate (especially considering the THP allocations have no
> > __GFP_REPEAT set and should_continue_reclaim should break the loop if
> > nr_reclaimed is zero, plus compaction_suitable requires not much more
> > memory to be reclaimed if compared to no-compaction).
> > 
> 
> I think we are breaking out (the report says the stalls aren't as bad),
> but not before we have waited on writeback of a few dirty pages. This
> could be addressed in a number of ways, but all of them impact THP in some
> way.
> 
> 1. We could disable defrag by default. This will avoid the stalling at
>    the cost of fewer pages being promoted even when plenty of clean pages
>    were available.
> 
> 2. We could redefine __GFP_NO_KSWAPD as __GFP_ASYNC to mean a) do not
>    wake up kswapd, which generates IO and can cause syncs later, b) do
>    not queue any pages for IO, and c) never wait on page writeback.
>    This would also avoid stalls, but it would disrupt LRU ordering by
>    reclaiming younger pages than would otherwise have been reclaimed.
> 
> 3. Again redefine __GFP_NO_KSWAPD, but abort the allocation if any dirty
>    or writeback page is encountered during reclaim. This assumes that
>    dirty pages at the end of the LRU imply memory is under enough
>    pressure that we need not care about promotion. This will also result
>    in THP promoting fewer pages but has less impact on LRU ordering.
> 
> Which would you prefer? Other suggestions?

I'm not particularly excited by any of the above options, first
because they decrease the reliability of allocations, and more
importantly because they won't solve anything for non-THP allocations,
which would still create the same problem given that compaction is the
issue, as he verified that setting defrag=no solves the problem (think
of SLUB, which even for a radix tree allocation tries order 2 first,
just like THP tries order 9 before order 0; neither SLUB nor the
kernel stack allocation uses anything like __GFP_ASYNC). So I'm afraid
we're more likely to hide the issue by tackling it entirely on the THP
side.

I asked Alex yesterday by private mail whether the mouse pointer moves
during the stalls (if it doesn't, that may be a scheduler issue with
compaction running with irqs disabled and lacking a cond_resched) and
to try aa.git. Upstream still misses several compaction improvements
that we made over the last weeks, which are in my queue and in -mm as
well. So before making more changes, considering the stack traces look
very healthy now, I'd wait to be sure the hangs aren't already solved
by one of the other scheduling/irq latency fixes. I guess they aren't
going to help, but it's worth a try. Verifying whether this happens
with a more optimal filesystem like ext4 is also interesting; it may
be something in udf's internal locking that gets in the way of
compaction.

If we still have a problem with current aa.git and ext4, then I'd hope
we can find some other genuine bit to improve, like the bits we've
improved so far. But if there's nothing wrong and it proves unfixable,
then my preference would be either to create a defrag mode that is in
between "yes" and "no", or, to keep it simpler, to make the default
between defrag yes|no configurable at build time and through a kernel
command line option, and hope that SLUB doesn't clash with it too. The
current "default" is optimal for several server environments where we
know most of the allocations are long lived, so we want to keep an
option to be as reliable as we are today for those.
Comment 32 Anonymous Emailer 2011-03-22 20:35:50 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 22/03/11 10:03, Andrea Arcangeli wrote:
>
> I asked Alex yesterday by private mail whether the mouse pointer moves
> during the stalls (if it doesn't, that may be a scheduler issue with
> compaction running with irqs disabled and lacking a cond_resched) and
> to try aa.git. Upstream still misses several compaction improvements
> that we made over the last weeks, which are in my queue and in -mm as
> well. So before making more changes, considering the stack traces look
> very healthy now, I'd wait to be sure the hangs aren't already solved
> by one of the other scheduling/irq latency fixes. I guess they aren't
> going to help, but it's worth a try. Verifying whether this happens
> with a more optimal filesystem like ext4 is also interesting; it may
> be something in udf's internal locking that gets in the way of
> compaction.
>
> If we still have a problem with current aa.git and ext4, then I'd hope
> we can find some other genuine bit to improve, like the bits we've
> improved so far. But if there's nothing wrong and it proves unfixable,
> then my preference would be either to create a defrag mode that is in
> between "yes" and "no", or, to keep it simpler, to make the default
> between defrag yes|no configurable at build time and through a kernel
> command line option, and hope that SLUB doesn't clash with it too. The
> current "default" is optimal for several server environments where we
> know most of the allocations are long lived, so we want to keep an
> option to be as reliable as we are today for those.
>
I have just tested aa.git as of today, with the USB stick formatted as FAT32. I could no longer reproduce the stalls. There was no need to format as ext4. No /proc workarounds required.
Comment 33 Anonymous Emailer 2011-03-22 21:40:54 UTC
Reply-To: aarcange@redhat.com

On Tue, Mar 22, 2011 at 03:34:10PM -0500, Alex Villacís Lasso wrote:
> I have just tested aa.git as of today, with the USB stick formatted
> as FAT32. I could no longer reproduce the stalls. There was no need
> to format as ext4. No /proc workarounds required.

Sounds good.

So, Andrew, the patches to apply to solve this most certainly are:

http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-compaction-minimise-the-time-irqs-are-disabled-while-isolating-pages-for-migration.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-compaction-minimise-the-time-irqs-are-disabled-while-isolating-pages-for-migration-fix.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-compaction-minimise-the-time-irqs-are-disabled-while-isolating-free-pages.patch
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=6daf7ff3adc1a243aa9f5a77c7bde2b713a3188a (not in -mm, posted to linux-mm Message-ID 20110321134832.GC5719)

Very likely it's the combination of all of the above that matters: each
is equally important and needed for this specific compaction issue.

==== rest of aa.git not relevant for this bugreport below ====

http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-compaction-prevent-kswapd-compacting-memory-to-reduce-cpu-usage.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmscan-kswapd-should-not-free-an-excessive-number-of-pages-when-balancing-small-zones.patch

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=cb107ebbb7541e5442fd897436440e71835b6496 (not in -mm, posted to linux-mm, Message-ID: alpine.LSU.2.00.1103192318100.1877)

http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-add-__gfp_other_node-flag.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-add-__gfp_other_node-flag-checkpatch-fixes.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-use-__gfp_other_node-for-transparent-huge-pages.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-use-__gfp_other_node-for-transparent-huge-pages-checkpatch-fixes.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-add-vm-counters-for-transparent-hugepages.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-add-vm-counters-for-transparent-hugepages-checkpatch-fixes.patch

smaps* (5 patches in -mm)


This is an experimental new feature that should improve mremap
significantly regardless of THP being on or off (with a bigger boost
with THP on). I've been using it for weeks without problems, and I'd
suggest it for inclusion too. The version in the link below is the
most up to date. It fixes a build problem
(s/__split_huge_page_pmd/split_huge_page_pmd/ in move_page_tables)
with CONFIG_TRANSPARENT_HUGEPAGE=n compared to the last version I
posted to linux-mm. But this is low priority.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=0e6f8bd8802c3309195d3e1a7af50093ed488f2d
Comment 34 Anonymous Emailer 2011-03-23 00:37:59 UTC
Reply-To: aarcange@redhat.com

Hi Alex,

could you also try reversing the bit below (not the whole previous
patch: only the bit quoted below) with "patch -p1 -R < thismail"
on top of your current aa.git tree, and see if you notice any
regression compared to the previous aa.git build that worked well?

This is part of the fix, but I need to be sure it really makes a
difference before sticking with it for long. I'm not concerned about
keeping it, but it adds dirt, and the closer THP allocations are to any
other high order allocation the better, so the less __GFP_NO_KSWAPD
affects things the better. Hinting kswapd not to insist in the
background for order 9 allocations with fallback (like THP) is the most
I consider clean, because khugepaged with its alloc_sleep_millisecs
replaces the kswapd task for THP allocations. That much is clean
enough, but when __GFP_NO_KSWAPD starts to make compaction behave
slightly differently from a SLUB order 2 allocation I don't like it
(especially because if you later enable SLUB or some driver you may run
into the same compaction issue again, if the change below is what makes
the difference).

If things work fine even after you reverse the change below, we can
safely undo it and also feel safer about all other high order
allocations, so it'll make life easier. (Plus we don't want
unnecessary special changes; we need to be sure this makes a
difference to keep it for long.)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2085,7 +2085,7 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
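The one-liner above amounts to a small policy decision; here it is modeled as a standalone sketch (the flag value and function name are illustrative, not the kernel's real definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit only; the real __GFP_NO_KSWAPD value differs. */
#define GFP_NO_KSWAPD (1u << 15)

/* With the hunk applied, allocations that opted out of kswapd (i.e. THP)
 * also skip synchronous migration after the first async compaction
 * attempt fails, so they never block on page writeback there. Without
 * it, every caller escalates to sync migration on retry. */
static bool retry_migration_sync(unsigned int gfp_mask, bool patch_applied)
{
    if (patch_applied)
        return !(gfp_mask & GFP_NO_KSWAPD); /* THP stays async */
    return true;                            /* old behavior: always sync */
}
```

Reverting the hunk (the "-R" test being requested here) makes the function unconditionally return true, i.e. THP allocations escalate to sync migration like everyone else.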
Comment 35 Alex Villacis Lasso 2011-03-23 16:30:39 UTC
Created attachment 51782 [details]
sysrq-w trace for aa.git kernel

Bad news. Apparently the stalling *is* filesystem-sensitive. I reformatted the USB stick to use UDF instead of FAT32, and I managed to trigger the application freezes again on the aa.git kernel. Trace attached.

BTW, I was using UDF on the stick because I needed a filesystem without user permissions that could hold a file of about 5 GB without splitting it, and UDF seemed the best fit for the task.
Comment 36 Alex Villacis Lasso 2011-03-23 16:46:15 UTC
Created attachment 51792 [details]
second sysrq-w trace for aa.git kernel

I have just triggered the freeze for the stick with vfat on the aa.git kernel, although I had to wait a bit longer. Trace attached.
Comment 37 Anonymous Emailer 2011-03-23 16:53:09 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 22/03/11 19:37, Andrea Arcangeli wrote:
> Hi Alex,
>
> could you also try reversing the bit below (not the whole previous
> patch: only the bit quoted below) with "patch -p1 -R < thismail"
> on top of your current aa.git tree, and see if you notice any
> regression compared to the previous aa.git build that worked well?
>
> This is part of the fix, but I need to be sure it really makes a
> difference before sticking with it for long. I'm not concerned about
> keeping it, but it adds dirt, and the closer THP allocations are to any
> other high order allocation the better, so the less __GFP_NO_KSWAPD
> affects things the better. Hinting kswapd not to insist in the
> background for order 9 allocations with fallback (like THP) is the most
> I consider clean, because khugepaged with its alloc_sleep_millisecs
> replaces the kswapd task for THP allocations. That much is clean
> enough, but when __GFP_NO_KSWAPD starts to make compaction behave
> slightly differently from a SLUB order 2 allocation I don't like it
> (especially because if you later enable SLUB or some driver you may run
> into the same compaction issue again, if the change below is what makes
> the difference).
>
> If things work fine even after you reverse the change below, we can
> safely undo it and also feel safer about all other high order
> allocations, so it'll make life easier. (Plus we don't want
> unnecessary special changes; we need to be sure this makes a
> difference to keep it for long.)
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2085,7 +2085,7 @@ rebalance:
>                                       sync_migration);
>       if (page)
>               goto got_pg;
> -     sync_migration = true;
> -     sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>
>       /* Try direct reclaim and then allocating */
>       page = __alloc_pages_direct_reclaim(gfp_mask, order,
>

> On Tue, Mar 22, 2011 at 03:34:10PM -0500, Alex Villacís Lasso wrote:
>> >  I have just tested aa.git as of today, with the USB stick formatted
>> >  as FAT32. I could no longer reproduce the stall
> Probably udf is not optimized enough, but I wonder if maybe the
> udf->vfat change helped more than the other patches. We need the other
> patches anyway to provide responsive behavior, including the one you
> tested before aa.git, so it's not very important whether udf was the
> problem, but it might have been.
>
I tried reformatting the stick as UDF to check whether the stall was filesystem-sensitive. Apparently it is. I managed to induce the freeze in firefox while performing the same copy on the aa.git kernel. Then I reformatted the stick as FAT32 and repeated
the test, and it also induced freezes, although they were a bit shorter and occurred later in the copy's progress. I have attached the traces to the bug report. All of this is with the kernel before reversing the quoted patch.
Comment 38 Rafael J. Wysocki 2011-03-27 19:55:20 UTC
*** Bug 28432 has been marked as a duplicate of this bug. ***
Comment 39 Florian Mickler 2011-03-28 22:49:18 UTC
A patch referencing this bug report has been merged in v2.6.38-8569-g16c29da:

commit 11bc82d67d1150767901bca54a24466621d763d7
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Tue Mar 22 16:33:11 2011 -0700

    mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback
Comment 40 Alex Villacis Lasso 2011-03-30 20:30:09 UTC
The original issue seems to have been greatly reduced, if not completely eliminated, in 2.6.39-rc1. I did the same old test: the 4 GB copy to the VFAT-formatted USB stick.
Comment 41 Florian Mickler 2011-03-30 21:08:31 UTC
Thanks for the update. 

Should this bug be closed then?
If not, can you contact Andrea and let him know that this is still a problem?

Thanks,
Flo
Comment 42 Anonymous Emailer 2011-04-04 15:39:27 UTC
Reply-To: avillaci@fiec.espol.edu.ec

Latest update: with 2.6.39-rc1 the stalls only last for a few seconds, but they are still there. So, as I said before, they are reduced but not eliminated.
Comment 43 Alex Villacis Lasso 2011-04-04 15:42:35 UTC
Created attachment 53442 [details]
sysrq-w trace for 2.6.39-rc1 kernel

I got a longer stall that froze firefox. Trace attached.
Comment 44 Anonymous Emailer 2011-04-08 19:09:45 UTC
Reply-To: aarcange@redhat.com

Hi Alex,

On Mon, Apr 04, 2011 at 10:37:44AM -0500, Alex Villacís Lasso wrote:
> Latest update: with 2.6.39-rc1 the stalls only last for a few
> seconds, but they are still there. So, as I said before, they are
> reduced but not eliminated.

Ok, going from a complete stall for the whole duration of the I/O to a
few-second stall, we're clearly heading in the right direction.

The few-second stalls happen with udf? And vfat doesn't stall?

Minchan rightly pointed out that the change we made in page_alloc.c
changes the semantics of the __GFP_NO_KSWAPD bit. I also talked with
Mel about it. We think it's nicer if we can keep THP allocations as
close to any other high order allocation as possible. There are
already plenty of __GFP bits with complex semantics. __GFP_NO_KSWAPD
is simple, and it's nice for it to stay simple: it means the
allocation relies on a different kernel daemon for the background work
(khugepaged instead of kswapd in the THP case, where khugepaged uses
non-intrusive alloc_sleep_millisecs throttling in case of MM
congestion, unlike kswapd).

So this is no solution to your problem (if vfat already works, I
think that might be a better solution), but we'd like to know whether
you get any _worse_ stalls compared to current 2.6.39-rc by applying
the patch below on top of 2.6.39-rc. If it doesn't make any
difference, we can safely apply it to remove unnecessary
complications.

Thanks,
Andrea

===
Subject: compaction: reverse the change that forbade sync migration with __GFP_NO_KSWAPD

From: Andrea Arcangeli <aarcange@redhat.com>

It's uncertain whether this has been beneficial, so it's safer to undo it.
All other compaction users would still go into synchronous mode if a first
attempt at async compaction failed. Hopefully we don't need to force special
behavior for THP (which is the only __GFP_NO_KSWAPD user so far and the
easiest to exercise and notice). This also makes __GFP_NO_KSWAPD return to
its original strict semantics of bypassing kswapd, as THP allocations have
khugepaged for the async THP allocations/compactions.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2105,7 +2105,7 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
+	sync_migration = true;
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
Comment 45 Anonymous Emailer 2011-04-08 20:08:02 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 08/04/11 14:09, Andrea Arcangeli wrote:
> Hi Alex,
>
> On Mon, Apr 04, 2011 at 10:37:44AM -0500, Alex Villacís Lasso wrote:
>> Latest update: with 2.6.39-rc1 the stalls only last for a few
>> seconds, but they are still there. So, as I said before, they are
>> reduced but not eliminated.
> Ok from a complete stall for the whole duration of the I/O to a few
> second stall we're clearly going into the right direction.
>
> The few second stalls happen with udf? And vfat doesn't stall?
>
> Minchan rightly pointed out that the (panik) change we made in
> page_alloc.c changes the semantics of the __GFP_NO_KSWAPD bit. I also
> talked with Mel about it. We think it's nicer if we can keep THP
> allocations as close as any other high order allocation as
> possible. There are already plenty of __GFP bits with complex
> semantics. __GFP_NO_KSWAPD is simple and it's nice to stay simple: it
> means the allocation relies on a different kernel daemon for the
> background work (which is khugepaged instead of kswapd in the THP
> case, where khugepaged uses a non intrusive alloc_sleep_millisec
> throttling in case of MM congestion, unlike kswapd would do).
>
> So this is no solution to your problem (if vfat already works I think
> that might be a better solution), but we'd like to know if you get any
> _worse_ stall compared to current 2.6.39-rc, by applying the below
> patch on top of 2.6.39-rc. If this doesn't make any difference, we can
> safely apply it to remove unnecessary complications.
>
> Thanks,
> Andrea
>
> ===
> Subject: compaction: reverse the change that forbade sync migration with
> __GFP_NO_KSWAPD
>
> From: Andrea Arcangeli<aarcange@redhat.com>
>
> It's uncertain this has been beneficial, so it's safer to undo it. All other
> compaction users would still go in synchronous mode if a first attempt of
> async
> compaction failed. Hopefully we don't need to force special behavior for THP
> (which is the only __GFP_NO_KSWAPD user so far and it's the easier to
> exercise
> and to be noticeable). This also make __GFP_NO_KSWAPD return to its original
> strict semantics specific to bypass kswapd, as THP allocations have
> khugepaged
> for the async THP allocations/compactions.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> ---
>   mm/page_alloc.c |    2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2105,7 +2105,7 @@ rebalance:
>                                       sync_migration);
>       if (page)
>               goto got_pg;
> -     sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
> +     sync_migration = true;
>
>       /* Try direct reclaim and then allocating */
>       page = __alloc_pages_direct_reclaim(gfp_mask, order,
>
The stalls occur even with vfat. I am no longer using udf, since (right now) it is not necessary. I will test this patch now.
Comment 46 Anonymous Emailer 2011-04-12 16:29:38 UTC
Reply-To: avillaci@fiec.espol.edu.ec

El 08/04/11 15:06, Alex Villací­s Lasso escribió:
> On 08/04/11 14:09, Andrea Arcangeli wrote:
>> Hi Alex,
>>
>> On Mon, Apr 04, 2011 at 10:37:44AM -0500, Alex Villacís Lasso wrote:
>>> Latest update: with 2.6.39-rc1 the stalls only last for a few
>>> seconds, but they are still there. So, as I said before, they are
>>> reduced but not eliminated.
>> Ok from a complete stall for the whole duration of the I/O to a few
>> second stall we're clearly going into the right direction.
>>
>> The few second stalls happen with udf? And vfat doesn't stall?
>>
>> Minchan rightly pointed out that the (panik) change we made in
>> page_alloc.c changes the semantics of the __GFP_NO_KSWAPD bit. I also
>> talked with Mel about it. We think it's nicer if we can keep THP
>> allocations as close as any other high order allocation as
>> possible. There are already plenty of __GFP bits with complex
>> semantics. __GFP_NO_KSWAPD is simple and it's nice to stay simple: it
>> means the allocation relies on a different kernel daemon for the
>> background work (which is khugepaged instead of kswapd in the THP
>> case, where khugepaged uses a non intrusive alloc_sleep_millisec
>> throttling in case of MM congestion, unlike kswapd would do).
>>
>> So this is no solution to your problem (if vfat already works I think
>> that might be a better solution), but we'd like to know if you get any
>> _worse_ stall compared to current 2.6.39-rc, by applying the below
>> patch on top of 2.6.39-rc. If this doesn't make any difference, we can
>> safely apply it to remove unnecessary complications.
>>
>> Thanks,
>> Andrea
>>
>> ===
>> Subject: compaction: reverse the change that forbade sync migration with
>> __GFP_NO_KSWAPD
>>
>> From: Andrea Arcangeli<aarcange@redhat.com>
>>
>> It's uncertain this has been beneficial, so it's safer to undo it. All other
>> compaction users would still go in synchronous mode if a first attempt of
>> async
>> compaction failed. Hopefully we don't need to force special behavior for THP
>> (which is the only __GFP_NO_KSWAPD user so far and it's the easier to
>> exercise
>> and to be noticeable). This also make __GFP_NO_KSWAPD return to its original
>> strict semantics specific to bypass kswapd, as THP allocations have
>> khugepaged
>> for the async THP allocations/compactions.
>>
>> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
>> ---
>>   mm/page_alloc.c |    2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2105,7 +2105,7 @@ rebalance:
>>                       sync_migration);
>>       if (page)
>>           goto got_pg;
>> -    sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>> +    sync_migration = true;
>>
>>       /* Try direct reclaim and then allocating */
>>       page = __alloc_pages_direct_reclaim(gfp_mask, order,
>>
> The stalls occur even with vfat. I am no longer using udf, since (right now)
> it is not necessary. I will test this patch now.
>
From preliminary tests, I feel that the patch actually eliminates the stalls. I have just copied nearly 6 GB of data to my USB stick and noticed no application freezes.
Comment 47 Anonymous Emailer 2011-04-14 17:27:39 UTC
Reply-To: avillaci@fiec.espol.edu.ec

On 12/04/11 11:27, Alex Villacís Lasso wrote:
>>> ===
>>> Subject: compaction: reverse the change that forbade sync migration with
>>> __GFP_NO_KSWAPD
>>>
>>> From: Andrea Arcangeli<aarcange@redhat.com>
>>>
>>> It's uncertain this has been beneficial, so it's safer to undo it. All
>>> other
>>> compaction users would still go in synchronous mode if a first attempt of
>>> async
>>> compaction failed. Hopefully we don't need to force special behavior for
>>> THP
>>> (which is the only __GFP_NO_KSWAPD user so far and it's the easier to
>>> exercise
>>> and to be noticeable). This also make __GFP_NO_KSWAPD return to its
>>> original
>>> strict semantics specific to bypass kswapd, as THP allocations have
>>> khugepaged
>>> for the async THP allocations/compactions.
>>>
>>> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
>>> ---
>>>   mm/page_alloc.c |    2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2105,7 +2105,7 @@ rebalance:
>>>                       sync_migration);
>>>       if (page)
>>>           goto got_pg;
>>> -    sync_migration = !(gfp_mask&  __GFP_NO_KSWAPD);
>>> +    sync_migration = true;
>>>
>>>       /* Try direct reclaim and then allocating */
>>>       page = __alloc_pages_direct_reclaim(gfp_mask, order,
>>>
>> The stalls occur even with vfat. I am no longer using udf, since (right now)
>> it is not necessary. I will test this patch now.
>>
>
> From preliminary tests, I feel that the patch actually eliminates the stalls.
> I have just copied nearly 6 GB of data into my USB stick and noticed no
> application freezes.
>
I retract that. I have tested 2.6.39-rc3 after a day of having several heavy applications loaded in memory, and the stalls do get worse when reversing the patch.
Comment 48 Anonymous Emailer 2011-04-14 17:38:15 UTC
Reply-To: aarcange@redhat.com

Hello Alex,

On Thu, Apr 14, 2011 at 12:25:48PM -0500, Alex Villacís Lasso wrote:
> I retract that. I have tested 2.6.39-rc3 after a day of having
> several heavy applications loaded in memory, and the stalls do get
> worse when reversing the patch.

Well, I was already afraid your stalling wasn't 100% reproducible; it
depends on the background load, like you said. If it never happens
before you start the heavy applications, that is a good enough result
for now. When we throttle on I/O and heavy apps are running, things
may stall even without the USB drive: the more memory pressure there
is, the more every little write() or memory allocation may stall too,
regardless of compaction. We have no perfection in that area yet (the
best way to tackle this is the per-process write-throttling work).

Bottom line: the additional congestion created by the heavy apps is
not easy to quantify or to reproduce, so it is quite possible the
behavior was the same as before. And in general, if we want to apply
that change, it is cleaner to do it unconditionally for all allocation
orders rather than as a function of __GFP_NO_KSWAPD. So I think the
patch is still good to go.
Comment 49 dE 2012-05-31 05:53:03 UTC
I have also been affected by this for a very long time; after no one could give me a solution on various forums or on the kernel.org mailing list, I decided to report it here.

This issue is most reproducible with slow external storage devices, and even with various network file systems.

Tweaking the vm.* sysctls does not help much.

As far as I remember, this issue started with 2.6.38.

Also, when a large amount of RAM is available (relative to the processes running), the issue is not reproducible.
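For context, "tweaking vm.*" in reports like this usually refers to the dirty-page writeback sysctls. A sketch of the commonly tried settings follows; the values are illustrative only, and per this report they did not help much here:

```shell
# Shrink how much dirty page cache the kernel buffers before forcing
# writeback, so bulk writes to slow devices queue less data.
# Illustrative values, not a recommendation; requires root.
sysctl -w vm.dirty_background_ratio=5   # start background writeback earlier
sysctl -w vm.dirty_ratio=10             # block writers sooner
```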
Comment 50 dE 2012-05-31 06:14:23 UTC

I can confirm this issue on both Debian running the official generic kernel and Gentoo running a custom kernel.
Comment 51 Alan 2012-06-13 15:03:27 UTC
dE: Please open a bug of your own against a current kernel; the cause of any stalling may well be very different.

Closing the original bug as codefix