Bug 31142
Summary: | Large write to USB stick freezes unrelated tasks for a long time | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Alex Villacis Lasso (avillaci) |
Component: | Block Layer | Assignee: | Jens Axboe (axboe) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | alan, de.techno, florian, maciej.rutecki, makovick |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.38-rc8 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
kernel backtraces from hung Eclipse task while writing to usb stick
sysrw-w trace with hung thunderbird and gedit
sysrq-w trace with patch applied
sysrq-w trace with patch from comment #21 applied
sysrq-w trace for aa.git kernel
second sysrq-w trace for aa.git kernel
sysrq-w trace for 2.6.39-rc1 kernel |
I must note that no such freeze happens when I *read* files from USB to the hard disk, no matter how large the transfer.

(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Tue, 15 Mar 2011 15:55:52 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=31142
>
> Summary: Large write to USB stick freezes unrelated tasks for a long time
> Product: IO/Storage
> Version: 2.5
> Kernel Version: 2.6.38-rc8
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: normal
> Priority: P1
> Component: Block Layer
> AssignedTo: axboe@kernel.dk
> ReportedBy: avillaci@ceibo.fiec.espol.edu.ec
> Regression: No
>
> Created an attachment (id=50902)
> --> (https://bugzilla.kernel.org/attachment.cgi?id=50902)
> kernel backtraces from hung Eclipse task while writing to usb stick
>
> System is Fedora 14 x86_64 with 4 GB RAM, running vanilla kernel 2.6.38-rc8.
>
> I have a USB 2.0 high-speed memory stick with around 7.5 GB of space. Whenever I write a large amount of data (several GBs of files) through any means (cp, nautilus GUI, etc), I notice some large applications that I consider unrelated to the I/O operation (Firefox web browser, Thunderbird email viewer, Eclipse IDE) may randomly freeze whenever I try to interact with them. I use Compiz, and I notice the apps getting grayed out, but I have also seen the freeze happening with Metacity and Gnome-shell, so I believe the window manager is irrelevant. Sometimes other smaller tasks (gnome-terminal, gedit) also freeze. For Eclipse, the hang also causes a series of kernel backtraces, attached to this report. The hang usually lasts for several tens of seconds, and may freeze and unfreeze several times while the file copying to USB takes place. All of the hung applications unfreeze themselves after write activity (as seen from the LED in the memory stick) ceases.
>
> Reproducibility: always (with sufficiently large bulk write)
> To reproduce:
> 1) have a usb stick with several GB of free space, with any filesystem (tried vfat and udf)
> 2) prepare several gb of files to copy from hard disk to usb stick
> 3) start large application (firefox, eclipse, or thunderbird)
> 4) check that application is responsive before file copy starts
> 5) insert usb stick and (auto)mount it. Previously started app is still responsive.
> 6) start file copy to usb stick with any command
> 7) attempt to interact with chosen application during the entirety of the file write
> Expected result: I/O to usb stick takes place in background, unrelated apps continue to be responsive in foreground.
> Actual result: some large tasks freeze for tens of seconds while write takes place.
>
> Feel free to reassign this bug to a different category. It involves I/O, block, USB, and mmap.

rofl, will we ever fix this.

Please enable sysrq and do a sysrq-w when the tasks are blocked so we can find where things are getting stuck. Please avoid email client wordwrapping when sending us the sysrq output.

Reply-To: avillaci@fiec.espol.edu.ec

On 15/03/11 15:53, Andrew Morton wrote:
> rofl, will we ever fix this.

Does this mean there is already a duplicate of this issue? If so, which one?

> Please enable sysrq and do a sysrq-w when the tasks are blocked so we
> can find where things are getting stuck. Please avoid email client
> wordwrapping when sending us the sysrq output.

On Tue, 15 Mar 2011 17:53:16 -0500 Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:

> On 15/03/11 15:53, Andrew Morton wrote:
> > rofl, will we ever fix this.
>
> Does this mean there is already a duplicate of this issue? If so, which one?

Nothing specific. Nonsense like this has been happening for at least a decade and it never seems to get a lot better.

> > Please enable sysrq and do a sysrq-w when the tasks are blocked so we
> > can find where things are getting stuck. Please avoid email client
> > wordwrapping when sending us the sysrq output.

Created attachment 50952 [details]
sysrw-w trace with hung thunderbird and gedit
This report was generated with 2.6.38. I started copying about 3 GB of data to my usb stick (with nautilus), and after a while, thunderbird froze. I did the sysrq-w action, but when pasting the report into gedit and trying to save, gedit also froze. This report shows the second sysrq report. Does this help?
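For anyone else reproducing this, the sysrq-w capture requested earlier in the thread can be scripted. What follows is only a minimal sketch: it assumes root access, a kernel built with magic sysrq support, and that the dump is still in the kernel log when `dmesg` is read. The helper name `blocked_tasks` is made up for illustration, and it will misparse task names that contain spaces.

```shell
#!/bin/sh
# Print the names of tasks reported in uninterruptible sleep (state "D")
# in a sysrq-w dump read from stdin. Task header lines look like:
#   [70874.969550] thunderbird-bin D 000000010434e04e 0 32283 32279 0x00000080
blocked_tasks() {
    # Drop the leading "[timestamp]" field, then keep lines whose second
    # column is the task state "D" (stack and call-trace lines never match).
    sed 's/^\[[^]]*] *//' | awk '$2 == "D" && NF >= 6 { print $1 }'
}

# Capture steps (root only; left commented out so sourcing this is harmless):
#   echo 1 > /proc/sys/kernel/sysrq     # enable all sysrq functions
#   echo w > /proc/sysrq-trigger        # dump tasks in uninterruptible sleep
#   dmesg | blocked_tasks
```

Attaching the full `dmesg` output to the bug, rather than pasting it into an email, avoids the word-wrapping problem mentioned above; the helper is only a quick way to see which tasks are stuck.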
Reply-To: avillaci@fiec.espol.edu.ec

On 15/03/11 18:19, Andrew Morton wrote:
> On Tue, 15 Mar 2011 17:53:16 -0500 Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:
>> On 15/03/11 15:53, Andrew Morton wrote:
>>> Please enable sysrq and do a sysrq-w when the tasks are blocked so we
>>> can find where things are getting stuck. Please avoid email client
>>> wordwrapping when sending us the sysrq output.

Posted sysrq-w report into original bug report to avoid email word-wrap.

On Wed, 16 Mar 2011 10:25:16 -0500 Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:

> Posted sysrq-w report into original bug report to avoid email word-wrap.
https://bugzilla.kernel.org/attachment.cgi?id=50952

Interesting bits:

[70874.969550] thunderbird-bin D 000000010434e04e 0 32283 32279 0x00000080
[70874.969553] ffff88011ba91838 0000000000000086 ffff880100000000 0000000000013880
[70874.969557] 0000000000013880 ffff88010a231780 ffff88011ba91fd8 0000000000013880
[70874.969560] 0000000000013880 0000000000013880 0000000000013880 ffff88011ba91fd8
[70874.969564] Call Trace:
[70874.969567] [<ffffffff810d7ed3>] ? sync_page+0x0/0x4d
[70874.969569] [<ffffffff810d7ed3>] ? sync_page+0x0/0x4d
[70874.969572] [<ffffffff8147cedb>] io_schedule+0x47/0x62
[70874.969575] [<ffffffff810d7f1c>] sync_page+0x49/0x4d
[70874.969577] [<ffffffff8147d36a>] __wait_on_bit+0x48/0x7b
[70874.969580] [<ffffffff810d8100>] wait_on_page_bit+0x72/0x79
[70874.969583] [<ffffffff8106cdb4>] ? wake_bit_function+0x0/0x31
[70874.969586] [<ffffffff8111833c>] migrate_pages+0x1ac/0x38d
[70874.969589] [<ffffffff8110d6d7>] ? compaction_alloc+0x0/0x2a4
[70874.969592] [<ffffffff8110ddfd>] compact_zone+0x3f4/0x60e
[70874.969595] [<ffffffff8110e1dc>] compact_zone_order+0xc2/0xd1
[70874.969599] [<ffffffff8110e27f>] try_to_compact_pages+0x94/0xea
[70874.969602] [<ffffffff810de916>] __alloc_pages_direct_compact+0xa9/0x1a5
[70874.969605] [<ffffffff810ddbb8>] ? drain_local_pages+0x0/0x17
[70874.969607] [<ffffffff810df0b0>] __alloc_pages_nodemask+0x69e/0x766
[70874.969610] [<ffffffff81113701>] ? __slab_free+0x6d/0xf6
[70874.969614] [<ffffffff8110c0de>] alloc_pages_vma+0xec/0xf1
[70874.969617] [<ffffffff8111be1c>] do_huge_pmd_anonymous_page+0xbf/0x267
[70874.969620] [<ffffffff810f2497>] ? pmd_offset+0x19/0x40
[70874.969623] [<ffffffff810f5c70>] handle_mm_fault+0x15d/0x20f
[70874.969626] [<ffffffff8100f26a>] ? arch_get_unmapped_area_topdown+0x195/0x28f
[70874.969629] [<ffffffff8148178c>] do_page_fault+0x33b/0x35d
[70874.969632] [<ffffffff810fb07d>] ? do_mmap_pgoff+0x29a/0x2f4
[70874.969635] [<ffffffff8112dcee>] ? path_put+0x22/0x27
[70874.969638] [<ffffffff8147f145>] page_fault+0x25/0x30

[70874.969731] gedit D 000000010434dfb0 0 32356 1 0x00000080
[70874.969734] ffff8800982ab558 0000000000000082 ffff880102400001 0000000000013880
[70874.969737] 0000000000013880 ffff880117408000 ffff8800982abfd8 0000000000013880
[70874.969741] 0000000000013880 0000000000013880 0000000000013880 ffff8800982abfd8
[70874.969744] Call Trace:
[70874.969747] [<ffffffff8147cedb>] io_schedule+0x47/0x62
[70874.969750] [<ffffffff8121c403>] get_request_wait+0x10a/0x197
[70874.969753] [<ffffffff8106cd77>] ? autoremove_wake_function+0x0/0x3d
[70874.969756] [<ffffffff8121ccc4>] __make_request+0x2c8/0x3e0
[70874.969759] [<ffffffff8111487d>] ? kmem_cache_alloc+0x73/0xeb
[70874.969762] [<ffffffff8121bb67>] generic_make_request+0x2bc/0x336
[70874.969765] [<ffffffff811213a5>] ? lookup_page_cgroup+0x36/0x4c
[70874.969768] [<ffffffff8121bcc1>] submit_bio+0xe0/0xff
[70874.969770] [<ffffffff8114d72d>] ? bio_alloc_bioset+0x4d/0xc4
[70874.969773] [<ffffffff810edf1f>] ? inc_zone_page_state+0x2d/0x2f
[70874.969776] [<ffffffff81149274>] submit_bh+0xe8/0x10e
[70874.969779] [<ffffffff8114b9fa>] __block_write_full_page+0x1ea/0x2da
[70874.969782] [<ffffffffa0689202>] ? udf_get_block+0x0/0x115 [udf]
[70874.969785] [<ffffffff8114a640>] ? end_buffer_async_write+0x0/0x12d
[70874.969788] [<ffffffff8114a640>] ? end_buffer_async_write+0x0/0x12d
[70874.969791] [<ffffffffa0689202>] ? udf_get_block+0x0/0x115 [udf]
[70874.969794] [<ffffffff8114bb76>] block_write_full_page_endio+0x8c/0x98
[70874.969796] [<ffffffff8114bb97>] block_write_full_page+0x15/0x17
[70874.969800] [<ffffffffa0686027>] udf_writepage+0x18/0x1a [udf]
[70874.969803] [<ffffffff81117fed>] move_to_new_page+0x106/0x195
[70874.969806] [<ffffffff811183de>] migrate_pages+0x24e/0x38d
[70874.969809] [<ffffffff8110d6d7>] ? compaction_alloc+0x0/0x2a4
[70874.969812] [<ffffffff8110ddfd>] compact_zone+0x3f4/0x60e
[70874.969815] [<ffffffff81049c78>] ? load_balance+0xcb/0x6b0
[70874.969818] [<ffffffff8110e1dc>] compact_zone_order+0xc2/0xd1
[70874.969821] [<ffffffff8110e27f>] try_to_compact_pages+0x94/0xea
[70874.969824] [<ffffffff810de916>] __alloc_pages_direct_compact+0xa9/0x1a5
[70874.969827] [<ffffffff810dee79>] __alloc_pages_nodemask+0x467/0x766
[70874.969830] [<ffffffff810fcfd3>] ? anon_vma_alloc+0x1a/0x1c
[70874.969833] [<ffffffff81048a30>] ? get_parent_ip+0x11/0x41
[70874.969833] [<ffffffff8110c0de>] alloc_pages_vma+0xec/0xf1
[70874.969833] [<ffffffff8123430e>] ? rb_insert_color+0x66/0xe1
[70874.969833] [<ffffffff8111be1c>] do_huge_pmd_anonymous_page+0xbf/0x267
[70874.969833] [<ffffffff810f2497>] ? pmd_offset+0x19/0x40
[70874.969833] [<ffffffff810f5c70>] handle_mm_fault+0x15d/0x20f
[70874.969833] [<ffffffff8100f298>] ? arch_get_unmapped_area_topdown+0x1c3/0x28f
[70874.969833] [<ffffffff8148178c>] do_page_fault+0x33b/0x35d
[70874.969833] [<ffffffff810fb07d>] ? do_mmap_pgoff+0x29a/0x2f4
[70874.969833] [<ffffffff8112dcee>] ? path_put+0x22/0x27
[70874.969833] [<ffffffff8147f145>] page_fault+0x25/0x30

So it appears that the system is full of dirty pages against a slow device and your foreground processes have got stuck in direct reclaim -> compaction -> migration. That's Mel ;)

What happened to the plans to eliminate direct reclaim?

Reply-To: avillaci@fiec.espol.edu.ec

On 16/03/11 17:02, Andrew Morton wrote:
> So it appears that the system is full of dirty pages against a slow
> device and your foreground processes have got stuck in direct reclaim
> -> compaction -> migration. That's Mel ;)
>
> What happened to the plans to eliminate direct reclaim?

Browsing around bugzilla, I believe that bug 12309 looks very similar to the issue I am experiencing, especially from comment #525 onwards. Am I correct in this?

On Thu, 17 Mar 2011 16:27:29 -0500 Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:

> > So it appears that the system is full of dirty pages against a slow
> > device and your foreground processes have got stuck in direct reclaim
> > -> compaction -> migration. That's Mel ;)
> >
> > What happened to the plans to eliminate direct reclaim?
>
> Browsing around bugzilla, I believe that bug 12309 looks very similar to the
> issue I am experiencing, especially from comment #525 onwards. Am I correct
> in this?

ah, the epic 12309. https://bugzilla.kernel.org/show_bug.cgi?id=12309. If you're ever wondering how much we suck, go read that one.

I think what we're seeing in 31142 is a large amount of dirty data buffered against a slow device. Innocent processes enter page reclaim and end up getting stuck trying to write to that heavily-queued and slow device.

If so, that's probably what some of the 12309 participants are seeing. But there are lots of other things in that report too.

Now, the problem you're seeing in 31142 isn't really supposed to happen. In the direct-reclaim case the code will try to avoid initiation of blocking I/O against a congested device, via the bdi_write_congested() test in may_write_to_queue(). Although that code now looks a bit busted for the order>PAGE_ALLOC_COSTLY_ORDER case, whodidthat.

However in the case of the new(ish) compaction/migration code I don't think we're performing that test. migrate_pages()->unmap_and_move() will get stuck behind that large&slow IO queue if page reclaim decided to pass it down sync==true, as it apparently has done.

IOW, Mel broke it ;)

Reply-To: avillaci@fiec.espol.edu.ec

On 17/03/11 16:47, Andrew Morton wrote:
> I think what we're seeing in 31142 is a large amount of dirty data
> buffered against a slow device. Innocent processes enter page reclaim
> and end up getting stuck trying to write to that heavily-queued and
> slow device.
>
> However in the case of the new(ish) compaction/migration code I don't
> think we're performing that test. migrate_pages()->unmap_and_move()
> will get stuck behind that large&slow IO queue if page reclaim decided
> to pass it down sync==true, as it apparently has done.
>
> IOW, Mel broke it ;)

I don't quite follow. In my case, the congested device is the USB stick, but the affected processes should be reading/writing on the hard disk. What kind of queue(s) implementation results in pending writes to the USB stick interfering with I/O to the hard disk? Or am I misunderstanding? I had the (possibly incorrect) impression that each block device had its own I/O queue.

On Thu, 17 Mar 2011 17:11:05 -0500 Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:

> I don't quite follow. In my case, the congested device is the USB stick, but
> the affected processes should be reading/writing on the hard disk. What kind
> of queue(s) implementation results in pending writes to the USB stick
> interfering with I/O to the hard disk? Or am I misunderstanding? I had the
> (possibly incorrect) impression that each block device had its own I/O queue.

Your web browser is just trying to allocate some memory. As part of that operation it entered the kernel's page reclaim and while scanning for memory to free, page reclaim encountered a page which was queued for IO. Then page reclaim waited for the IO to complete against that page. So the browser got stuck.

Page reclaim normally tries to avoid this situation by not waiting on such pages, unless the calling process was itself involved in writing to the page's device (stored in current->backing_dev_info). But afaict the new compaction/migration code forgot to do this.
Reply-To: mel@csn.ul.ie

On Thu, Mar 17, 2011 at 02:47:27PM -0700, Andrew Morton wrote:
> On Thu, 17 Mar 2011 16:27:29 -0500
> Alex Villacís Lasso <avillaci@fiec.espol.edu.ec> wrote:
>
> > Browsing around bugzilla, I believe that bug 12309 looks very similar to
> > the issue I am experiencing, especially from comment #525 onwards. Am I
> > correct in this?
>
> ah, the epic 12309. https://bugzilla.kernel.org/show_bug.cgi?id=12309.
> If you're ever wondering how much we suck, go read that one.
>

I'm reasonably sure over the last few series that we've taken a number of steps to mitigate the problems described in #12309 although it's been a while since I double checked. When I last stopped looking at it, we had reached the stage where the dirty pages encountered by writeback was greatly reduced which should have affected the stalls reported in that bug. I stopped working on it further to see how the IO-less dirty balancing being worked on by Wu and Jan worked out because the next reasonable step was making sure the flusher threads were behaving as expected. That is still a work in progress.

> I think what we're seeing in 31142 is a large amount of dirty data
> buffered against a slow device. Innocent processes enter page reclaim
> and end up getting stuck trying to write to that heavily-queued and
> slow device.
>
> If so, that's probably what some of the 12309 participants are seeing.
> But there are lots of other things in that report too.
>
> Now, the problem you're seeing in 31142 isn't really supposed to
> happen. In the direct-reclaim case the code will try to avoid
> initiation of blocking I/O against a congested device, via the
> bdi_write_congested() test in may_write_to_queue(). Although that code
> now looks a bit busted for the order>PAGE_ALLOC_COSTLY_ORDER case,
> whodidthat.
>
> However in the case of the new(ish) compaction/migration code I don't
> think we're performing that test. migrate_pages()->unmap_and_move()
> will get stuck behind that large&slow IO queue if page reclaim decided
> to pass it down sync==true, as it apparently has done.
>
> IOW, Mel broke it ;)
>

\o/ ... no wait, it's the other one - :(

If you look at the stack traces though, all of them had called do_huge_pmd_anonymous_page() so while it looks similar to 12309, the trigger is new because it's THP triggering compaction that is causing the stalls rather than page reclaim doing direct writeback which was the culprit in the past.

To confirm if this is the case, I'd be very interested in hearing if this problem persists in the following cases

1. 2.6.38-rc8 with defrag disabled by
   echo never >/sys/kernel/mm/transparent_hugepage/defrag
   (this will stop THP allocations calling into compaction)
2. 2.6.38-rc8 with THP disabled by
   echo never > /sys/kernel/mm/transparent_hugepage/enabled
   (if the problem still persists, then page reclaim is still a problem but we should still stop THP doing sync writes)
3. 2.6.37 vanilla
   (in case this is a new regression introduced since then)

Migration can do sync writes on dirty pages which is why it looks so similar to page reclaim but this can be controlled by the value of sync_migration passed into try_to_compact_pages(). If we find that option 1 above makes the regression go away or at least helps a lot, then a reasonable fix may be to never set sync_migration if __GFP_NO_KSWAPD which is always set for THP allocations. I've added Andrea to the cc to see what he thinks.

Thanks for the report.
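Test cases 1 and 2 above can be scripted so that the previous setting is restored after the copy. This is a rough sketch only, not something posted in the thread: it assumes the two sysfs files list every choice on one line with the active one in square brackets (for example `[always] madvise never`), and the helper names `thp_active` and `thp_set` are hypothetical.

```shell
#!/bin/sh
# Print the active choice from a THP sysfs file. The file lists all choices
# and marks the active one with brackets: "[always] madvise never".
thp_active() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Write a new choice to the file and print the previous one, so the caller
# can restore it later.
thp_set() {
    prev=$(thp_active "$1") || return 1
    printf '%s\n' "$2" > "$1" || return 1
    printf '%s\n' "$prev"
}

# Example for test case 1 (root only; commented out on purpose):
#   f=/sys/kernel/mm/transparent_hugepage/defrag
#   old=$(thp_set "$f" never)
#   ... repeat the large copy to the USB stick and watch for stalls ...
#   thp_set "$f" "$old" >/dev/null
```

For test case 2 the same two calls apply with `/sys/kernel/mm/transparent_hugepage/enabled` as the file. Passing the path as an argument also lets the helpers be tried against an ordinary file first.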
Reply-To: aarcange@redhat.com

On Fri, Mar 18, 2011 at 11:13:00AM +0000, Mel Gorman wrote:
> To confirm if this is the case, I'd be very interested in hearing if this
> problem persists in the following cases
>
> 1. 2.6.38-rc8 with defrag disabled by
>    echo never >/sys/kernel/mm/transparent_hugepage/defrag
>    (this will stop THP allocations calling into compaction)
> 2. 2.6.38-rc8 with THP disabled by
>    echo never > /sys/kernel/mm/transparent_hugepage/enabled
>    (if the problem still persists, then page reclaim is still a problem
>    but we should still stop THP doing sync writes)
> 3. 2.6.37 vanilla
>    (in case this is a new regression introduced since then)
>
> Migration can do sync writes on dirty pages which is why it looks so similar
> to page reclaim but this can be controlled by the value of sync_migration
> passed into try_to_compact_pages(). If we find that option 1 above makes
> the regression go away or at least helps a lot, then a reasonable fix may
> be to never set sync_migration if __GFP_NO_KSWAPD which is always set for
> THP allocations. I've added Andrea to the cc to see what he thinks.

I agree. Forcing sync=0 when __GFP_NO_KSWAPD is set, sounds good to me, if it is proven to resolve these I/O waits.

Also note that 2.6.38 upstream still misses a couple of important compaction fixes that are in aa.git (everything relevant is already queued in -mm but it was a bit late for 2.6.38), so I'd also be interested to know if you can reproduce in current aa.git origin/master branch.

If it's a __GFP_NO_KSWAPD allocation (do_huge_pmd_anonymous_page()) that is present in the hanging stack traces, I strongly doubt any of the changes in aa.git is going to help at all, but it is worth a try to be sure.
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=48ad57f498835621d8bad83b972ee6e6c395523a http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=8f6854f7cbf71bc61758bcd92497378e1f677552 http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=8ff6d16eb15d2b328bbe715fcaf453b6fedb2cf9 http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=e31adb46cd8c4f331cfb02c938e88586d5846bf8 This is the implementation of Mel's idea that you can apply to upstream or aa.git to see what happens... === Subject: compaction: use async migrate for __GFP_NO_KSWAPD From: Andrea Arcangeli <aarcange@redhat.com> __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory to succeed (they have graceful fallback). Waiting for I/O in those, tends to be overkill in terms of latencies, so we can reduce their latency by disabling sync migrate. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index bd76256..36d1c79 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2085,7 +2085,7 @@ rebalance: sync_migration); if (page) goto got_pg; - sync_migration = true; + sync_migration = !(gfp_mask & __GFP_NO_KSWAPD); /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, Reply-To: avillaci@fiec.espol.edu.ec El 18/03/11 06:13, Mel Gorman escribió: > > \o/ ... no wait, it's the other one - :( > > If you look at the stack traces though, all of them had called > do_huge_pmd_anonymous_page() so while it looks similar to 12309, the trigger > is new because it's THP triggering compaction that is causing the stalls > rather than page reclaim doing direct writeback which was the culprit in > the past. > > To confirm if this is the case, I'd be very interested in hearing if this > problem persists in the following cases > > 1. 
2.6.38-rc8 with defrag disabled by > echo never>/sys/kernel/mm/transparent_hugepage/defrag > (this will stop THP allocations calling into compaction) > 2. 2.6.38-rc8 with THP disabled by > echo never> > /sys/kernel/mm/transparent_hugepage/enabled > (if the problem still persists, then page reclaim is still a problem > but we should still stop THP doing sync writes) > 3. 2.6.37 vanilla > (in case this is a new regression introduced since then) > > Migration can do sync writes on dirty pages which is why it looks so similar > to page reclaim but this can be controlled by the value of sync_migration > passed into try_to_compact_pages(). If we find that option 1 above makes > the regression go away or at least helps a lot, then a reasonable fix may > be to never set sync_migration if __GFP_NO_KSWAPD which is always set for > THP allocations. I've added Andrea to the cc to see what he thinks. > > Thanks for the report. > I have just done tests 1 and 2 on 2.6.38 (final, not -rc8), and I have verified that echoing "never" on either /sys/kernel/mm/transparent_hugepage/defrag or /sys/kernel/mm/transparent_hugepage/enabled does allow the file copy to USB to proceed smoothly (copying 4GB of data). Just to verify, I later wrote "always" to both files, and sure enough, some applications stalled when I repeated the same file copy. So I have at least a workaround for the issue. Given this evidence, will the patch at comment #14 fix the issue for good? Reply-To: mel@csn.ul.ie On Fri, Mar 18, 2011 at 01:05:15PM -0500, Alex Villac??s Lasso wrote: > El 18/03/11 06:13, Mel Gorman escribió: > > > >\o/ ... no wait, it's the other one - :( > > > >If you look at the stack traces though, all of them had called > >do_huge_pmd_anonymous_page() so while it looks similar to 12309, the trigger > >is new because it's THP triggering compaction that is causing the stalls > >rather than page reclaim doing direct writeback which was the culprit in > >the past. 
> >
> >To confirm if this is the case, I'd be very interested in hearing if this
> >problem persists in the following cases
> >
> >1. 2.6.38-rc8 with defrag disabled by
> >   echo never>/sys/kernel/mm/transparent_hugepage/defrag
> >   (this will stop THP allocations calling into compaction)
> >2. 2.6.38-rc8 with THP disabled by
> >   echo never> /sys/kernel/mm/transparent_hugepage/enabled
> >   (if the problem still persists, then page reclaim is still a problem
> >   but we should still stop THP doing sync writes)
> >3. 2.6.37 vanilla
> >   (in case this is a new regression introduced since then)
> >
> >Migration can do sync writes on dirty pages which is why it looks so similar
> >to page reclaim but this can be controlled by the value of sync_migration
> >passed into try_to_compact_pages(). If we find that option 1 above makes
> >the regression go away or at least helps a lot, then a reasonable fix may
> >be to never set sync_migration if __GFP_NO_KSWAPD which is always set for
> >THP allocations. I've added Andrea to the cc to see what he thinks.
> >
> >Thanks for the report.
> >
>
> I have just done tests 1 and 2 on 2.6.38 (final, not -rc8), and I
> have verified that echoing "never" on either
> /sys/kernel/mm/transparent_hugepage/defrag or
> /sys/kernel/mm/transparent_hugepage/enabled does allow the file copy
> to USB to proceed smoothly (copying 4GB of data). Just to verify, I
> later wrote "always" to both files, and sure enough, some
> applications stalled when I repeated the same file copy. So I have
> at least a workaround for the issue. Given this evidence, will the
> patch at comment #14 fix the issue for good?
>

Thanks for testing and reporting, it's very helpful. Based on that report, the patch should help. Can you test it to be absolutely sure please?

Created attachment 51262 [details]
sysrq-w trace with patch applied
The patch did not help; it only delayed the appearance of the symptom. This time it took longer (about 4 minutes) before several applications started freezing. The attached sysrq-w trace shows the situation.
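Since the first patch only delays the stalls, the workaround confirmed earlier in this thread (writing "never" to /sys/kernel/mm/transparent_hugepage/defrag or /sys/kernel/mm/transparent_hugepage/enabled) is still the only mitigation that has been shown to work at this point. As a minimal sketch, assuming those sysfs files report the active choice in square brackets (for example "always madvise [never]"), a test helper could check the setting before starting a large copy. The names thp_active_value() and thp_is_never() are hypothetical and exist only for this example; they are not kernel or libc APIs.

```c
#include <stddef.h>
#include <string.h>

/*
 * Extract the active value from a THP sysfs line such as
 * "always madvise [never]". The active choice is assumed to be the
 * one enclosed in square brackets.
 *
 * Returns 0 on success, -1 if no bracketed value is found or the
 * output buffer is too small.
 *
 * Hypothetical helper for illustration only.
 */
int thp_active_value(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;

    /* Number of characters strictly between '[' and ']'. */
    len = (size_t)(close - open - 1);
    if (len + 1 > outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Returns 1 if the line shows "never" as the active value, else 0. */
int thp_is_never(const char *line)
{
    char value[32];

    if (thp_active_value(line, value, sizeof(value)) != 0)
        return 0;
    return strcmp(value, "never") == 0;
}
```

A caller would read one line from the defrag or enabled file and pass it to thp_is_never(); reading the file itself is left out so the sketch stays independent of the running kernel.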
Reply-To: avillaci@fiec.espol.edu.ec El 19/03/11 08:46, Mel Gorman escribió: > On Fri, Mar 18, 2011 at 01:05:15PM -0500, Alex Villac??s Lasso wrote: >> El 18/03/11 06:13, Mel Gorman escribió: >>> \o/ ... no wait, it's the other one - :( >>> >>> If you look at the stack traces though, all of them had called >>> do_huge_pmd_anonymous_page() so while it looks similar to 12309, the >>> trigger >>> is new because it's THP triggering compaction that is causing the stalls >>> rather than page reclaim doing direct writeback which was the culprit in >>> the past. >>> >>> To confirm if this is the case, I'd be very interested in hearing if this >>> problem persists in the following cases >>> >>> 1. 2.6.38-rc8 with defrag disabled by >>> echo never>/sys/kernel/mm/transparent_hugepage/defrag >>> (this will stop THP allocations calling into compaction) >>> 2. 2.6.38-rc8 with THP disabled by >>> echo never> >>> /sys/kernel/mm/transparent_hugepage/enabled >>> (if the problem still persists, then page reclaim is still a problem >>> but we should still stop THP doing sync writes) >>> 3. 2.6.37 vanilla >>> (in case this is a new regression introduced since then) >>> >>> Migration can do sync writes on dirty pages which is why it looks so >>> similar >>> to page reclaim but this can be controlled by the value of sync_migration >>> passed into try_to_compact_pages(). If we find that option 1 above makes >>> the regression go away or at least helps a lot, then a reasonable fix may >>> be to never set sync_migration if __GFP_NO_KSWAPD which is always set for >>> THP allocations. I've added Andrea to the cc to see what he thinks. >>> >>> Thanks for the report. >>> >> I have just done tests 1 and 2 on 2.6.38 (final, not -rc8), and I >> have verified that echoing "never" on either >> /sys/kernel/mm/transparent_hugepage/defrag or >> /sys/kernel/mm/transparent_hugepage/enabled does allow the file copy >> to USB to proceed smoothly (copying 4GB of data). 
Just to verify, I >> later wrote "always" to both files, and sure enough, some >> applications stalled when I repeated the same file copy. So I have >> at least a workaround for the issue. Given this evidence, will the >> patch at comment #14 fix the issue for good? >> > Thanks for testing and reporting, it's very helpful. Based on that that > report the patch should help. Can you test it to be absolutly sure please? > > The patch did not help. I have attached a sysrq-w trace with the patch applied in the bug report. Reply-To: aarcange@redhat.com On Sat, Mar 19, 2011 at 11:04:02AM -0500, Alex Villacís Lasso wrote: > The patch did not help. I have attached a sysrq-w trace with the patch > applied in the bug report. Most processes are stuck in udf_writepage. That's because migrate is calling ->writepage on dirty pages even when sync=0. This may do better, can you test it in replacement of the previous patch? Thanks, Andrea === Subject: compaction: use async migrate for __GFP_NO_KSWAPD From: Andrea Arcangeli <aarcange@redhat.com> __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory to succeed (they have graceful fallback). Waiting for I/O in those, tends to be overkill in terms of latencies, so we can reduce their latency by disabling sync migrate. Stop calling ->writepage on dirty cache when migrate sync mode is not set. 
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/migrate.c | 35 ++++++++++++++++++++++++++--------- mm/page_alloc.c | 2 +- 2 files changed, 27 insertions(+), 10 deletions(-) --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2085,7 +2085,7 @@ rebalance: sync_migration); if (page) goto got_pg; - sync_migration = true; + sync_migration = !(gfp_mask & __GFP_NO_KSWAPD); /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, --- a/mm/migrate.c +++ b/mm/migrate.c @@ -536,10 +536,15 @@ static int writeout(struct address_space * Default handling if a filesystem does not provide a migration function. */ static int fallback_migrate_page(struct address_space *mapping, - struct page *newpage, struct page *page) + struct page *newpage, struct page *page, + int sync) { - if (PageDirty(page)) - return writeout(mapping, page); + if (PageDirty(page)) { + if (sync) + return writeout(mapping, page); + else + return -EBUSY; + } /* * Buffers may be managed in a filesystem specific way. 
@@ -564,7 +569,7 @@ static int fallback_migrate_page(struct * == 0 - success */ static int move_to_new_page(struct page *newpage, struct page *page, - int remap_swapcache) + int remap_swapcache, int sync) { struct address_space *mapping; int rc; @@ -597,7 +602,7 @@ static int move_to_new_page(struct page rc = mapping->a_ops->migratepage(mapping, newpage, page); else - rc = fallback_migrate_page(mapping, newpage, page); + rc = fallback_migrate_page(mapping, newpage, page, sync); if (rc) { newpage->mapping = NULL; @@ -641,6 +646,10 @@ static int unmap_and_move(new_page_t get rc = -EAGAIN; if (!trylock_page(page)) { + if (!sync) { + rc = -EBUSY; + goto move_newpage; + } if (!force) goto move_newpage; @@ -686,7 +695,11 @@ static int unmap_and_move(new_page_t get BUG_ON(charge); if (PageWriteback(page)) { - if (!force || !sync) + if (!sync) { + rc = -EBUSY; + goto uncharge; + } + if (!force) goto uncharge; wait_on_page_writeback(page); } @@ -757,7 +770,7 @@ static int unmap_and_move(new_page_t get skip_unmap: if (!page_mapped(page)) - rc = move_to_new_page(newpage, page, remap_swapcache); + rc = move_to_new_page(newpage, page, remap_swapcache, sync); if (rc && remap_swapcache) remove_migration_ptes(page, page); @@ -834,7 +847,11 @@ static int unmap_and_move_huge_page(new_ rc = -EAGAIN; if (!trylock_page(hpage)) { - if (!force || !sync) + if (!sync) { + rc = -EBUSY; + goto out; + } + if (!force) goto out; lock_page(hpage); } @@ -850,7 +867,7 @@ static int unmap_and_move_huge_page(new_ try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); if (!page_mapped(hpage)) - rc = move_to_new_page(new_hpage, hpage, 1); + rc = move_to_new_page(new_hpage, hpage, 1, sync); if (rc) remove_migration_ptes(hpage, hpage); Reply-To: mel@csn.ul.ie On Sun, Mar 20, 2011 at 12:51:44AM +0100, Andrea Arcangeli wrote: > On Sat, Mar 19, 2011 at 11:04:02AM -0500, Alex Villacís Lasso wrote: > > The patch did not help. 
I have attached a sysrq-w trace with the patch > applied in the bug report. > > Most processes are stuck in udf_writepage. That's because migrate is > calling ->writepage on dirty pages even when sync=0. > > This may do better, can you test it in replacement of the previous > patch? > > === > Subject: compaction: use async migrate for __GFP_NO_KSWAPD > > From: Andrea Arcangeli <aarcange@redhat.com> > > __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory to > succeed (they have graceful fallback). Waiting for I/O in those, tends to be > overkill in terms of latencies, so we can reduce their latency by disabling > sync migrate. > > Stop calling ->writepage on dirty cache when migrate sync mode is not set. > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > --- > mm/migrate.c | 35 ++++++++++++++++++++++++++--------- > mm/page_alloc.c | 2 +- > 2 files changed, 27 insertions(+), 10 deletions(-) > > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2085,7 +2085,7 @@ rebalance: > sync_migration); > if (page) > goto got_pg; > - sync_migration = true; > + sync_migration = !(gfp_mask & __GFP_NO_KSWAPD); > > /* Try direct reclaim and then allocating */ > page = __alloc_pages_direct_reclaim(gfp_mask, order, > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -536,10 +536,15 @@ static int writeout(struct address_space > * Default handling if a filesystem does not provide a migration function. > */ > static int fallback_migrate_page(struct address_space *mapping, > - struct page *newpage, struct page *page) > + struct page *newpage, struct page *page, > + int sync) > { > - if (PageDirty(page)) > - return writeout(mapping, page); > + if (PageDirty(page)) { > + if (sync) > + return writeout(mapping, page); > + else > + return -EBUSY; > + } > > /* > * Buffers may be managed in a filesystem specific way. The check is at the wrong level I believe because it misses NFS pages which will still get queued for IO which can block waiting on a request to complete. 
> @@ -564,7 +569,7 @@ static int fallback_migrate_page(struct > * == 0 - success > */ > static int move_to_new_page(struct page *newpage, struct page *page, > - int remap_swapcache) > + int remap_swapcache, int sync) sync should be bool. > { > struct address_space *mapping; > int rc; > @@ -597,7 +602,7 @@ static int move_to_new_page(struct page > rc = mapping->a_ops->migratepage(mapping, > newpage, page); > else > - rc = fallback_migrate_page(mapping, newpage, page); > + rc = fallback_migrate_page(mapping, newpage, page, sync); > > if (rc) { > newpage->mapping = NULL; > @@ -641,6 +646,10 @@ static int unmap_and_move(new_page_t get > rc = -EAGAIN; > > if (!trylock_page(page)) { > + if (!sync) { > + rc = -EBUSY; > + goto move_newpage; > + } It's overkill to return EBUSY just because we failed to get a lock which could be released very quickly. If we left rc as -EAGAIN it would retry again. The worst case scenario is that the current process is the holder of the lock and the loop is pointless but this is a relatively rare situation (other than Hugh's loopback test aside which seems to be particularly good at triggering that situation). > if (!force) > goto move_newpage; > > @@ -686,7 +695,11 @@ static int unmap_and_move(new_page_t get > BUG_ON(charge); > > if (PageWriteback(page)) { > - if (!force || !sync) > + if (!sync) { > + rc = -EBUSY; > + goto uncharge; > + } > + if (!force) > goto uncharge; Where as this is ok because if the page is being written back, it's fairly unlikely it'll get cleared quickly enough for the retry loop to make sense. 
> wait_on_page_writeback(page); > } > @@ -757,7 +770,7 @@ static int unmap_and_move(new_page_t get > > skip_unmap: > if (!page_mapped(page)) > - rc = move_to_new_page(newpage, page, remap_swapcache); > + rc = move_to_new_page(newpage, page, remap_swapcache, sync); > > if (rc && remap_swapcache) > remove_migration_ptes(page, page); > @@ -834,7 +847,11 @@ static int unmap_and_move_huge_page(new_ > rc = -EAGAIN; > > if (!trylock_page(hpage)) { > - if (!force || !sync) > + if (!sync) { > + rc = -EBUSY; > + goto out; > + } > + if (!force) > goto out; > lock_page(hpage); > } As before, it's worth retrying to get the lock as it could be released very shortly. > @@ -850,7 +867,7 @@ static int unmap_and_move_huge_page(new_ > try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); > > if (!page_mapped(hpage)) > - rc = move_to_new_page(new_hpage, hpage, 1); > + rc = move_to_new_page(new_hpage, hpage, 1, sync); > > if (rc) > remove_migration_ptes(hpage, hpage); > Because of the NFS pages and being a bit aggressive about using -EBUSY, how about the following instead? (build tested only unfortunately) ==== CUT HERE ==== mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback From: Andrea Arcangeli <aarcange@redhat.com> __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory to succeed as they have graceful fallback. Waiting for I/O in those, tends to be overkill in terms of latencies, so we can reduce their latency by disabling sync migrate. Unfortunately, even with async migration it's still possible for the process to be blocked waiting for a request slot (e.g. get_request_wait in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on dirty page cache for asynchronous migration. 
[mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/migrate.c | 47 ++++++++++++++++++++++++++++++----------------- mm/page_alloc.c | 2 +- 2 files changed, 31 insertions(+), 18 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 352de555..1b45508 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -564,7 +564,7 @@ static int fallback_migrate_page(struct address_space *mapping, * == 0 - success */ static int move_to_new_page(struct page *newpage, struct page *page, - int remap_swapcache) + int remap_swapcache, bool sync) { struct address_space *mapping; int rc; @@ -586,18 +586,23 @@ static int move_to_new_page(struct page *newpage, struct page *page, mapping = page_mapping(page); if (!mapping) rc = migrate_page(mapping, newpage, page); - else if (mapping->a_ops->migratepage) - /* - * Most pages have a mapping and most filesystems - * should provide a migration function. Anonymous - * pages are part of swap space which also has its - * own migration function. This is the most common - * path for page migration. - */ - rc = mapping->a_ops->migratepage(mapping, - newpage, page); - else - rc = fallback_migrate_page(mapping, newpage, page); + else { + /* Do not writeback pages if !sync */ + if (PageDirty(page) && !sync) + rc = -EBUSY; + else if (mapping->a_ops->migratepage) + /* + * Most pages have a mapping and most filesystems + * should provide a migration function. Anonymous + * pages are part of swap space which also has its + * own migration function. This is the most common + * path for page migration. 
+ */ + rc = mapping->a_ops->migratepage(mapping, + newpage, page); + else + rc = fallback_migrate_page(mapping, newpage, page); + } if (rc) { newpage->mapping = NULL; @@ -641,7 +646,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private, rc = -EAGAIN; if (!trylock_page(page)) { - if (!force) + if (!force || !sync) goto move_newpage; /* @@ -686,7 +691,15 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private, BUG_ON(charge); if (PageWriteback(page)) { - if (!force || !sync) + /* + * For !sync, there is no point retrying as the retry loop + * is expected to be too short for PageWriteback to be cleared + */ + if (!sync) { + rc = -EBUSY; + goto uncharge; + } + if (!force) goto uncharge; wait_on_page_writeback(page); } @@ -757,7 +770,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private, skip_unmap: if (!page_mapped(page)) - rc = move_to_new_page(newpage, page, remap_swapcache); + rc = move_to_new_page(newpage, page, remap_swapcache, sync); if (rc && remap_swapcache) remove_migration_ptes(page, page); @@ -850,7 +863,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); if (!page_mapped(hpage)) - rc = move_to_new_page(new_hpage, hpage, 1); + rc = move_to_new_page(new_hpage, hpage, 1, sync); if (rc) remove_migration_ptes(hpage, hpage); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cdef1d4..ce6d601 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2085,7 +2085,7 @@ rebalance: sync_migration); if (page) goto got_pg; - sync_migration = true; + sync_migration = !(gfp_mask & __GFP_NO_KSWAPD); /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, Reply-To: aarcange@redhat.com On Mon, Mar 21, 2011 at 09:41:49AM +0000, Mel Gorman wrote: > The check is at the wrong level I believe because it misses NFS pages which > will still get queued for IO which can block waiting on a 
request to
> complete.

But for example ->migratepage won't block at all for swapcache: it's just a pointer to migrate_page. So I didn't want to skip what could be nonblocking; that just makes migrate less reliable for no good reason in some cases. The fallback case is very likely blocking instead, so I only returned -EBUSY there.

Best would be to pass a sync/nonblock param to migratepage(nonblock) so that nfs_migrate_page can pass "nonblock" instead of "false" to nfs_find_and_lock_request.

> sync should be bool.

That's better, thanks.

> It's overkill to return EBUSY just because we failed to get a lock which could
> be released very quickly. If we left rc as -EAGAIN it would retry again.
> The worst case scenario is that the current process is the holder of the
> lock and the loop is pointless but this is a relatively rare situation
> (other than Hugh's loopback test aside which seems to be particularly good
> at triggering that situation).

This change was only meant to possibly avoid some CPU waste in the tight loop; it's not really "blocking" related, so I'm OK with dropping it for now. The page lock holder had better be quick, because with sync=0 the tight loop will retry really fast. If the holder blocks, retrying in a tight loop is not so smart, but for now it's OK.

> >             goto move_newpage;
> >
> > @@ -686,7 +695,11 @@ static int unmap_and_move(new_page_t get
> >      BUG_ON(charge);
> >
> >      if (PageWriteback(page)) {
> > -        if (!force || !sync)
> > +        if (!sync) {
> > +            rc = -EBUSY;
> > +            goto uncharge;
> > +        }
> > +        if (!force)
> >              goto uncharge;
>
> Where as this is ok because if the page is being written back, it's fairly
> unlikely it'll get cleared quickly enough for the retry loop to make sense.

Agreed.

> Because of the NFS pages and being a bit aggressive about using -EBUSY,
> how about the following instead?
> (build tested only unfortunately)

I tested my version below, but I think one needs udf with lots of dirty pages plus the USB stick to trigger this, which I don't have set up right now.

> @@ -586,18 +586,23 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>      mapping = page_mapping(page);
>      if (!mapping)
>          rc = migrate_page(mapping, newpage, page);
> -    else if (mapping->a_ops->migratepage)
> -        /*
> -         * Most pages have a mapping and most filesystems
> -         * should provide a migration function. Anonymous
> -         * pages are part of swap space which also has its
> -         * own migration function. This is the most common
> -         * path for page migration.
> -         */
> -        rc = mapping->a_ops->migratepage(mapping,
> -                        newpage, page);
> -    else
> -        rc = fallback_migrate_page(mapping, newpage, page);
> +    else {
> +        /* Do not writeback pages if !sync */
> +        if (PageDirty(page) && !sync)
> +            rc = -EBUSY;

I think it's better to at least change it to:

    if (PageDirty(page) && !sync &&
        mapping->a_ops->migratepage != migrate_page)

I wasn't sure how to handle nonblocking ->migratepage for swapcache and tmpfs, but the above check is probably a good enough approximation.

Before sending my patch I thought of adding a "sync" parameter to ->migratepage(..., sync/nonblock), but then the patch became bigger, and I just wanted to know whether this was the problem or not, so I deferred it.

If we're sure that every ->migratepage blocks, except for things like swapcache/tmpfs or other non-file-backed mappings that define it as migrate_page, then we're pretty well covered by a check like the above (treating migratepage == migrate_page as nonblocking), and maybe we don't need to add a "sync/nonblock" parameter to ->migratepage(). For example, buffer_migrate_page can block too, in lock_buffer.

This is the patch I'm trying, with the addition of the above check and some comment space/tab cleanup.
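The check discussed above can be modeled in user space to make the decision table explicit. This is only a sketch under stated assumptions: every model_* name below is a hypothetical stand-in rather than kernel code, the callbacks take simplified arguments, and the fallback path is reduced to a stub that reports success.

```c
#include <stdbool.h>
#include <stddef.h>
#include <errno.h>

/* Hypothetical stand-in for struct page; only the dirty bit is modeled. */
struct model_page {
    bool dirty;
};

/* Simplified shape of a ->migratepage callback (assumption). */
typedef int (*model_migratepage_fn)(struct model_page *newpage,
                                    struct model_page *page);

/* Stand-in for migrate_page(): an in-memory copy, treated as nonblocking. */
int model_migrate_page(struct model_page *newpage, struct model_page *page)
{
    newpage->dirty = page->dirty;
    return 0;
}

/* Stand-in for a filesystem callback that could block in the real kernel. */
int model_blocking_migratepage(struct model_page *newpage,
                               struct model_page *page)
{
    newpage->dirty = page->dirty;
    return 0;
}

/*
 * Models the proposed dispatch: for async migration (!sync), a dirty page
 * is only handed to the callback when that callback is the nonblocking
 * model_migrate_page(); otherwise the caller gets -EBUSY. A NULL callback
 * represents the fallback path, which is also skipped for dirty pages
 * when !sync because NULL != model_migrate_page.
 */
int model_move_to_new_page(struct model_page *newpage,
                           struct model_page *page,
                           model_migratepage_fn fn, bool sync)
{
    if (page->dirty && !sync && fn != model_migrate_page)
        return -EBUSY;

    if (fn)
        return fn(newpage, page);

    /* Fallback stub: the real fallback may write the page out first. */
    newpage->dirty = page->dirty;
    return 0;
}
```

In this model, a dirty page with a potentially blocking callback is refused only for async migration; clean pages, sync migration, and mappings whose callback is the nonblocking one still proceed.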
=== Subject: mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback From: Andrea Arcangeli <aarcange@redhat.com> __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory to succeed as they have graceful fallback. Waiting for I/O in those, tends to be overkill in terms of latencies, so we can reduce their latency by disabling sync migrate. Unfortunately, even with async migration it's still possible for the process to be blocked waiting for a request slot (e.g. get_request_wait in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on dirty page cache for asynchronous migration. [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/migrate.c | 48 +++++++++++++++++++++++++++++++++--------------- mm/page_alloc.c | 2 +- 2 files changed, 34 insertions(+), 16 deletions(-) --- a/mm/migrate.c +++ b/mm/migrate.c @@ -564,7 +564,7 @@ static int fallback_migrate_page(struct * == 0 - success */ static int move_to_new_page(struct page *newpage, struct page *page, - int remap_swapcache) + int remap_swapcache, bool sync) { struct address_space *mapping; int rc; @@ -586,18 +586,28 @@ static int move_to_new_page(struct page mapping = page_mapping(page); if (!mapping) rc = migrate_page(mapping, newpage, page); - else if (mapping->a_ops->migratepage) + else { /* - * Most pages have a mapping and most filesystems - * should provide a migration function. Anonymous - * pages are part of swap space which also has its - * own migration function. This is the most common - * path for page migration. + * Do not writeback pages if !sync and migratepage is + * not pointing to migrate_page() which is nonblocking + * (swapcache/tmpfs uses migratepage = migrate_page). 
*/ - rc = mapping->a_ops->migratepage(mapping, - newpage, page); - else - rc = fallback_migrate_page(mapping, newpage, page); + if (PageDirty(page) && !sync && + mapping->a_ops->migratepage != migrate_page) + rc = -EBUSY; + else if (mapping->a_ops->migratepage) + /* + * Most pages have a mapping and most filesystems + * should provide a migration function. Anonymous + * pages are part of swap space which also has its + * own migration function. This is the most common + * path for page migration. + */ + rc = mapping->a_ops->migratepage(mapping, + newpage, page); + else + rc = fallback_migrate_page(mapping, newpage, page); + } if (rc) { newpage->mapping = NULL; @@ -641,7 +651,7 @@ static int unmap_and_move(new_page_t get rc = -EAGAIN; if (!trylock_page(page)) { - if (!force) + if (!force || !sync) goto move_newpage; /* @@ -686,7 +696,15 @@ static int unmap_and_move(new_page_t get BUG_ON(charge); if (PageWriteback(page)) { - if (!force || !sync) + /* + * For !sync, there is no point retrying as the retry loop + * is expected to be too short for PageWriteback to be cleared + */ + if (!sync) { + rc = -EBUSY; + goto uncharge; + } + if (!force) goto uncharge; wait_on_page_writeback(page); } @@ -757,7 +775,7 @@ static int unmap_and_move(new_page_t get skip_unmap: if (!page_mapped(page)) - rc = move_to_new_page(newpage, page, remap_swapcache); + rc = move_to_new_page(newpage, page, remap_swapcache, sync); if (rc && remap_swapcache) remove_migration_ptes(page, page); @@ -850,7 +868,7 @@ static int unmap_and_move_huge_page(new_ try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); if (!page_mapped(hpage)) - rc = move_to_new_page(new_hpage, hpage, 1); + rc = move_to_new_page(new_hpage, hpage, 1, sync); if (rc) remove_migration_ptes(hpage, hpage); --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2085,7 +2085,7 @@ rebalance: sync_migration); if (page) goto got_pg; - sync_migration = true; + sync_migration = !(gfp_mask & __GFP_NO_KSWAPD); /* Try direct reclaim 
and then allocating */
     page = __alloc_pages_direct_reclaim(gfp_mask, order,

Reply-To: avillaci@fiec.espol.edu.ec

On 21/03/11 08:48, Andrea Arcangeli wrote:
>
> ===
> Subject: mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce
> no writeback
>
> From: Andrea Arcangeli<aarcange@redhat.com>
>
> __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
> to succeed as they have graceful fallback. Waiting for I/O in those, tends
> to be overkill in terms of latencies, so we can reduce their latency by
> disabling sync migrate.
>
> Unfortunately, even with async migration it's still possible for the
> process to be blocked waiting for a request slot (e.g. get_request_wait
> in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD
> blocking, this patch prevents ->writepage being called on dirty page cache
> for asynchronous migration.
>
> [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
> ---
>  mm/migrate.c    | 48 +++++++++++++++++++++++++++++++++---------------
>  mm/page_alloc.c |  2 +-
>  2 files changed, 34 insertions(+), 16 deletions(-)

The latest patch fails to apply in vanilla 2.6.38:

[alex@srv64 linux-2.6.38]$ patch -p1 --dry-run < ../\[Bug\ 31142\]\ Large\ write\ to\ USB\ stick\ freezes\ unrelated\ tasks\ for\ a\ long\ time.eml
(Stripping trailing CRs from patch.)
patching file mm/migrate.c
Hunk #1 FAILED at 564.
Hunk #2 FAILED at 586.
Hunk #3 FAILED at 641.
Hunk #4 FAILED at 686.
Hunk #5 FAILED at 757.
Hunk #6 FAILED at 850.
6 out of 6 hunks FAILED -- saving rejects to file mm/migrate.c.rej
(Stripping trailing CRs from patch.)
patching file mm/page_alloc.c
Hunk #1 FAILED at 2085.
1 out of 1 hunk FAILED -- saving rejects to file mm/page_alloc.c.rej

I will try to apply the patch manually.
Reply-To: avillaci@fiec.espol.edu.ec

On 21/03/11 10:22, Alex Villacís Lasso wrote:
> On 21/03/11 08:48, Andrea Arcangeli wrote:
>>
>> ===
>> Subject: mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce
>> no writeback
>>
>> From: Andrea Arcangeli<aarcange@redhat.com>
>>
>> __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
>> to succeed as they have graceful fallback. Waiting for I/O in those, tends
>> to be overkill in terms of latencies, so we can reduce their latency by
>> disabling sync migrate.
>>
>> Unfortunately, even with async migration it's still possible for the
>> process to be blocked waiting for a request slot (e.g. get_request_wait
>> in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD
>> blocking, this patch prevents ->writepage being called on dirty page cache
>> for asynchronous migration.
>>
>> [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
>> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
>> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
>> ---
>>  mm/migrate.c    | 48 +++++++++++++++++++++++++++++++++---------------
>>  mm/page_alloc.c |  2 +-
>>  2 files changed, 34 insertions(+), 16 deletions(-)
> The latest patch fails to apply in vanilla 2.6.38:
>
> [alex@srv64 linux-2.6.38]$ patch -p1 --dry-run < ../\[Bug\ 31142\]\ Large\ write\ to\ USB\ stick\ freezes\ unrelated\ tasks\ for\ a\ long\ time.eml
> (Stripping trailing CRs from patch.)
> patching file mm/migrate.c
> Hunk #1 FAILED at 564.
> Hunk #2 FAILED at 586.
> Hunk #3 FAILED at 641.
> Hunk #4 FAILED at 686.
> Hunk #5 FAILED at 757.
> Hunk #6 FAILED at 850.
> 6 out of 6 hunks FAILED -- saving rejects to file mm/migrate.c.rej
> (Stripping trailing CRs from patch.)
> patching file mm/page_alloc.c
> Hunk #1 FAILED at 2085.
> 1 out of 1 hunk FAILED -- saving rejects to file mm/page_alloc.c.rej
>
> I will try to apply the patch manually.
>

After some massaging, I succeeded in applying the patch.
Compiling now.

Reply-To: aarcange@redhat.com

On Mon, Mar 21, 2011 at 10:22:43AM -0500, Alex Villacís Lasso wrote:
> I will try to apply the patch manually.

Hmm, I checked that this latest version below applies cleanly against both
vanilla 2.6.38 and current git. It should apply cleanly if you run:

"git fetch; git checkout -f origin/master"

or

"git checkout -f v2.6.38"

I have this and another migrate patch from Hugh both applied in aa.git, so
you may use that one too if you want:

"git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git"

(with --reference it should clone very fast; linux-2.6 must be a clone
of the upstream linux-2.6.git tree)

Thanks,
Andrea

===
Subject: mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback

From: Andrea Arcangeli <aarcange@redhat.com>

__GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
to succeed as they have graceful fallback. Waiting for I/O in those, tends
to be overkill in terms of latencies, so we can reduce their latency by
disabling sync migrate.

Unfortunately, even with async migration it's still possible for the
process to be blocked waiting for a request slot (e.g. get_request_wait
in the block layer) when ->writepage is called. To prevent __GFP_NO_KSWAPD
blocking, this patch prevents ->writepage being called on dirty page cache
for asynchronous migration.
[mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/migrate.c | 48 +++++++++++++++++++++++++++++++++--------------- mm/page_alloc.c | 2 +- 2 files changed, 34 insertions(+), 16 deletions(-) --- a/mm/migrate.c +++ b/mm/migrate.c @@ -564,7 +564,7 @@ static int fallback_migrate_page(struct * == 0 - success */ static int move_to_new_page(struct page *newpage, struct page *page, - int remap_swapcache) + int remap_swapcache, bool sync) { struct address_space *mapping; int rc; @@ -586,18 +586,28 @@ static int move_to_new_page(struct page mapping = page_mapping(page); if (!mapping) rc = migrate_page(mapping, newpage, page); - else if (mapping->a_ops->migratepage) + else { /* - * Most pages have a mapping and most filesystems - * should provide a migration function. Anonymous - * pages are part of swap space which also has its - * own migration function. This is the most common - * path for page migration. + * Do not writeback pages if !sync and migratepage is + * not pointing to migrate_page() which is nonblocking + * (swapcache/tmpfs uses migratepage = migrate_page). */ - rc = mapping->a_ops->migratepage(mapping, - newpage, page); - else - rc = fallback_migrate_page(mapping, newpage, page); + if (PageDirty(page) && !sync && + mapping->a_ops->migratepage != migrate_page) + rc = -EBUSY; + else if (mapping->a_ops->migratepage) + /* + * Most pages have a mapping and most filesystems + * should provide a migration function. Anonymous + * pages are part of swap space which also has its + * own migration function. This is the most common + * path for page migration. 
+ */ + rc = mapping->a_ops->migratepage(mapping, + newpage, page); + else + rc = fallback_migrate_page(mapping, newpage, page); + } if (rc) { newpage->mapping = NULL; @@ -641,7 +651,7 @@ static int unmap_and_move(new_page_t get rc = -EAGAIN; if (!trylock_page(page)) { - if (!force) + if (!force || !sync) goto move_newpage; /* @@ -686,7 +696,15 @@ static int unmap_and_move(new_page_t get BUG_ON(charge); if (PageWriteback(page)) { - if (!force || !sync) + /* + * For !sync, there is no point retrying as the retry loop + * is expected to be too short for PageWriteback to be cleared + */ + if (!sync) { + rc = -EBUSY; + goto uncharge; + } + if (!force) goto uncharge; wait_on_page_writeback(page); } @@ -757,7 +775,7 @@ static int unmap_and_move(new_page_t get skip_unmap: if (!page_mapped(page)) - rc = move_to_new_page(newpage, page, remap_swapcache); + rc = move_to_new_page(newpage, page, remap_swapcache, sync); if (rc && remap_swapcache) remove_migration_ptes(page, page); @@ -850,7 +868,7 @@ static int unmap_and_move_huge_page(new_ try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); if (!page_mapped(hpage)) - rc = move_to_new_page(new_hpage, hpage, 1); + rc = move_to_new_page(new_hpage, hpage, 1, sync); if (rc) remove_migration_ptes(hpage, hpage); --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2085,7 +2085,7 @@ rebalance: sync_migration); if (page) goto got_pg; - sync_migration = true; + sync_migration = !(gfp_mask & __GFP_NO_KSWAPD); /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, Reply-To: mel@csn.ul.ie On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote: > On Mon, Mar 21, 2011 at 09:41:49AM +0000, Mel Gorman wrote: > > The check is at the wrong level I believe because it misses NFS pages which > > will still get queued for IO which can block waiting on a request to > complete. > > But for example ->migratepage won't block at all for swapcache... it's > just a pointer for migrate_page... 
so I didnt' want to skip what could > be nonblocking, it just makes migrate less reliable for no good in > some case. The fallback case is very likely blocking instead so I only > returned -EBUSY there. > Fair point. > Best would be to pass a sync/nonblock param to migratepage(nonblock) > so that nfs_migrate_page can pass "nonblock" instead of "false" to > nfs_find_and_lock_request. > I had considered this but thought passing in sync to things like migrate_page() that ignored it looked a little ugly. > > sync should be bool. > > That's better thanks. > > > It's overkill to return EBUSY just because we failed to get a lock which > could > > be released very quickly. If we left rc as -EAGAIN it would retry again. > > The worst case scenario is that the current process is the holder of the > > lock and the loop is pointless but this is a relatively rare situation > > (other than Hugh's loopback test aside which seems to be particularly good > > at triggering that situation). > > This change was only meant to possibly avoid some cpu waste in the > tight loop, not really "blocking" related so I'm sure ok to drop it > for now. The page lock holder better to be quick because with sync=0 > the tight loop will retry real fast. If the holder blocks we're not so > smart at retrying in a tight loop but for now it's ok. > Ok. > > > goto move_newpage; > > > > > > @@ -686,7 +695,11 @@ static int unmap_and_move(new_page_t get > > > BUG_ON(charge); > > > > > > if (PageWriteback(page)) { > > > - if (!force || !sync) > > > + if (!sync) { > > > + rc = -EBUSY; > > > + goto uncharge; > > > + } > > > + if (!force) > > > goto uncharge; > > > > Where as this is ok because if the page is being written back, it's fairly > > unlikely it'll get cleared quickly enough for the retry loop to make sense. > > Agreed. > > > Because of the NFS pages and being a bit aggressive about using -EBUSY, > > how about the following instead? 
(build tested only unfortunately) > > I tested my version below but I think one needs udf with lots of dirty > pages plus the usb to trigger this which I don't have setup > immediately. > > > @@ -586,18 +586,23 @@ static int move_to_new_page(struct page *newpage, > struct page *page, > > mapping = page_mapping(page); > > if (!mapping) > > rc = migrate_page(mapping, newpage, page); > > - else if (mapping->a_ops->migratepage) > > - /* > > - * Most pages have a mapping and most filesystems > > - * should provide a migration function. Anonymous > > - * pages are part of swap space which also has its > > - * own migration function. This is the most common > > - * path for page migration. > > - */ > > - rc = mapping->a_ops->migratepage(mapping, > > - newpage, page); > > - else > > - rc = fallback_migrate_page(mapping, newpage, page); > > + else { > > + /* Do not writeback pages if !sync */ > > + if (PageDirty(page) && !sync) > > + rc = -EBUSY; > > I think it's better to at least change it to: > > if (PageDirty(page) && !sync && mapping->a_ops->migratepage != migrate_page)) > > I wasn't sure how to handle noblocking ->migratepage for swapcache and > tmpfs but probably the above check is a good enough approximation. > It's a good enough approximation. It's a little ugly but I don't think it's much uglier than passing in unused parameters to migrate_page(). > Before sending my patch I thought of adding a "sync" parameter to > ->migratepage(..., sync/nonblock) but then the patch become > bigger... and I just wanted to know if this was the problem or not so > I deferred it. > I deferred it for similar reasons. It was becoming a much larger change than should be necessary for the fix. 
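The check agreed on above (skip dirty pages during async migration unless the mapping's callback is the non-blocking migrate_page) can be modelled outside the kernel. The following is a minimal sketch, not the kernel code: `struct sketch_page` and every `sketch_*` function are hypothetical stand-ins invented for illustration, and the callbacks simply return 0 instead of migrating anything.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct sketch_page;
typedef int (*sketch_migrate_fn)(struct sketch_page *newpage,
				 struct sketch_page *page);

struct sketch_page {
	bool dirty;                    /* models PageDirty(page) */
	sketch_migrate_fn migratepage; /* models a_ops->migratepage, may be NULL */
};

/* Models migrate_page(): the non-blocking helper used by swapcache/tmpfs. */
static int sketch_migrate_page(struct sketch_page *newpage,
			       struct sketch_page *page)
{
	(void)newpage;
	(void)page;
	return 0;
}

/* Models a filesystem callback that may block, e.g. on a buffer lock. */
static int sketch_blocking_migrate(struct sketch_page *newpage,
				   struct sketch_page *page)
{
	(void)newpage;
	(void)page;
	return 0;
}

/* Models fallback_migrate_page(), which may have to write the page back. */
static int sketch_fallback_migrate(struct sketch_page *newpage,
				   struct sketch_page *page)
{
	(void)newpage;
	(void)page;
	return 0;
}

/* Decision for a page that has a mapping, mirroring the discussed check. */
static int sketch_move_mapped_page(struct sketch_page *newpage,
				   struct sketch_page *page, bool sync)
{
	assert(newpage != NULL && page != NULL);

	/* Async caller + dirty page + possibly blocking path: give up early. */
	if (page->dirty && !sync && page->migratepage != sketch_migrate_page)
		return -EBUSY;
	if (page->migratepage)
		return page->migratepage(newpage, page);
	return sketch_fallback_migrate(newpage, page);
}
```

Note that in this model a dirty page with no callback at all is also refused in async mode, because the fallback path is the one most likely to block.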
> If we're sure that all migratepage blocks except for things like > swapcache/tmpfs or other not-filebacked things that defines it to > migrate_page, we're pretty well covered by adding a check like above > migratepage == migrate_page and maybe we don't need to add a > "sync/nonblock" parameter to ->migratepage(). For example the > buffer_migrate_page can block too in lock_buffer. > Agreed. > This is the patch I'm trying with the addition of the above check and > some comment space/tab issue cleanup. > Nothing bad jumped out at me. Lets see how it gets on with testing. Thanks Created attachment 51532 [details] sysrq-w trace with patch from comment #21 applied As with the previous patch, this one did not completely solve the freezing tasks issue. However, as with the previous patch, the freezes took longer to appear, and now lasted less (10 to 12 seconds instead of freezing until the end of the usb copy). Reply-To: avillaci@fiec.espol.edu.ec El 21/03/11 11:37, Mel Gorman escribió: > On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote: > > Nothing bad jumped out at me. Lets see how it gets on with testing. > > Thanks > As with the previous patch, this one did not completely solve the freezing tasks issue. However, as with the previous patch, the freezes took longer to appear, and now lasted less (10 to 12 seconds instead of freezing until the end of the usb copy). I have attached the new sysrq-w trace to the bug report. Reply-To: aarcange@redhat.com On Mon, Mar 21, 2011 at 12:05:40PM -0500, Alex Villacís Lasso wrote: > El 21/03/11 11:37, Mel Gorman escribió: > > On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote: > > > > Nothing bad jumped out at me. Lets see how it gets on with testing. > > > > Thanks > > > As with the previous patch, this one did not completely solve the freezing > tasks issue. 
However, as with the previous patch, the freezes took longer to > appear, and now lasted less (10 to 12 seconds instead of freezing until the > end of the usb copy). > > I have attached the new sysrq-w trace to the bug report. migrate and compaction disappeared from the traces as we hoped for. The THP allocations left throttles on writeback during reclaim like any 4k allocation would do: [ 2629.256809] [<ffffffff810e43c3>] wait_on_page_writeback+0x1b/0x1d [ 2629.256812] [<ffffffff810e5992>] shrink_page_list+0x134/0x478 [ 2629.256815] [<ffffffff810e614f>] shrink_inactive_list+0x29f/0x39a [ 2629.256818] [<ffffffff810dbd55>] ? zone_watermark_ok+0x1f/0x21 [ 2629.256820] [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27 [ 2629.256823] [<ffffffff810e6849>] shrink_zone+0x362/0x464 [ 2629.256827] [<ffffffff810e6c87>] do_try_to_free_pages+0xdd/0x2e3 [ 2629.256830] [<ffffffff810e70eb>] try_to_free_pages+0xaa/0xef [ 2629.256833] [<ffffffff810deede>] __alloc_pages_nodemask+0x4cc/0x772 [ 2629.256837] [<ffffffff8110c0ea>] alloc_pages_vma+0xec/0xf1 [ 2629.256840] [<ffffffff8111be94>] do_huge_pmd_anonymous_page+0xbf/0x267 [ 2629.256844] [<ffffffff810f24a3>] ? pmd_offset+0x19/0x40 [ 2629.256846] [<ffffffff810f5c7c>] handle_mm_fault+0x15d/0x20f [ 2629.256850] [<ffffffff8100f298>] ? arch_get_unmapped_area_topdown+0x1c3/0x28f [ 2629.256853] [<ffffffff814818cc>] do_page_fault+0x33b/0x35d [ 2629.256856] [<ffffffff810fb089>] ? do_mmap_pgoff+0x29a/0x2f4 [ 2629.256859] [<ffffffff8112dd66>] ? path_put+0x22/0x27 [ 2629.256861] [<ffffffff8147f285>] page_fault+0x25/0x30 They throttle on writeback I/O completion like kswapd too: [ 2849.098751] [<ffffffff8147d00b>] io_schedule+0x47/0x62 [ 2849.098756] [<ffffffff8121c47b>] get_request_wait+0x10a/0x197 [ 2849.098760] [<ffffffff8106cd77>] ? autoremove_wake_function+0x0/0x3d [ 2849.098763] [<ffffffff8121cd3c>] __make_request+0x2c8/0x3e0 [ 2849.098767] [<ffffffff81114889>] ? 
kmem_cache_alloc+0x73/0xeb [ 2849.098771] [<ffffffff8121bbdf>] generic_make_request+0x2bc/0x336 [ 2849.098774] [<ffffffff8121bd39>] submit_bio+0xe0/0xff [ 2849.098777] [<ffffffff8114d7a5>] ? bio_alloc_bioset+0x4d/0xc4 [ 2849.098781] [<ffffffff810edf2b>] ? inc_zone_page_state+0x2d/0x2f [ 2849.098785] [<ffffffff811492ec>] submit_bh+0xe8/0x10e [ 2849.098788] [<ffffffff8114ba72>] __block_write_full_page+0x1ea/0x2da [ 2849.098793] [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf] [ 2849.098796] [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d [ 2849.098799] [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d [ 2849.098802] [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf] [ 2849.098805] [<ffffffff8114bbee>] block_write_full_page_endio+0x8c/0x98 [ 2849.098808] [<ffffffff8114bc0f>] block_write_full_page+0x15/0x17 [ 2849.098811] [<ffffffffa06e2027>] udf_writepage+0x18/0x1a [udf] [ 2849.098814] [<ffffffff810e44fd>] pageout+0x138/0x255 [ 2849.098817] [<ffffffff810e5ad7>] shrink_page_list+0x279/0x478 [ 2849.098820] [<ffffffff810e60ec>] shrink_inactive_list+0x23c/0x39a [ 2849.098824] [<ffffffff81481a46>] ? add_preempt_count+0xae/0xb2 [ 2849.098828] [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27 [ 2849.098831] [<ffffffff810e6849>] shrink_zone+0x362/0x464 [ 2849.098834] [<ffffffff810dbdf8>] ? zone_watermark_ok_safe+0xa1/0xae [ 2849.098837] [<ffffffff810e773f>] kswapd+0x51c/0x89f I'm unsure if there's any other problem left that can be attributed to compaction/migrate (especially considering the THP allocations have no __GFP_REPEAT set and should_continue_reclaim should break the loop if nr_reclaim is zero, plus compaction_suitable requires not much more memory to be reclaimed if compared to no-compaction). I'd suggest to try a few more times with "echo never > /sys/kernel/mm/transparent_hugepage/enabled" and see if it still makes a difference. 
I doubt udf is as good as other filesystems at optimizing locks and
writeback, but that may not be relevant to this issue (it almost certainly
wasn't until now, and the problems we had so far probably applied to every
filesystem; from this point on, things are less clear if responsiveness
is still worse than without THP). Another thing I find confusing is udf
on a USB stick: I'd expect udf on a USB DVD drive, not on a flash drive,
where I'd use vfat or ext4.

Reply-To: avillaci@fiec.espol.edu.ec

On 21/03/11 15:16, Andrea Arcangeli wrote:
> [...]
> I'd suggest to try a few more times with "echo never >
> /sys/kernel/mm/transparent_hugepage/enabled" and see if it still makes
> a difference.

With "never" echoed to /sys/kernel/mm/transparent_hugepage/enabled, the copy
still goes smoothly without any application freezes, even with the recent
patch applied.
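When repeating this test, it helps to confirm which THP mode is actually active. Reading /sys/kernel/mm/transparent_hugepage/enabled back returns all modes with the active one in square brackets, for example "always madvise [never]". Below is a minimal sketch of extracting that token, assuming the file content has already been read into a string; `sketch_thp_active_mode()` is a made-up helper name, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) mode out of a line such as
 * "always madvise [never]" into out.
 * Returns 0 on success, -1 if no well-formed bracketed token is found
 * or if it does not fit into out (including the terminating NUL).
 */
static int sketch_thp_active_mode(const char *line, char *out, size_t out_len)
{
	const char *open, *close;
	size_t len;

	assert(line != NULL && out != NULL);

	open = strchr(line, '[');
	if (!open)
		return -1;
	close = strchr(open + 1, ']');
	if (!close)
		return -1;

	len = (size_t)(close - open - 1);
	if (len == 0 || len >= out_len)
		return -1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}
```

A caller would read the sysfs file with ordinary file I/O and pass the resulting line in; a result of "never" corresponds to the configuration tested above.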
Reply-To: mel@csn.ul.ie On Mon, Mar 21, 2011 at 09:16:41PM +0100, Andrea Arcangeli wrote: > On Mon, Mar 21, 2011 at 12:05:40PM -0500, Alex Villacís Lasso wrote: > > El 21/03/11 11:37, Mel Gorman escribió: > > > On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote: > > > > > > Nothing bad jumped out at me. Lets see how it gets on with testing. > > > > > > Thanks > > > > > As with the previous patch, this one did not completely solve the freezing > tasks issue. However, as with the previous patch, the freezes took longer to > appear, and now lasted less (10 to 12 seconds instead of freezing until the > end of the usb copy). > > > > I have attached the new sysrq-w trace to the bug report. > > migrate and compaction disappeared from the traces as we hoped > for. The THP allocations left throttles on writeback during reclaim > like any 4k allocation would do: > > [ 2629.256809] [<ffffffff810e43c3>] wait_on_page_writeback+0x1b/0x1d > [ 2629.256812] [<ffffffff810e5992>] shrink_page_list+0x134/0x478 > [ 2629.256815] [<ffffffff810e614f>] shrink_inactive_list+0x29f/0x39a > [ 2629.256818] [<ffffffff810dbd55>] ? zone_watermark_ok+0x1f/0x21 > [ 2629.256820] [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27 > [ 2629.256823] [<ffffffff810e6849>] shrink_zone+0x362/0x464 > [ 2629.256827] [<ffffffff810e6c87>] do_try_to_free_pages+0xdd/0x2e3 > [ 2629.256830] [<ffffffff810e70eb>] try_to_free_pages+0xaa/0xef > [ 2629.256833] [<ffffffff810deede>] __alloc_pages_nodemask+0x4cc/0x772 > [ 2629.256837] [<ffffffff8110c0ea>] alloc_pages_vma+0xec/0xf1 > [ 2629.256840] [<ffffffff8111be94>] do_huge_pmd_anonymous_page+0xbf/0x267 > [ 2629.256844] [<ffffffff810f24a3>] ? pmd_offset+0x19/0x40 > [ 2629.256846] [<ffffffff810f5c7c>] handle_mm_fault+0x15d/0x20f > [ 2629.256850] [<ffffffff8100f298>] ? > arch_get_unmapped_area_topdown+0x1c3/0x28f > [ 2629.256853] [<ffffffff814818cc>] do_page_fault+0x33b/0x35d > [ 2629.256856] [<ffffffff810fb089>] ? 
do_mmap_pgoff+0x29a/0x2f4 > [ 2629.256859] [<ffffffff8112dd66>] ? path_put+0x22/0x27 > [ 2629.256861] [<ffffffff8147f285>] page_fault+0x25/0x30 > There is an important difference between THP and generic order-0 reclaim though. Once defrag is enabled in THP, it can enter direct reclaim for reclaim/compaction where more pages may be claimed than for a base page fault thereby encountering more dirty pages and stalling. > They throttle on writeback I/O completion like kswapd too: > > [ 2849.098751] [<ffffffff8147d00b>] io_schedule+0x47/0x62 > [ 2849.098756] [<ffffffff8121c47b>] get_request_wait+0x10a/0x197 > [ 2849.098760] [<ffffffff8106cd77>] ? autoremove_wake_function+0x0/0x3d > [ 2849.098763] [<ffffffff8121cd3c>] __make_request+0x2c8/0x3e0 > [ 2849.098767] [<ffffffff81114889>] ? kmem_cache_alloc+0x73/0xeb > [ 2849.098771] [<ffffffff8121bbdf>] generic_make_request+0x2bc/0x336 > [ 2849.098774] [<ffffffff8121bd39>] submit_bio+0xe0/0xff > [ 2849.098777] [<ffffffff8114d7a5>] ? bio_alloc_bioset+0x4d/0xc4 > [ 2849.098781] [<ffffffff810edf2b>] ? inc_zone_page_state+0x2d/0x2f > [ 2849.098785] [<ffffffff811492ec>] submit_bh+0xe8/0x10e > [ 2849.098788] [<ffffffff8114ba72>] __block_write_full_page+0x1ea/0x2da > [ 2849.098793] [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf] > [ 2849.098796] [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d > [ 2849.098799] [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d > [ 2849.098802] [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf] > [ 2849.098805] [<ffffffff8114bbee>] block_write_full_page_endio+0x8c/0x98 > [ 2849.098808] [<ffffffff8114bc0f>] block_write_full_page+0x15/0x17 > [ 2849.098811] [<ffffffffa06e2027>] udf_writepage+0x18/0x1a [udf] > [ 2849.098814] [<ffffffff810e44fd>] pageout+0x138/0x255 > [ 2849.098817] [<ffffffff810e5ad7>] shrink_page_list+0x279/0x478 > [ 2849.098820] [<ffffffff810e60ec>] shrink_inactive_list+0x23c/0x39a > [ 2849.098824] [<ffffffff81481a46>] ? 
add_preempt_count+0xae/0xb2 > [ 2849.098828] [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27 > [ 2849.098831] [<ffffffff810e6849>] shrink_zone+0x362/0x464 > [ 2849.098834] [<ffffffff810dbdf8>] ? zone_watermark_ok_safe+0xa1/0xae > [ 2849.098837] [<ffffffff810e773f>] kswapd+0x51c/0x89f > > I'm unsure if there's any other problem left that can be attributed to > compaction/migrate (especially considering the THP allocations have no > __GFP_REPEAT set and should_continue_reclaim should break the loop if > nr_reclaim is zero, plus compaction_suitable requires not much more > memory to be reclaimed if compared to no-compaction). > I think we are breaking out because the report says the stalls aren't as bad but not before we have waited on writeback of a few dirty pages. This could be addressed in a number of ways but all of them impact THP in some way. 1. We could disable defrag by default. This will avoid the stalling at the cost of fewer pages being promoted even when plenty of clean pages were available. 2. We could redefine __GFP_NO_KSWAPD as __GFP_ASYNC to mean a) do not wake up kswapd that generates IO possibly causing syncs later b) does not queue any pages for IO itself and c) never waits on page writeback. This would also avoid stalls but it would disrupt LRU ordering by reclaiming younger pages than would otherwise have been reclaimed. 3. Again redefine __GFP_NO_KSWAPD but abort allocation if any dirty or writeback page is encountered during reclaim. This makes the assumption that dirty pages at the end of the LRU implies memory is under enough pressure to not care about promotion. This will also result in THP promoting fewer pages but has less impact on LRU ordering. Which would you prefer? Other suggestions? 
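To make the three options easier to compare, here is a toy model of what direct reclaim on behalf of a THP allocation would do with each page it looks at. It is a sketch of the proposals as worded above, not of any existing kernel code; the enum names and `sketch_thp_reclaim_step()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* The three proposals, numbered as in the message above. */
enum sketch_policy {
	POLICY_NO_DEFRAG = 1,      /* 1: defrag disabled by default */
	POLICY_ASYNC = 2,          /* 2: __GFP_ASYNC-style, never queue or wait for IO */
	POLICY_ABORT_ON_DIRTY = 3  /* 3: abort on the first dirty/writeback page */
};

enum sketch_action {
	ACT_NO_RECLAIM,  /* allocation does not enter direct reclaim at all */
	ACT_RECLAIM,     /* clean page: reclaim it */
	ACT_SKIP_PAGE,   /* leave the page alone and look at a younger one */
	ACT_ABORT_ALLOC  /* give up on the huge page, fall back to base pages */
};

/* What happens to one candidate page under the given policy. */
static enum sketch_action sketch_thp_reclaim_step(enum sketch_policy policy,
						  bool dirty, bool writeback)
{
	assert(policy >= POLICY_NO_DEFRAG && policy <= POLICY_ABORT_ON_DIRTY);

	if (policy == POLICY_NO_DEFRAG)
		return ACT_NO_RECLAIM;
	if (!dirty && !writeback)
		return ACT_RECLAIM;
	/* Dirty or under writeback: option 2 skips it, option 3 aborts. */
	return policy == POLICY_ASYNC ? ACT_SKIP_PAGE : ACT_ABORT_ALLOC;
}
```

The model makes the trade-off visible: option 2 keeps going and therefore reclaims younger pages (disturbing LRU order), while option 3 stops at the first dirty page and simply promotes fewer huge pages.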
Reply-To: aarcange@redhat.com On Tue, Mar 22, 2011 at 11:20:32AM +0000, Mel Gorman wrote: > On Mon, Mar 21, 2011 at 09:16:41PM +0100, Andrea Arcangeli wrote: > > On Mon, Mar 21, 2011 at 12:05:40PM -0500, Alex Villacís Lasso wrote: > > > El 21/03/11 11:37, Mel Gorman escribió: > > > > On Mon, Mar 21, 2011 at 02:48:32PM +0100, Andrea Arcangeli wrote: > > > > > > > > Nothing bad jumped out at me. Lets see how it gets on with testing. > > > > > > > > Thanks > > > > > > > As with the previous patch, this one did not completely solve the > freezing tasks issue. However, as with the previous patch, the freezes took > longer to appear, and now lasted less (10 to 12 seconds instead of freezing > until the end of the usb copy). > > > > > > I have attached the new sysrq-w trace to the bug report. > > > > migrate and compaction disappeared from the traces as we hoped > > for. The THP allocations left throttles on writeback during reclaim > > like any 4k allocation would do: > > > > [ 2629.256809] [<ffffffff810e43c3>] wait_on_page_writeback+0x1b/0x1d > > [ 2629.256812] [<ffffffff810e5992>] shrink_page_list+0x134/0x478 > > [ 2629.256815] [<ffffffff810e614f>] shrink_inactive_list+0x29f/0x39a > > [ 2629.256818] [<ffffffff810dbd55>] ? zone_watermark_ok+0x1f/0x21 > > [ 2629.256820] [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27 > > [ 2629.256823] [<ffffffff810e6849>] shrink_zone+0x362/0x464 > > [ 2629.256827] [<ffffffff810e6c87>] do_try_to_free_pages+0xdd/0x2e3 > > [ 2629.256830] [<ffffffff810e70eb>] try_to_free_pages+0xaa/0xef > > [ 2629.256833] [<ffffffff810deede>] __alloc_pages_nodemask+0x4cc/0x772 > > [ 2629.256837] [<ffffffff8110c0ea>] alloc_pages_vma+0xec/0xf1 > > [ 2629.256840] [<ffffffff8111be94>] do_huge_pmd_anonymous_page+0xbf/0x267 > > [ 2629.256844] [<ffffffff810f24a3>] ? pmd_offset+0x19/0x40 > > [ 2629.256846] [<ffffffff810f5c7c>] handle_mm_fault+0x15d/0x20f > > [ 2629.256850] [<ffffffff8100f298>] ? 
> arch_get_unmapped_area_topdown+0x1c3/0x28f > > [ 2629.256853] [<ffffffff814818cc>] do_page_fault+0x33b/0x35d > > [ 2629.256856] [<ffffffff810fb089>] ? do_mmap_pgoff+0x29a/0x2f4 > > [ 2629.256859] [<ffffffff8112dd66>] ? path_put+0x22/0x27 > > [ 2629.256861] [<ffffffff8147f285>] page_fault+0x25/0x30 > > > > There is an important difference between THP and generic order-0 reclaim > though. Once defrag is enabled in THP, it can enter direct reclaim for > reclaim/compaction where more pages may be claimed than for a base page > fault thereby encountering more dirty pages and stalling. Ok but then 2M gets allocated. With 4k pages it's true there's less work done for each 4k page but reclaim runs several more times to allocate 512 4k pages instead of a single 2M page. So I'm unsure if that should create a noticeable difference. > > They throttle on writeback I/O completion like kswapd too: > > > > [ 2849.098751] [<ffffffff8147d00b>] io_schedule+0x47/0x62 > > [ 2849.098756] [<ffffffff8121c47b>] get_request_wait+0x10a/0x197 > > [ 2849.098760] [<ffffffff8106cd77>] ? autoremove_wake_function+0x0/0x3d > > [ 2849.098763] [<ffffffff8121cd3c>] __make_request+0x2c8/0x3e0 > > [ 2849.098767] [<ffffffff81114889>] ? kmem_cache_alloc+0x73/0xeb > > [ 2849.098771] [<ffffffff8121bbdf>] generic_make_request+0x2bc/0x336 > > [ 2849.098774] [<ffffffff8121bd39>] submit_bio+0xe0/0xff > > [ 2849.098777] [<ffffffff8114d7a5>] ? bio_alloc_bioset+0x4d/0xc4 > > [ 2849.098781] [<ffffffff810edf2b>] ? inc_zone_page_state+0x2d/0x2f > > [ 2849.098785] [<ffffffff811492ec>] submit_bh+0xe8/0x10e > > [ 2849.098788] [<ffffffff8114ba72>] __block_write_full_page+0x1ea/0x2da > > [ 2849.098793] [<ffffffffa06e5202>] ? udf_get_block+0x0/0x115 [udf] > > [ 2849.098796] [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d > > [ 2849.098799] [<ffffffff8114a6b8>] ? end_buffer_async_write+0x0/0x12d > > [ 2849.098802] [<ffffffffa06e5202>] ? 
udf_get_block+0x0/0x115 [udf] > > [ 2849.098805] [<ffffffff8114bbee>] block_write_full_page_endio+0x8c/0x98 > > [ 2849.098808] [<ffffffff8114bc0f>] block_write_full_page+0x15/0x17 > > [ 2849.098811] [<ffffffffa06e2027>] udf_writepage+0x18/0x1a [udf] > > [ 2849.098814] [<ffffffff810e44fd>] pageout+0x138/0x255 > > [ 2849.098817] [<ffffffff810e5ad7>] shrink_page_list+0x279/0x478 > > [ 2849.098820] [<ffffffff810e60ec>] shrink_inactive_list+0x23c/0x39a > > [ 2849.098824] [<ffffffff81481a46>] ? add_preempt_count+0xae/0xb2 > > [ 2849.098828] [<ffffffff810dfe81>] ? determine_dirtyable_memory+0x1d/0x27 > > [ 2849.098831] [<ffffffff810e6849>] shrink_zone+0x362/0x464 > > [ 2849.098834] [<ffffffff810dbdf8>] ? zone_watermark_ok_safe+0xa1/0xae > > [ 2849.098837] [<ffffffff810e773f>] kswapd+0x51c/0x89f > > > > I'm unsure if there's any other problem left that can be attributed to > > compaction/migrate (especially considering the THP allocations have no > > __GFP_REPEAT set and should_continue_reclaim should break the loop if > > nr_reclaim is zero, plus compaction_suitable requires not much more > > memory to be reclaimed if compared to no-compaction). > > > > I think we are breaking out because the report says the stalls aren't as > bad but not before we have waited on writeback of a few dirty pages. This > could be addressed in a number of ways but all of them impact THP in some > way. > > 1. We could disable defrag by default. This will avoid the stalling at > the cost of fewer pages being promoted even when plenty of clean pages > were available. > > 2. We could redefine __GFP_NO_KSWAPD as __GFP_ASYNC to mean a) do not > wake up kswapd that generates IO possibly causing syncs later b) does > not queue any pages for IO itself and c) never waits on page writeback. > This would also avoid stalls but it would disrupt LRU ordering by > reclaiming younger pages than would otherwise have been reclaimed. > > 3. 
Again redefine __GFP_NO_KSWAPD but abort allocation if any dirty or
> writeback page is encountered during reclaim. This makes the assumption
> that dirty pages at the end of the LRU implies memory is under enough
> pressure to not care about promotion. This will also result in THP
> promoting fewer pages but has less impact on LRU ordering.
>
> Which would you prefer? Other suggestions?

I'm not particularly excited by any of the above options, first because
they would decrease the reliability of allocations. More importantly, they
won't solve anything for the non-THP allocations, which would still create
the same problem, considering that compaction is the issue (he verified
that setting defrag=no solves the problem). Think of SLUB, which even for
a radix tree allocation tries order 2 first, just like THP tries order 9
before order 0; neither SLUB nor the kernel stack allocation uses anything
like __GFP_ASYNC. So I'm afraid we're more likely to hide the issue by
tackling it entirely on the THP side.

Yesterday I asked Alex by PM whether the mouse pointer moves during the
stalls (if it doesn't, that may be a scheduler issue, with compaction
running with irqs disabled and lacking cond_resched) and to try aa.git.
Upstream still lacks several compaction improvements that we made over the
last few weeks, which are in my queue and in -mm as well. So before making
more changes, considering that the stack traces look very healthy now, I'd
wait to be sure the hangs aren't already solved by one of the other
scheduling/irq latency fixes. I guess they aren't going to help, but it's
worth a try. Verifying whether this happens with a more optimal filesystem
like ext4 would also be interesting; it may be something in udf's internal
locking that gets in the way of compaction.
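The allocation orders mentioned in this message map to page counts as powers of two, which is where the earlier "512 4k pages instead of a single 2M page" comes from. A small sketch of that arithmetic, assuming 4k base pages as on the reporter's x86_64 system (the `sketch_*` names are invented for illustration):

```c
#include <assert.h>

#define SKETCH_BASE_PAGE_SIZE 4096UL /* assumed 4k base pages */

/* Number of contiguous base pages in an allocation of the given order. */
static unsigned long sketch_pages_for_order(unsigned int order)
{
	assert(order < 20); /* keep the byte count within a 32-bit unsigned long */
	return 1UL << order;
}

/* Size in bytes of an allocation of the given order. */
static unsigned long sketch_bytes_for_order(unsigned int order)
{
	return sketch_pages_for_order(order) * SKETCH_BASE_PAGE_SIZE;
}
```

Under this assumption, order 9 (THP) is 512 pages or 2 MiB, order 2 (the SLUB example) is 4 pages or 16 KiB, and order 0 is a single 4 KiB page.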

If we still have a problem with current aa.git and ext4, then I'd hope
we can find some other more genuine bit to improve, like the bits we've
improved so far. But if there's nothing wrong and it turns out to be
unfixable, then my preference would be either to create a defrag mode
that is in between "yes/no", or alternatively, to be simpler, to make
the default between defrag yes|no configurable at build time and through
a kernel command line option in grub, and hope that SLUB doesn't clash
with it too. The current "default" is optimal for several server
environments where we know most of the allocations are long lived, so we
still want to have an option to be as reliable as we are today for those.

Reply-To: avillaci@fiec.espol.edu.ec

On 22/03/11 10:03, Andrea Arcangeli wrote:
>
> I asked yesterday by PM to Alex if the mouse pointer moves or not
> during the stalls (if it doesn't that may be a scheduler issue with
> the compaction irq disabled and lack of cond_resched) and to try
> aa.git. Upstream still misses several compaction improvements that we
> did over the last weeks and that I've in my queue and that are in -mm
> as well. So before making more changes, considering the stack traces
> looks very healthy now, I'd wait to be sure the hangs aren't already
> solved by any of the other scheduling/irq latency fixes. I guess they
> aren't going to help but it worth a try. Verifying if this happens
> with a more optimal filesystem like ext4 I think is also interesting,
> it may be something in udf internal locking that gets in the way of
> compaction.
>
> If we still have a problem with current aa.git and ext4, then I'd hope
> we can find some other more genuine bit to improve like the bits we've
> improved so far, but if there's nothing wrong and it gets unfixable,
> then my preference would be to either create a defrag mode that is in
> between "yes/no", or alternatively to be simpler and make the default
> between defrag yes|no configurable at build time and through a command
> line in grub, and hope that SLUB doesn't clashes on it too. The
> current "default" is optimal for several server environments where we
> know most of the allocations are long lived. So we want to still have
> an option to be as reliable as we are today for those.
>

I have just tested aa.git as of today, with the USB stick formatted as
FAT32. I could no longer reproduce the stalls. There was no need to
format as ext4. No /proc workarounds required.

Reply-To: aarcange@redhat.com

On Tue, Mar 22, 2011 at 03:34:10PM -0500, Alex Villacís Lasso wrote:
> I have just tested aa.git as of today, with the USB stick formatted
> as FAT32. I could no longer reproduce the stalls. There was no need
> to format as ext4. No /proc workarounds required.

Sounds good.

So Andrew, the patches to apply to solve this most certainly are:

http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-compaction-minimise-the-time-irqs-are-disabled-while-isolating-pages-for-migration.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-compaction-minimise-the-time-irqs-are-disabled-while-isolating-pages-for-migration-fix.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-compaction-minimise-the-time-irqs-are-disabled-while-isolating-free-pages.patch
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=6daf7ff3adc1a243aa9f5a77c7bde2b713a3188a
  (not in -mm, posted to linux-mm Message-ID 20110321134832.GC5719)

Very likely it's the combination of all the above that is equally
important and needed for this specific compaction issue.

==== rest of aa.git not relevant for this bugreport below ====

http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-compaction-prevent-kswapd-compacting-memory-to-reduce-cpu-usage.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-vmscan-kswapd-should-not-free-an-excessive-number-of-pages-when-balancing-small-zones.patch
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=cb107ebbb7541e5442fd897436440e71835b6496
  (not in -mm, posted to linux-mm Message-ID: alpine.LSU.2.00.1103192318100.1877)
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-add-__gfp_other_node-flag.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-add-__gfp_other_node-flag-checkpatch-fixes.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-use-__gfp_other_node-for-transparent-huge-pages.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-use-__gfp_other_node-for-transparent-huge-pages-checkpatch-fixes.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-add-vm-counters-for-transparent-hugepages.patch
http://userweb.kernel.org/~akpm/mmotm/broken-out/mm-add-vm-counters-for-transparent-hugepages-checkpatch-fixes.patch
smaps* (5 patches in -mm)

This is one experimental new feature that should improve mremap
significantly regardless of THP on or off (but with a bigger boost with
THP on). I've been using it for weeks without problems, and I'd suggest
it for inclusion too. The version in the link below is the most up to
date. It fixes a build problem
(s/__split_huge_page_pmd/split_huge_page_pmd/ in move_page_tables) with
CONFIG_TRANSPARENT_HUGEPAGE=n compared to the last version I posted to
linux-mm. But this is low priority.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=0e6f8bd8802c3309195d3e1a7af50093ed488f2d

Reply-To: aarcange@redhat.com

Hi Alex,

could you also try to reverse the bit below (not the whole previous
patch: only the bit quoted below) with "patch -p1 -R < thismail" on top
of your current aa.git tree, and see if you notice any regression
compared to the previous aa.git build that worked well?

This is part of the fix, but I'd need to be sure this really makes a
difference before sticking to it for long. I'm not concerned by keeping
it, but it adds dirt, and the closer THP allocations are to any other
high order allocation the better. So the less __GFP_NO_KSWAPD affects,
the better. The hint about not telling kswapd to insist in the
background for order 9 allocations with fallback (like THP) is the
maximum I consider clean, because there's khugepaged with its
alloc_sleep_millisecs that replaces the kswapd task for THP allocations.
So that is clean enough, but when __GFP_NO_KSWAPD starts to make
compaction behave slightly differently from a SLUB order 2 allocation I
don't like it (especially because if you later enable SLUB or some
driver you may run into the same compaction issue again if the change
below is making a difference).

If things work fine even after you reverse the change below, we can
safely undo it and also feel safer for all other high order allocations,
so it'll make life easier. (Plus we don't want unnecessary special
changes; we need to be sure this makes a difference to keep it for long.)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2085,7 +2085,7 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,

Created attachment 51782 [details]
sysrq-w trace for aa.git kernel
Bad news. Apparently the stalling *is* filesystem-sensitive. I reformatted the USB-stick to use UDF instead of FAT32, and I managed to trigger the application freezes again, on the aa.git kernel. Trace attached.
BTW, I was using UDF on the stick because I needed a filesystem with no user permissions to hold a file of about 5 GB without having to split it, and UDF seemed the most adequate for the task.
Created attachment 51792 [details]
second sysrq-w trace for aa.git kernel
I have just triggered the freeze for the stick with vfat on the aa.git kernel, although I had to wait a bit longer. Trace attached.
Reply-To: avillaci@fiec.espol.edu.ec

On 22/03/11 19:37, Andrea Arcangeli wrote:
> Hi Alex,
>
> could you also try to reverse this below bit (not the whole previous
> patch: only the bit below quoted below) with "patch -p1 -R < thismail"
> on top of your current aa.git tree, and see if you notice any
> regression compared to the previous aa.git build that worked well?
>
> This is part of the fix, but I'd need to be sure this really makes a
> difference before sticking to it for long. I'm not concerned by
> keeping it, but it adds dirt, and the closer THP allocations are to
> any other high order allocation the better. So the less
> __GFP_NO_KSWAPD affects the better. The hint about not telling kswapd
> to insist in the background for order 9 allocations with fallback
> (like THP) is the maximum I consider clean because there's khugepaged
> with its alloc_sleep_millisecs that replaces the kswapd task for THP
> allocations. So that is clean enough, but when __GFP_NO_KSWAPD starts
> to make compaction behave slightly different from a SLUB order 2
> allocation I don't like it (especially because if you later enable
> SLUB or some driver you may run into the same compaction issue again
> if the below change is making a difference).
>
> If things works fine even after you reverse the below, we can safely
> undo this change and also feel safer for all other high order
> allocations, so it'll make life easier. (plus we don't want
> unnecessary special changes, we need to be sure this makes a
> difference to keep it for long)
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2085,7 +2085,7 @@ rebalance:
>  					sync_migration);
>  	if (page)
>  		goto got_pg;
> -	sync_migration = true;
> +	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>
>  	/* Try direct reclaim and then allocating */
>  	page = __alloc_pages_direct_reclaim(gfp_mask, order,
>
> On Tue, Mar 22, 2011 at 03:34:10PM -0500, Alex Villacís Lasso wrote:
>> I have just tested aa.git as of today, with the USB stick formatted
>> as FAT32. I could no longer reproduce the stall
> Probably udf is not optimized enough but I wonder if maybe the
> udf->vfat change helped more than the other patches. We need the other
> patches anyway to provide responsive behavior including the one you
> tested before aa.git so it's not very important if udf was the
> problem, but it might have been.
>

I tried to reformat the stick as UDF to check whether the stall was
filesystem-sensitive. Apparently it is. I managed to induce the freeze
on firefox while performing the same copy on the aa.git kernel. Then I
reformatted the stick as FAT32 and repeated the test, and it also
induced freezes, although they were a bit shorter and occurred late in
the copy progress. I have attached the traces in the bug report. All of
this is with the kernel before reversing the quoted patch.

*** Bug 28432 has been marked as a duplicate of this bug. ***

A patch referencing this bug report has been merged in v2.6.38-8569-g16c29da:

commit 11bc82d67d1150767901bca54a24466621d763d7
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Tue Mar 22 16:33:11 2011 -0700

    mm: compaction: Use async migration for __GFP_NO_KSWAPD and enforce no writeback

The original issue seems to have been greatly reduced, if not completely
eliminated, in 2.6.39-rc1. I did the same old test, the 4GB copy into
the USB stick, VFAT formatted.

Thanks for the update. Should this bug be closed then? If not, can you
contact Andrea and let him know this is still a problem?

Thanks,
Flo

Reply-To: avillaci@fiec.espol.edu.ec

Latest update: with 2.6.39-rc1 the stalls only last for a few seconds,
but they are still there. So, as I said before, they are reduced but not
eliminated.

Created attachment 53442 [details]
sysrq-w trace for 2.6.39-rc1 kernel
I got a longer stall that froze firefox. Trace attached.
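The "sysrq-w" traces attached throughout this thread come from the kernel's SysRq 'w' command, which logs the tasks currently in uninterruptible (blocked) state. The thread does not say how they were captured; one way, sketched below as an assumption, is to write the character `w` to `/proc/sysrq-trigger` as root and then read the kernel log. The helper names `sysrq_send` and `dump_blocked_tasks` are made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write one SysRq command character to a trigger file.
 * The path is a parameter so the helper can be tried against an ordinary
 * file; on Linux the real trigger is /proc/sysrq-trigger (root only).
 * Returns 0 on success or a negative errno value on failure. */
int sysrq_send(const char *trigger_path, char command)
{
    assert(trigger_path != NULL);

    int fd = open(trigger_path, O_WRONLY);
    if (fd < 0)
        return -errno;

    ssize_t written = write(fd, &command, 1);
    int result = 0;
    if (written < 0)
        result = -errno;   /* capture errno before close() can change it */
    else if (written != 1)
        result = -EIO;

    close(fd);
    return result;
}

/* 'w' asks the kernel to log all tasks that are blocked in
 * uninterruptible state; the output lands in the kernel log. */
int dump_blocked_tasks(void)
{
    return sysrq_send("/proc/sysrq-trigger", 'w');
}
```

Calling `dump_blocked_tasks()` while an application is frozen and then saving the kernel log should give a trace comparable to the attachments here.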
Reply-To: aarcange@redhat.com

Hi Alex,

On Mon, Apr 04, 2011 at 10:37:44AM -0500, Alex Villacís Lasso wrote:
> Latest update: with 2.6.39-rc1 the stalls only last for a few
> seconds, but they are still there. So, as I said before, they are
> reduced but not eliminated.

OK, going from a complete stall for the whole duration of the I/O to a
stall of a few seconds, we're clearly going in the right direction.

Do the few-second stalls happen with udf? And vfat doesn't stall?

Minchan rightly pointed out that the (panic) change we made in
page_alloc.c changes the semantics of the __GFP_NO_KSWAPD bit. I also
talked with Mel about it. We think it's nicer if we can keep THP
allocations as close as possible to any other high order allocation.
There are already plenty of __GFP bits with complex semantics.
__GFP_NO_KSWAPD is simple and it's nice for it to stay simple: it means
the allocation relies on a different kernel daemon for the background
work (which is khugepaged instead of kswapd in the THP case, where
khugepaged uses a non-intrusive alloc_sleep_millisecs throttling in case
of MM congestion, unlike what kswapd would do).

So this is no solution to your problem (if vfat already works I think
that might be a better solution), but we'd like to know if you get any
_worse_ stall compared to current 2.6.39-rc by applying the patch below
on top of 2.6.39-rc. If this doesn't make any difference, we can safely
apply it to remove unnecessary complications.

Thanks,
Andrea

===
Subject: compaction: reverse the change that forbid sync migration with __GFP_NO_KSWAPD

From: Andrea Arcangeli <aarcange@redhat.com>

It's uncertain this has been beneficial, so it's safer to undo it. All
other compaction users would still go in synchronous mode if a first
attempt of async compaction failed. Hopefully we don't need to force
special behavior for THP (which is the only __GFP_NO_KSWAPD user so far
and it's the easiest to exercise and to be noticeable). This also makes
__GFP_NO_KSWAPD return to its original strict semantics specific to
bypassing kswapd, as THP allocations have khugepaged for the async THP
allocations/compactions.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2105,7 +2105,7 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
+	sync_migration = true;
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,

Reply-To: avillaci@fiec.espol.edu.ec

On 08/04/11 14:09, Andrea Arcangeli wrote:
> The few second stalls happen with udf? And vfat doesn't stall?
> [...]
> we'd like to know if you get any _worse_ stall compared to current
> 2.6.39-rc, by applying the below patch on top of 2.6.39-rc.

The stalls occur even with vfat. I am no longer using udf, since (right
now) it is not necessary. I will test this patch now.

Reply-To: avillaci@fiec.espol.edu.ec

On 08/04/11 15:06, Alex Villacís Lasso wrote:
> The stalls occur even with vfat. I am no longer using udf, since (right now)
> it is not necessary. I will test this patch now.

From preliminary tests, I feel that the patch actually eliminates the
stalls. I have just copied nearly 6 GB of data into my USB stick and
noticed no application freezes.

Reply-To: avillaci@fiec.espol.edu.ec

On 12/04/11 11:27, Alex Villacís Lasso wrote:
> From preliminary tests, I feel that the patch actually eliminates the stalls.
> I have just copied nearly 6 GB of data into my USB stick and noticed no
> application freezes.

I retract that. I have tested 2.6.39-rc3 after a day of having several
heavy applications loaded in memory, and the stalls do get worse when
reversing the patch.

Reply-To: aarcange@redhat.com

Hello Alex,

On Thu, Apr 14, 2011 at 12:25:48PM -0500, Alex Villacís Lasso wrote:
> I retract that. I have tested 2.6.39-rc3 after a day of having
> several heavy applications loaded in memory, and the stalls do get
> worse when reversing the patch.

Well, I was already afraid your stalling wasn't 100% reproducible. It
depends on background load, like you said. I think that if it never
happens when you haven't started the heavy applications yet, that is a
good enough result for now. When we throttle on I/O and there are heavy
apps, things may stall even without a usb drive, because the more memory
pressure there is, the more every little write() or memory allocation
may stall too, regardless of compaction. We're not perfect in that area
yet (the best way to tackle this is per-process write throttling
levels).

Bottom line is that the additional congestion created by the heavy app
is far from easy to quantify or to assume to be reproducible, so it is
very possible it was the same as before; and in general, if we want to
apply that change, it's cleaner to do it unconditionally for all
allocation orders and not as a function of __GFP_NO_KSWAPD. So I think
the patch is still good to go.

I've also been affected by this for a very long time; since no one could
give me a solution on various forums and the kernel.org mailing list, I
decided to post here. This issue is more reproducible with slow external
storage devices and even various network file systems. Tweaking vm.*
does not help much. As far as I remember, this issue started at 2.6.38.
Also, when a large amount of RAM is available (relative to the processes
running), this issue is not reproducible. I confirm this issue on both
Debian running the official generic kernel, and Gentoo running a custom
kernel.

dE: Please open a bug of your own against a current kernel; the cause of
any stalling may well be very different.

Closing the original bug as codefix |
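For readers following the back-and-forth over the one-line change in mm/page_alloc.c, the two behaviours under discussion can be modelled in user space. This is only a sketch under stated assumptions: `gfp_t` here is a plain typedef and `DEMO_GFP_NO_KSWAPD` is a made-up bit value, not the kernel's definition, and both function names are hypothetical. It models only the value assigned to `sync_migration` after the first (asynchronous) compaction attempt has failed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's gfp_t and __GFP_NO_KSWAPD.
 * The bit value is arbitrary and chosen only for this illustration. */
typedef unsigned int gfp_t;
#define DEMO_GFP_NO_KSWAPD 0x1u

/* Merged fix: after a failed async compaction attempt, retry with
 * synchronous migration unless the caller set the NO_KSWAPD bit
 * (in this thread, THP is described as the only such caller). */
bool sync_migration_merged(gfp_t gfp_mask)
{
    return !(gfp_mask & DEMO_GFP_NO_KSWAPD);
}

/* Proposed revert: every caller retries with synchronous migration,
 * regardless of the NO_KSWAPD bit. */
bool sync_migration_reverted(gfp_t gfp_mask)
{
    (void)gfp_mask; /* the mask no longer influences the decision */
    return true;
}
```

The only input for which the two functions differ is a mask with the NO_KSWAPD bit set, which is why the discussion centres on whether THP allocations should be treated differently from other high order allocations.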
Created attachment 50902 [details]
kernel backtraces from hung Eclipse task while writing to usb stick

System is Fedora 14 x86_64 with 4 GB RAM, running vanilla kernel 2.6.38-rc8.

I have a USB 2.0 high-speed memory stick with around 7.5 GB of space.
Whenever I write a large amount of data (several GBs of files) through
any means (cp, nautilus GUI, etc), I notice some large applications that
I consider unrelated to the I/O operation (Firefox web browser,
Thunderbird email viewer, Eclipse IDE) may randomly freeze whenever I
try to interact with them. I use Compiz, and I notice the apps getting
grayed out, but I have also seen the freeze happening with Metacity and
Gnome-shell, so I believe the window manager is irrelevant. Sometimes
other smaller tasks (gnome-terminal, gedit) also freeze. For Eclipse,
the hang also causes a series of kernel backtraces, attached to this
report. The hang usually lasts for several tens of seconds, and may
freeze and unfreeze several times while the file copying to USB takes
place. All of the hung applications unfreeze themselves after write
activity (as seen from the LED in the memory stick) ceases.

Reproducibility: always (with sufficiently large bulk write)

To reproduce:
1) have a usb stick with several GB of free space, with any filesystem
   (tried vfat and udf)
2) prepare several GB of files to copy from hard disk to usb stick
3) start large application (firefox, eclipse, or thunderbird)
4) check that application is responsive before file copy starts
5) insert usb stick and (auto)mount it. Previously started app is still
   responsive.
6) start file copy to usb stick with any command
7) attempt to interact with chosen application during the entirety of
   the file write

Expected result: I/O to usb stick takes place in background, unrelated
apps continue to be responsive in foreground.

Actual result: some large tasks freeze for tens of seconds while write
takes place.

Feel free to reassign this bug to a different category. It involves I/O,
block, USB, and mmap.