Bug 33622 - during hibernation, preallocating image memory can take several minutes
during hibernation, preallocating image memory can take several minutes
Status: CLOSED INSUFFICIENT_DATA
Product: Power Management
Classification: Unclassified
Component: Hibernation/Suspend
All Linux
: P1 low
Assigned To: Aaron Lu
:
Depends on:
Blocks: 7216
  Show dependency treegraph
 
Reported: 2011-04-18 19:27 UTC by Ferenc Wágner
Modified: 2013-02-20 06:28 UTC (History)
4 users (show)

See Also:
Kernel Version: 3.2
Tree: Mainline
Regression: No


Attachments
screenshot montage of the lockup messages (39 bytes, text/plain)
2011-07-01 17:34 UTC, Ferenc Wágner
Details
SYSRQ all buttons (208.71 KB, text/plain)
2011-07-11 17:57 UTC, chr()
Details
SYSRQ console dump (159.40 KB, text/plain)
2011-07-11 21:11 UTC, chr()
Details
PM / Freezer: Freeze filesystems after user space has been frozen (3.61 KB, patch)
2011-07-31 11:28 UTC, Rafael J. Wysocki
Details | Diff
fs: Fix fsync() regression on block devices (579 bytes, patch)
2011-08-03 20:54 UTC, Rafael J. Wysocki
Details | Diff
A different trace on 3.1.0-rc1+ with the freezing patch applied (863.31 KB, image/jpeg)
2011-08-18 08:40 UTC, Ferenc Wágner
Details
PM / Hibernate: Use GFP_NOFS for memory preallocation (550 bytes, patch)
2011-08-27 21:11 UTC, Rafael J. Wysocki
Details | Diff
blocked task warning on 3.1.0-rc3+ with the GFP_NOFS patch applied (833.96 KB, image/jpeg)
2011-08-30 10:40 UTC, Ferenc Wágner
Details
Preallocation statistics on 3.1.0-rc3+ with the GFP_NOFS patch applied (749.87 KB, image/jpeg)
2011-08-30 10:48 UTC, Ferenc Wágner
Details
another blocked task warning during preallocation (GFP_NOFS) (859.36 KB, image/jpeg)
2011-08-31 12:48 UTC, Ferenc Wágner
Details
some other blocked task warnings later from the same preallocation phase (769.43 KB, image/jpeg)
2011-08-31 12:49 UTC, Ferenc Wágner
Details
strange pauses after the finally succeeding preallocation (953.08 KB, image/jpeg)
2011-08-31 12:57 UTC, Ferenc Wágner
Details
Unfortunately, resume halted indefinitely with this on screen (990.04 KB, image/jpeg)
2011-08-31 12:59 UTC, Ferenc Wágner
Details
SysRq-w showed lots of blocked tasks, but there were no automatic warnings (817.28 KB, image/jpeg)
2011-08-31 13:01 UTC, Ferenc Wágner
Details
screenshot of new hung task warnings (121.85 KB, image/jpeg)
2012-01-27 16:41 UTC, Ferenc Wágner
Details

Description Ferenc Wágner 2011-04-18 19:27:36 UTC
When there isn't much free memory when hibernation is initiated, the
PM: Preallocating image memory...
phase can occasionally take more than 10 minutes, which seems excessive on this laptop with 512 MB of memory. I understand that in this case quite some stuff has to be swapped out before the snapshot can be made, but most of this time is spent with no HDD activity at all: after the first couple of seconds the HDD indicator gives only the quickest flash in every 2 or 3 seconds.
The worst is that I get no indication whatsoever about the process: at first I thought the system just froze and powered it off to safely get off the train. I've had much preferred to be told that hibernation is infeasible and suspended to RAM instead. But a time estimate could have helped as well, of course.
/sys/power/image_size contains 204791808 all the time, I don't use uswsusp or similar, pm-utils defaults to the kernel method.
Comment 1 Rafael J. Wysocki 2011-04-18 19:47:10 UTC
Has the problem started to occur at one point or has it been always present
(as long as memory preallocation is used by the core hibernate code)?
Comment 2 Ferenc Wágner 2011-04-18 23:06:39 UTC
Hard to tell, because earlier I used uswsusp, up to 2.6.36. I can't remember such delays with that. But then I was forced to upgrade my user-space due to the i915 KMS migration, I encountered several PM problems which made me abandon uswsusp for simplicity. Anyway I can't easily test older kernels now. When was memory preallocation introduced, btw? After all, I could write a simple console application which allocates memory, one doesn't need a full featured graphical browser just for this...
Comment 3 Ferenc Wágner 2011-07-01 17:34:29 UTC
Created attachment 64392 [details]
screenshot montage of the lockup messages

Now, with kernel 3.0.0-rc5, hibernation seems to hang all the time. In every 2 minutes a pair of "hung task" messages appear as shown on this screenshot montage. SysRq is active at this time, but the blocked task list is far too long to photograph. The kernel seems to be waiting for some XFS stuff here...
Comment 4 Rafael J. Wysocki 2011-07-01 19:24:18 UTC
Since the problem appears to be readily reproducible, would you be able to
check if XFS is necessary to trigger it?
Comment 5 Ferenc Wágner 2011-07-04 15:40:29 UTC
Not really, sorry. I couldn't trigger it from the initramfs: the system hibernated all right with a full buffer cache (even quickly) or with a full tmpfs (after pushing most of it out to swap with continuous disk activity as it should). So the problem appears reliably after "real" use, but I couldn't yet synthesize it. To be precise, something very similar happened after I hibernated with Emacs having a large file opened: the system failed to resume with very similar looking traces. I will continue trying to get a reliable and safe reproduction.
Comment 6 chr() 2011-07-11 17:57:27 UTC
Created attachment 65302 [details]
SYSRQ all buttons

Hi!

While hunting bug #15816 I ran a lot of tests in some virtual machines.
Actually (2.6.39.3) pm-hibernate works fine except that ...

"preallocating image memory can take several minutes" (f...ing long)

The last vm I used for testing was with ext4.
My real machine always locks up (uses xfs).

This log is from testing with 2.6.39.2, xfs-root:

PM: Preallocating image memory ...
[machine does some decent I/O for some seconds but then deep mediation]

SysRq : Show Blocked State

Every task is in refrigerator except this:

pm-hibernate    D 0000000000000000     0  3638   3637 0x00000000
 ffff8800017bf918 0000000000000082 ffff8800017be010 ffff880000000000
 ffff8800017be010 ffff88000b8a6170 0000000000013900 ffff8800017bffd8
 ffff8800017bffd8 0000000000013900 ffffffff8148b020 ffff88000b8a6170
Call Trace:
 [<ffffffff81344ce2>] schedule_timeout+0x22/0xbb
 [<ffffffff81344b64>] wait_for_common+0xcb/0x148
 [<ffffffff810408ea>] ? try_to_wake_up+0x18c/0x18c
 [<ffffffff81345527>] ? down_write+0x2d/0x31
 [<ffffffff81344c7b>] wait_for_completion+0x18/0x1a
 [<ffffffffa02374da>] xfs_reclaim_inode+0x74/0x258 [xfs]
 [<ffffffffa0237853>] xfs_reclaim_inodes_ag+0x195/0x264 [xfs]
 [<ffffffffa0237974>] xfs_reclaim_inode_shrink+0x52/0x90 [xfs]
 [<ffffffff810c4e21>] shrink_slab+0xdb/0x151
 [<ffffffff810c625a>] do_try_to_free_pages+0x204/0x39a
 [<ffffffff8134ce4e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff810c647f>] shrink_all_memory+0x8f/0xa8
 [<ffffffff810cc41a>] ? next_online_pgdat+0x20/0x41
 [<ffffffff8107937d>] hibernate_preallocate_memory+0x1c4/0x30f
 [<ffffffff811a8fa2>] ? kobject_put+0x47/0x4b
 [<ffffffff81077eb2>] hibernation_snapshot+0x45/0x281
 [<ffffffff810781bf>] hibernate+0xd1/0x1b8
 [<ffffffff81076c58>] state_store+0x57/0xce
 [<ffffffff811a8d0b>] kobj_attr_store+0x17/0x19
 [<ffffffff81152bda>] sysfs_write_file+0xfc/0x138
 [<ffffffff810fca74>] vfs_write+0xa9/0x105
 [<ffffffff810fcb89>] sys_write+0x45/0x6c
 [<ffffffff8134c492>] system_call_fastpath+0x16/0x1b

kworker/0:2     D 0000000000000000     0   379      2 0x00000000
 ffff88000ea99b90 0000000000000046 ffff88000b8ff840 0000000000000000
 ffff88000ea98010 ffff88000c319ac0 0000000000013900 ffff88000ea99fd8
 ffff88000ea99fd8 0000000000013900 ffffffff8148b020 ffff88000c319ac0
Call Trace:
 [<ffffffff8134515f>] __mutex_lock_common+0x12e/0x195
 [<ffffffff813451da>] __mutex_lock_slowpath+0x14/0x16
 [<ffffffff813452a8>] mutex_lock+0x1e/0x38
 [<ffffffffa0237738>] xfs_reclaim_inodes_ag+0x7a/0x264 [xfs]
 [<ffffffffa02379cc>] ? xfs_reclaim_inodes+0x1a/0x1a [xfs]
 [<ffffffffa02379ca>] xfs_reclaim_inodes+0x18/0x1a [xfs]
 [<ffffffffa02379e9>] xfs_reclaim_worker+0x1d/0x29 [xfs]
 [<ffffffff81059ec4>] process_one_work+0x1de/0x2ec
 [<ffffffff8105bf60>] worker_thread+0x13d/0x262
 [<ffffffff8105be23>] ? manage_workers+0x194/0x194
 [<ffffffff8105f25a>] kthread+0x7d/0x85
 [<ffffffff8134d5a4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8105f1dd>] ? kthread_worker_fn+0x148/0x148
 [<ffffffff8134d5a0>] ? gs_change+0x13/0x13

Hope this helps hunting.
Comment 7 chr() 2011-07-11 21:11:42 UTC
Created attachment 65322 [details]
SYSRQ console dump

I ran an automated pm-hibernate mem stress test in virtualbox with 2.6.39.3, 1.5GB RAM
ext4 30x ok
Using xfs on root locked up in Round 13. I attach the four-button-log.

Talking about speed:

ext4 rootfs:
1  [  115.953022] PM: Allocated 946788 kbytes in 7.33 seconds (129.16 MB/s)
2  [  162.389327] PM: Allocated 946788 kbytes in 18.48 seconds (51.23 MB/s)
3  [  187.594767] PM: Allocated 946764 kbytes in 0.38 seconds (2491.48 MB/s)
4  [  217.393836] PM: Allocated 946772 kbytes in 4.66 seconds (203.16 MB/s)
5  [  333.061986] PM: Allocated 946788 kbytes in 89.46 seconds (10.58 MB/s)
6  [  415.016411] PM: Allocated 946788 kbytes in 47.93 seconds (19.75 MB/s)
7  [  464.904229] PM: Allocated 946780 kbytes in 10.58 seconds (89.48 MB/s)
8  [  495.034637] PM: Allocated 946780 kbytes in 0.23 seconds (4116.43 MB/s)
9  [  529.224049] PM: Allocated 946788 kbytes in 5.31 seconds (178.30 MB/s)
10  [  637.020696] PM: Allocated 946788 kbytes in 73.05 seconds (12.96 MB/s)
11  [  670.111831] PM: Allocated 946772 kbytes in 4.51 seconds (209.92 MB/s)
12  [  709.350145] PM: Allocated 946780 kbytes in 6.99 seconds (135.44 MB/s)
13  [  769.712771] PM: Allocated 946780 kbytes in 28.04 seconds (33.76 MB/s)
14  [  808.437632] PM: Allocated 946788 kbytes in 0.27 seconds (3506.62 MB/s)
15  [  856.310100] PM: Allocated 946788 kbytes in 13.42 seconds (70.55 MB/s)
16  [  891.479187] PM: Allocated 946772 kbytes in 2.65 seconds (357.27 MB/s)
17  [  976.008879] PM: Allocated 946788 kbytes in 50.85 seconds (18.61 MB/s)
18  [ 1004.876786] PM: Allocated 946788 kbytes in 0.44 seconds (2151.79 MB/s)
19  [ 1086.111001] PM: Allocated 946788 kbytes in 52.01 seconds (18.20 MB/s)
20  [ 1126.903633] PM: Allocated 946788 kbytes in 5.91 seconds (160.20 MB/s)
21  [ 1166.799932] PM: Allocated 946780 kbytes in 9.63 seconds (98.31 MB/s)
22  [ 1194.316118] PM: Allocated 946772 kbytes in 0.21 seconds (4508.43 MB/s)
23  [ 1223.257803] PM: Allocated 946780 kbytes in 1.49 seconds (635.42 MB/s)
24  [ 1256.814404] PM: Allocated 946772 kbytes in 7.19 seconds (131.67 MB/s)
25  [ 1300.201709] PM: Allocated 946772 kbytes in 11.50 seconds (82.32 MB/s)
26  [ 1355.503583] PM: Allocated 946780 kbytes in 20.93 seconds (45.23 MB/s)
27  [ 1384.603171] PM: Allocated 946772 kbytes in 0.25 seconds (3787.08 MB/s)
28  [ 1420.846065] PM: Allocated 946788 kbytes in 9.99 seconds (94.77 MB/s)
29  [ 1467.444989] PM: Allocated 946788 kbytes in 16.50 seconds (57.38 MB/s)
30  [ 1519.377339] PM: Allocated 946788 kbytes in 15.16 seconds (62.45 MB/s)
31  [ 1551.678506] PM: Allocated 946788 kbytes in 0.24 seconds (3944.95 MB/s)
no lock up.

xfs:
 1  [   72.535756] PM: Allocated 946788 kbytes in 0.13 seconds (7282.98 MB/s)
 2  [  100.122310] PM: Allocated 946788 kbytes in 2.78 seconds (340.57 MB/s)
 3  [  130.169771] PM: Allocated 946780 kbytes in 2.92 seconds (324.23 MB/s)
 4  [  159.460297] PM: Allocated 946780 kbytes in 1.06 seconds (893.18 MB/s)
 5  [  188.108237] PM: Allocated 946756 kbytes in 2.29 seconds (413.43 MB/s)
 6  [  214.464016] PM: Allocated 946788 kbytes in 0.50 seconds (1893.57 MB/s)
 7  [  240.820218] PM: Allocated 946788 kbytes in 0.18 seconds (5259.93 MB/s)
 8  [  264.931970] PM: Allocated 946788 kbytes in 0.19 seconds (4983.09 MB/s)
 9  [  291.652658] PM: Allocated 946788 kbytes in 1.34 seconds (706.55 MB/s)
10  [  317.058587] PM: Allocated 946788 kbytes in 1.32 seconds (717.26 MB/s)
11  [  372.441160] PM: Allocated 946788 kbytes in 4.42 seconds (214.20 MB/s)
12  [  446.355721] PM: Allocated 946788 kbytes in 19.31 seconds (49.03 MB/s)
13  lockup.
Comment 8 Dave Chinner 2011-07-19 08:00:14 UTC
Not an XFS bug - hibernate is making broken assumptions when it comes to quiescing filesystems (actually, failing to quiesce) prior to allocating the suspend image. See here:

http://oss.sgi.com/archives/xfs/2011-07/msg00313.html
Comment 9 Rafael J. Wysocki 2011-07-20 05:13:46 UTC
Well, actually, it's not making any assumptions at all and it's not
entirely clear who should be responsible for what in that respect.

We definitely need to handle filesystems during hibernation (and suspend
for that matter), but I don't really know how to do that so everyone
involved is happy.
Comment 10 Rafael J. Wysocki 2011-07-31 11:28:05 UTC
Created attachment 67262 [details]
PM / Freezer: Freeze filesystems after user space has been frozen

chr(): Please test if this patch makes a difference.
Comment 11 Rafael J. Wysocki 2011-07-31 11:36:44 UTC
@chr(): Please use "echo disk > /sys/power/state" for the testing, because the
patch breaks s2disk.

Also, the patch is on top of the current Linus' tree (may or may not apply to
earlier kernels).
Comment 12 Rafael J. Wysocki 2011-07-31 20:05:43 UTC
Sorry, there seems to be a separate bug that prevents s2disk from working
correctly in the current Linus' tree.
Comment 13 Rafael J. Wysocki 2011-08-03 20:54:26 UTC
Created attachment 67452 [details]
fs: Fix fsync() regression on block devices

This patch is necessary for s2disk to work with the current Linus' tree.
Comment 14 Rafael J. Wysocki 2011-08-07 18:08:07 UTC
The patch from comment #13 has been merged, so the patch from comment #10
should work on top of the current Linus' tree.
Comment 15 Ferenc Wágner 2011-08-17 12:32:31 UTC
Rafael, I included your patch and will test it in the coming days. However, I'm somewhat sceptical. Please forgive my ignorance, but didn't Dave recommend allocating memory after freezing user space processes but before freezing the filesystems and the rest of the processes? Your patch seems to freeze the filesystems before preallocating the image memory, or am I mistaken?
Comment 16 Rafael J. Wysocki 2011-08-17 18:04:26 UTC
The patch freezes filesystems before preallocating memory, but that should
work too, because in that case the preallocation won't trigger the filesystems'
memory shrinkers to run.

Anyway, please test it.
Comment 17 Ferenc Wágner 2011-08-18 08:40:49 UTC
Created attachment 69242 [details]
A different trace on 3.1.0-rc1+ with the freezing patch applied

The first hibernation took quite some time, with sporadic disk activity during the preallocation phase after freezing the filesystems, but was successful in the end. The second one resulted in the attached warning, and either hung or took much too long.
Comment 18 Rafael J. Wysocki 2011-08-18 21:52:34 UTC
This means that the freezing patch is not effective (i.e. the XFS shrinker
triggers even though the filesystem has been frozen).  So most likely
we should preallocate memory before freezing filesystems and kernel threads.

@Dave: IIRC your position is that we need to freeze XFS during hibernation
to prevent the data on disk from being modified after we've created the
image.  Can you please remind me what the exact corruption scenario is in
that case?

@Ferenc: I'll have another patch to test in the next couple of days, stay
tuned.
Comment 19 Dave Chinner 2011-08-18 23:16:43 UTC
(In reply to comment #16)
> The patch freezes filesystems before preallocating memory, but that should
> work too, because in that case the preallocation won't trigger the filesystems'
> memory shrinkers to run.

That is incorrect. Freeze doesn't guarantee that shrinkers won't run. What freeze is supposed to do on XFS is ensure there aren't any dirty reclaimable XFS inodes left in memory, so that when the shrinkers are run (and they will be run on frozen filesystems) there is only clean inodes to reclaim.
Comment 20 Dave Chinner 2011-08-18 23:33:24 UTC
(In reply to comment #17)
> Created an attachment (id=69242) [details]
> A different trace on 3.1.0-rc1+ with the freezing patch applied

The call trace from the shrinker looks like this:

prune_super
  prune_icache_sb
    dispose_list
      evict
        xfs_fs_evict_inode
          truncate_inode_pages
          xfs_inactive
            xfs_free_eofblocks
              xfs_trans_alloc

This is basically showing the *VFS* inode cache being shrunk in response to the memory allocation, and that is triggering XFS to do work that dirties the filesystem (i.e. truncating away speculative preallocation beyond EOF). This requires transactions, and so the shrinker is quite rightly being blocked on the frozen fileystem.

Note that in comment #19 I said XFS reclaims it's inode caches as much as possible during a freeze - it cannot free inodes that are still active at the VFS level and require work to be able to reclaim. That is the case you are seeing here - the VFS caceh shrinker is running and XFS can't do anything about that....

Shrinker behaviour is just one of the reasons why I suggested that image allocation should run before freezing filesystems....
Comment 21 Dave Chinner 2011-08-18 23:47:13 UTC
(In reply to comment #18)
> This means that the freezing patch is not effective (i.e. the XFS shrinker
> triggers even though the filesystem has been frozen).  So most likely
> we should preallocate memory before freezing filesystems and kernel threads.

Yes, like I originally suggested in the email thread outside the bugzilla:

dgc> Effectively hibernate needs to allocate                                                                                                     
dgc> memory before it freezes kernel/filesystem worker threads:                                                                                                           
dgc>                                                                                                                                                                      
dgc>       freeze_userspace_processes()                                                                                                                                   
dgc>                                                                                                                                                                      
dgc>       // just to clean the page cache quickly                                                                                                                        
dgc>       sys_sync()                                                                                                                                                     
dgc>                                                                                                                                                                      
dgc>       // optionally to free page/inode/dentry caches:                                                                                                                
dgc>               iterate_supers(drop_pagecache_sb, NULL);                                                                                                               
dgc>               drop_slab()                                                                                                                                            
dgc>                                                                                                                                                                      
dgc>       allocate a large amount of memory                                                                                                                              
dgc>                                                                                                                                                                      
dgc>       // Now quiesce the filesystems and clean remaining metadata                                                                                                    
dgc>       iterate_supers(freeze_super, NULL);                                                                                                                            
dgc>                                                                                                                                                                      
dgc>       freeze_remaining_processes()                                                                                                                                   
dgc>                                                                                                                                                                     
dgc> This guarantees that filesystems are still working when memory                                                                                                       
dgc> reclaim comes along to free memory for the hibernate image, and that                                                                                                 
dgc> once it is allocated that filesystems will not be changed until                                                                                                      
dgc> thawed on the hibernate wakeup.

> @Dave: IIRC your position is that we need to freeze XFS during hibernation
> to prevent the data on disk from being modified after we've created the
> image.  Can you please remind me what the exact corruption scenario is in
> that case?

The filesystem can still be active while you are writing the hibernate image. There are background operations (like memory shrinkers run from kswapd) that run that issue IO or process IO completions asynchronously. If they are still actively doing stuff while you are writing the hibernate image (say background IO is still completing), then there is every chance that the work these threads are doing is modifying memory that you've already written into the hibernate image.

That means when you resume that memory image, the state of the filesystem in memory does not match what is on disk. We consider that memory/disk state incoherence to be a corrupted state, because sooner or later it will cause a real physical corruption of the filesystem structure to occur...

Frozen filesystems don't prevent background threads from doing work - they simply stop new modifications from being made. If no modifications are being made, then the in-memory state will not get out-of-sync with the on disk state while the hibernate image is being written....
Comment 22 Rafael J. Wysocki 2011-08-19 20:28:03 UTC
(In reply to comment #21)
> 
> Frozen filesystems don't prevent background threads from doing work - they
> simply stop new modifications from being made. If no modifications are being
> made, then the in-memory state will not get out-of-sync with the on disk state
> while the hibernate image is being written....

This basically means that the only way to prevent corruption is to freeze
those background threads.
Comment 23 Rafael J. Wysocki 2011-08-20 08:30:22 UTC
@Dave: Does the freezing of filesystems guarantee that the on disk state of
the filesystems will not change until the are thawed?
Comment 24 Dave Chinner 2011-08-21 23:33:36 UTC
(In reply to comment #22)
> (In reply to comment #21)
> > 
> > Frozen filesystems don't prevent background threads from doing work - they
> > simply stop new modifications from being made. If no modifications are being
> > made, then the in-memory state will not get out-of-sync with the on disk state
> > while the hibernate image is being written....
> 
> This basically means that the only way to prevent corruption is to freeze
> those background threads.

You can't freeze shrinkers - they operate in a the memory allocation context. hence the only thing that can be done is prevent the shrinker form making new modifications, as has been successfully demonstrated by the deadlock you saw.
Comment 25 Dave Chinner 2011-08-21 23:41:30 UTC
(In reply to comment #23)
> @Dave: Does the freezing of filesystems guarantee that the on disk state of
> the filesystems will not change until the are thawed?

That's the definition of a frozen filesystem - it cannot be changed on disk until it is thawed. That's exactly the guarantee needed by functionality like block level snapshots that require a consistent point-in-time image of the filesystem to work successfully.

By extension, this means filesystem objects cannot be dirtied in memory, either, while the filesytsem is frozen. IOWs, the on disk state won't change because operations that would cause new modifcations are blocked before they can be carried out. Hence after the freeze has written all the in-memory dirty state to disk, the on-disk state will not change until the filesystem is thawed.
Comment 26 Rafael J. Wysocki 2011-08-22 22:14:27 UTC
OK, thanks.

I've tried to move memory preallocation before the freezing of kernel
threads, but that causes all sorts of other problems to happen (at least
lockdep appears to be very unhappy about that).
Comment 27 Rafael J. Wysocki 2011-08-22 22:22:52 UTC
@Dave: I wonder who's supposed to complete the completion that's being
waited on forever in the trace from comment #6?
Comment 28 Dave Chinner 2011-08-22 23:02:57 UTC
(In reply to comment #27)
> @Dave: I wonder who's supposed to complete the completion that's being
> waited on forever in the trace from comment #6?

It's not the completion that is the problem - the IO can't even be submitted because the xfsbufd thread that does delayed write buffer IO submission has already been frozen. Hence the IO that the shrinker needs to have completed can't even be started. 

Part of what freezing an XFS filesystem does is make sure all the delwri buffers are written to disk. That's something that sys_sync() does *not* do because all the dirty metadata is already in the log hence safe against loss if we were to crash (that's what log replay is for). However, for a freeze they need to be written to move the filesystem into a clean, consistent state on disk that can be read without requiring log recovery because can't do log recovery when mounting read-only snapshots taken while the filesystem was frozen!

IOWs, freeze and sys_sync() behaviour differ in many ways that are not immediately obvious to people not familiar with filesystem plumbing, and this is just one of them.
Comment 29 Dave Chinner 2011-08-22 23:04:37 UTC
(In reply to comment #26)
> OK, thanks.
> 
> I've tried to move memory preallocation before the freezing of kernel
> threads, but that causes all sorts of other problems to happen (at least
> lockdep appears to be very unhappy about that).

I can't really comment without seeing the code, but I can't see why lockdep would get unhappy about freeze calls when you aren't already holding any specific locks...
Comment 30 Rafael J. Wysocki 2011-08-27 21:11:43 UTC
Created attachment 70552 [details]
PM / Hibernate: Use GFP_NOFS for memory preallocation

@Ferenc: Please check if the original issue is reproducible with this patch
applied.
Comment 31 Ferenc Wágner 2011-08-30 10:40:36 UTC
Created attachment 70862 [details]
blocked task warning on 3.1.0-rc3+ with the GFP_NOFS patch applied

The GFP_NOFS patch made a difference: there's a very low but continuous disk activity during memory preallocation. Hung task messages like this are still present, though.
Comment 32 Ferenc Wágner 2011-08-30 10:48:56 UTC
Created attachment 70872 [details]
Preallocation statistics on 3.1.0-rc3+ with the GFP_NOFS patch applied

Eventually (after more than 15 minutes) memory preallocation succeeded, but the computer failed to enter S4, as this screenshot shows. This may be an unrelated problem, since I stopped using hibernation some time ago as it basically never got this far due to the issue on hand. I rebooted and tried hibernation right then, without any memory pressure. It succeeded quickly twice in a row.
Comment 33 Ferenc Wágner 2011-08-31 12:48:14 UTC
Created attachment 70982 [details]
another blocked task warning during preallocation (GFP_NOFS)

Another round of testing resulting in a different stack trace. The state of affairs is similar: decent HDD activity for a couple of seconds, then very light continuous activity for more than 10 minutes with various blocked task warnings scrolling away occasionally.
Comment 34 Ferenc Wágner 2011-08-31 12:49:30 UTC
Created attachment 70992 [details]
some other blocked task warnings later from the same preallocation phase
Comment 35 Ferenc Wágner 2011-08-31 12:57:33 UTC
Created attachment 71002 [details]
strange pauses after the finally succeeding preallocation

After 781 seconds of preallocating, there was some progress, then a strange 32-second pause after "PM: thaw of devices complete after 1317 msecs", then an even longer wait before the computer actually switched off. So this time the machine suspended successfully at least.
Comment 36 Ferenc Wágner 2011-08-31 12:59:12 UTC
Created attachment 71012 [details]
Unfortunately, resume halted indefinitely with this on screen
Comment 37 Ferenc Wágner 2011-08-31 13:01:56 UTC
Created attachment 71022 [details]
SysRq-w showed lots of blocked tasks, but there were no automatic warnings
Comment 38 Zhang Rui 2012-01-18 05:01:56 UTC
It's great that the kernel bugzilla is back.

Can you please verify if the problem still exists in the latest upstream
kernel?
Comment 39 Ferenc Wágner 2012-01-27 16:41:11 UTC
Created attachment 72212 [details]
screenshot of new hung task warnings

As this screenshot shows, I still occasionally get indefinite hung task warnings under 3.2, but now they seem to come from another phase of hibernation, namely the thaw phase after a successful preallocation.
(There is also a bug lurking somewhere, see attachment 72194 [details].)
However, the original issue, namely preallocation being unreasonably slow didn't change at all: my computer often just sits there for minutes without any sign of activity (besides occasional hung task warnings), and eventually manages to hibernate (or not, I'm a little hazy on the details, as I'm not running 3.2 since long and I seldom wait several minutes before finally killing it off).
Comment 40 Aaron Lu 2012-12-17 06:49:07 UTC
Hello Ferenc,

Does this issue occur on latest upstream kernel?

Thanks,
Aaron

Note You need to log in before you can comment on or make changes to this bug.