Kernel Bug Tracker – Bug 33622
during hibernation, preallocating image memory can take several minutes
Last modified: 2013-02-20 06:28:56 UTC
When there isn't much free memory when hibernation is initiated, the
PM: Preallocating image memory...
phase can occasionally take more than 10 minutes, which seems excessive on this laptop with 512 MB of memory. I understand that in this case quite some stuff has to be swapped out before the snapshot can be made, but most of this time is spent with no HDD activity at all: after the first couple of seconds the HDD indicator gives only the quickest flash in every 2 or 3 seconds.
The worst is that I get no indication whatsoever about the process: at first I thought the system just froze and powered it off to safely get off the train. I've had much preferred to be told that hibernation is infeasible and suspended to RAM instead. But a time estimate could have helped as well, of course.
/sys/power/image_size contains 204791808 all the time, I don't use uswsusp or similar, pm-utils defaults to the kernel method.
Has the problem started to occur at one point or has it been always present
(as long as memory preallocation is used by the core hibernate code)?
Hard to tell, because earlier I used uswsusp, up to 2.6.36. I can't remember such delays with that. But then I was forced to upgrade my user-space due to the i915 KMS migration, I encountered several PM problems which made me abandon uswsusp for simplicity. Anyway I can't easily test older kernels now. When was memory preallocation introduced, btw? After all, I could write a simple console application which allocates memory, one doesn't need a full featured graphical browser just for this...
Created attachment 64392 [details]
screenshot montage of the lockup messages
Now, with kernel 3.0.0-rc5, hibernation seems to hang all the time. In every 2 minutes a pair of "hung task" messages appear as shown on this screenshot montage. SysRq is active at this time, but the blocked task list is far too long to photograph. The kernel seems to be waiting for some XFS stuff here...
Since the problem appears to be readily reproducible, would you be able to
check if XFS is necessary to trigger it?
Not really, sorry. I couldn't trigger it from the initramfs: the system hibernated all right with a full buffer cache (even quickly) or with a full tmpfs (after pushing most of it out to swap with continuous disk activity as it should). So the problem appears reliably after "real" use, but I couldn't yet synthesize it. To be precise, something very similar happened after I hibernated with Emacs having a large file opened: the system failed to resume with very similar looking traces. I will continue trying to get a reliable and safe reproduction.
Created attachment 65302 [details]
SYSRQ all buttons
While hunting bug #15816 I ran a lot of tests in some virtual machines.
Actually (18.104.22.168) pm-hibernate works fine except that ...
"preallocating image memory can take several minutes" (f...ing long)
The last vm I used for testing was with ext4.
My real machine always locks up (uses xfs).
This log is from testing with 22.214.171.124, xfs-root:
PM: Preallocating image memory ...
[machine does some decent I/O for some seconds but then deep mediation]
SysRq : Show Blocked State
Every task is in refrigerator except this:
pm-hibernate D 0000000000000000 0 3638 3637 0x00000000
ffff8800017bf918 0000000000000082 ffff8800017be010 ffff880000000000
ffff8800017be010 ffff88000b8a6170 0000000000013900 ffff8800017bffd8
ffff8800017bffd8 0000000000013900 ffffffff8148b020 ffff88000b8a6170
[<ffffffff810408ea>] ? try_to_wake_up+0x18c/0x18c
[<ffffffff81345527>] ? down_write+0x2d/0x31
[<ffffffffa02374da>] xfs_reclaim_inode+0x74/0x258 [xfs]
[<ffffffffa0237853>] xfs_reclaim_inodes_ag+0x195/0x264 [xfs]
[<ffffffffa0237974>] xfs_reclaim_inode_shrink+0x52/0x90 [xfs]
[<ffffffff8134ce4e>] ? apic_timer_interrupt+0xe/0x20
[<ffffffff810cc41a>] ? next_online_pgdat+0x20/0x41
[<ffffffff811a8fa2>] ? kobject_put+0x47/0x4b
kworker/0:2 D 0000000000000000 0 379 2 0x00000000
ffff88000ea99b90 0000000000000046 ffff88000b8ff840 0000000000000000
ffff88000ea98010 ffff88000c319ac0 0000000000013900 ffff88000ea99fd8
ffff88000ea99fd8 0000000000013900 ffffffff8148b020 ffff88000c319ac0
[<ffffffffa0237738>] xfs_reclaim_inodes_ag+0x7a/0x264 [xfs]
[<ffffffffa02379cc>] ? xfs_reclaim_inodes+0x1a/0x1a [xfs]
[<ffffffffa02379ca>] xfs_reclaim_inodes+0x18/0x1a [xfs]
[<ffffffffa02379e9>] xfs_reclaim_worker+0x1d/0x29 [xfs]
[<ffffffff8105be23>] ? manage_workers+0x194/0x194
[<ffffffff8105f1dd>] ? kthread_worker_fn+0x148/0x148
[<ffffffff8134d5a0>] ? gs_change+0x13/0x13
Hope this helps hunting.
Created attachment 65322 [details]
SYSRQ console dump
I ran an automated pm-hibernate mem stress test in virtualbox with 126.96.36.199, 1.5GB RAM
ext4 30x ok
Using xfs on root locked up in Round 13. I attach the four-button-log.
Talking about speed:
1 [ 115.953022] PM: Allocated 946788 kbytes in 7.33 seconds (129.16 MB/s)
2 [ 162.389327] PM: Allocated 946788 kbytes in 18.48 seconds (51.23 MB/s)
3 [ 187.594767] PM: Allocated 946764 kbytes in 0.38 seconds (2491.48 MB/s)
4 [ 217.393836] PM: Allocated 946772 kbytes in 4.66 seconds (203.16 MB/s)
5 [ 333.061986] PM: Allocated 946788 kbytes in 89.46 seconds (10.58 MB/s)
6 [ 415.016411] PM: Allocated 946788 kbytes in 47.93 seconds (19.75 MB/s)
7 [ 464.904229] PM: Allocated 946780 kbytes in 10.58 seconds (89.48 MB/s)
8 [ 495.034637] PM: Allocated 946780 kbytes in 0.23 seconds (4116.43 MB/s)
9 [ 529.224049] PM: Allocated 946788 kbytes in 5.31 seconds (178.30 MB/s)
10 [ 637.020696] PM: Allocated 946788 kbytes in 73.05 seconds (12.96 MB/s)
11 [ 670.111831] PM: Allocated 946772 kbytes in 4.51 seconds (209.92 MB/s)
12 [ 709.350145] PM: Allocated 946780 kbytes in 6.99 seconds (135.44 MB/s)
13 [ 769.712771] PM: Allocated 946780 kbytes in 28.04 seconds (33.76 MB/s)
14 [ 808.437632] PM: Allocated 946788 kbytes in 0.27 seconds (3506.62 MB/s)
15 [ 856.310100] PM: Allocated 946788 kbytes in 13.42 seconds (70.55 MB/s)
16 [ 891.479187] PM: Allocated 946772 kbytes in 2.65 seconds (357.27 MB/s)
17 [ 976.008879] PM: Allocated 946788 kbytes in 50.85 seconds (18.61 MB/s)
18 [ 1004.876786] PM: Allocated 946788 kbytes in 0.44 seconds (2151.79 MB/s)
19 [ 1086.111001] PM: Allocated 946788 kbytes in 52.01 seconds (18.20 MB/s)
20 [ 1126.903633] PM: Allocated 946788 kbytes in 5.91 seconds (160.20 MB/s)
21 [ 1166.799932] PM: Allocated 946780 kbytes in 9.63 seconds (98.31 MB/s)
22 [ 1194.316118] PM: Allocated 946772 kbytes in 0.21 seconds (4508.43 MB/s)
23 [ 1223.257803] PM: Allocated 946780 kbytes in 1.49 seconds (635.42 MB/s)
24 [ 1256.814404] PM: Allocated 946772 kbytes in 7.19 seconds (131.67 MB/s)
25 [ 1300.201709] PM: Allocated 946772 kbytes in 11.50 seconds (82.32 MB/s)
26 [ 1355.503583] PM: Allocated 946780 kbytes in 20.93 seconds (45.23 MB/s)
27 [ 1384.603171] PM: Allocated 946772 kbytes in 0.25 seconds (3787.08 MB/s)
28 [ 1420.846065] PM: Allocated 946788 kbytes in 9.99 seconds (94.77 MB/s)
29 [ 1467.444989] PM: Allocated 946788 kbytes in 16.50 seconds (57.38 MB/s)
30 [ 1519.377339] PM: Allocated 946788 kbytes in 15.16 seconds (62.45 MB/s)
31 [ 1551.678506] PM: Allocated 946788 kbytes in 0.24 seconds (3944.95 MB/s)
no lock up.
1 [ 72.535756] PM: Allocated 946788 kbytes in 0.13 seconds (7282.98 MB/s)
2 [ 100.122310] PM: Allocated 946788 kbytes in 2.78 seconds (340.57 MB/s)
3 [ 130.169771] PM: Allocated 946780 kbytes in 2.92 seconds (324.23 MB/s)
4 [ 159.460297] PM: Allocated 946780 kbytes in 1.06 seconds (893.18 MB/s)
5 [ 188.108237] PM: Allocated 946756 kbytes in 2.29 seconds (413.43 MB/s)
6 [ 214.464016] PM: Allocated 946788 kbytes in 0.50 seconds (1893.57 MB/s)
7 [ 240.820218] PM: Allocated 946788 kbytes in 0.18 seconds (5259.93 MB/s)
8 [ 264.931970] PM: Allocated 946788 kbytes in 0.19 seconds (4983.09 MB/s)
9 [ 291.652658] PM: Allocated 946788 kbytes in 1.34 seconds (706.55 MB/s)
10 [ 317.058587] PM: Allocated 946788 kbytes in 1.32 seconds (717.26 MB/s)
11 [ 372.441160] PM: Allocated 946788 kbytes in 4.42 seconds (214.20 MB/s)
12 [ 446.355721] PM: Allocated 946788 kbytes in 19.31 seconds (49.03 MB/s)
Not an XFS bug - hibernate is making broken assumptions when it comes to quiescing filesystems (actually, failing to quiesce) prior to allocating the suspend image. See here:
Well, actually, it's not making any assumptions at all and it's not
entirely clear who should be responsible for what in that respect.
We definitely need to handle filesystems during hibernation (and suspend
for that matter), but I don't really know how to do that so everyone
involved is happy.
Created attachment 67262 [details]
PM / Freezer: Freeze filesystems after user space has been frozen
chr(): Please test if this patch makes a difference.
@chr(): Please use "echo disk > /sys/power/state" for the testing, because the
patch breaks s2disk.
Also, the patch is on top of the current Linus' tree (may or may not apply to
Sorry, there seems to be a separate bug that prevents s2disk from working
correctly in the current Linus' tree.
Created attachment 67452 [details]
fs: Fix fsync() regression on block devices
This patch is necessary for s2disk to work with the current Linus' tree.
The patch from comment #13 has been merged, so the patch from comment #10
should work on top of the current Linus' tree.
Rafael, I included your patch and will test it in the coming days. However, I'm somewhat sceptical. Please forgive my ignorance, but didn't Dave recommend allocating memory after freezing user space processes but before freezing the filesystems and the rest of the processes? Your patch seems to freeze the filesystems before preallocating the image memory, or am I mistaken?
The patch freezes filesystems before preallocating memory, but that should
work too, because in that case the preallocation won't trigger the filesystems'
memory shrinkers to run.
Anyway, please test it.
Created attachment 69242 [details]
A different trace on 3.1.0-rc1+ with the freezing patch applied
The first hibernation took quite some time, with sporadic disk activity during the preallocation phase after freezing the filesystems, but was successful in the end. The second one resulted in the attached warning, and either hung or took much too long.
This means that the freezing patch is not effective (i.e. the XFS shrinker
triggers even though the filesystem has been frozen). So most likely
we should preallocate memory before freezing filesystems and kernel threads.
@Dave: IIRC your position is that we need to freeze XFS during hibernation
to prevent the data on disk from being modified after we've created the
image. Can you please remind me what the exact corruption scenario is in
@Ferenc: I'll have another patch to test in the next couple of days, stay
(In reply to comment #16)
> The patch freezes filesystems before preallocating memory, but that should
> work too, because in that case the preallocation won't trigger the filesystems'
> memory shrinkers to run.
That is incorrect. Freeze doesn't guarantee that shrinkers won't run. What freeze is supposed to do on XFS is ensure there aren't any dirty reclaimable XFS inodes left in memory, so that when the shrinkers are run (and they will be run on frozen filesystems) there is only clean inodes to reclaim.
(In reply to comment #17)
> Created an attachment (id=69242) [details]
> A different trace on 3.1.0-rc1+ with the freezing patch applied
The call trace from the shrinker looks like this:
This is basically showing the *VFS* inode cache being shrunk in response to the memory allocation, and that is triggering XFS to do work that dirties the filesystem (i.e. truncating away speculative preallocation beyond EOF). This requires transactions, and so the shrinker is quite rightly being blocked on the frozen fileystem.
Note that in comment #19 I said XFS reclaims it's inode caches as much as possible during a freeze - it cannot free inodes that are still active at the VFS level and require work to be able to reclaim. That is the case you are seeing here - the VFS caceh shrinker is running and XFS can't do anything about that....
Shrinker behaviour is just one of the reasons why I suggested that image allocation should run before freezing filesystems....
(In reply to comment #18)
> This means that the freezing patch is not effective (i.e. the XFS shrinker
> triggers even though the filesystem has been frozen). So most likely
> we should preallocate memory before freezing filesystems and kernel threads.
Yes, like I originally suggested in the email thread outside the bugzilla:
dgc> Effectively hibernate needs to allocate
dgc> memory before it freezes kernel/filesystem worker threads:
dgc> // just to clean the page cache quickly
dgc> // optionally to free page/inode/dentry caches:
dgc> iterate_supers(drop_pagecache_sb, NULL);
dgc> allocate a large amount of memory
dgc> // Now quiesce the filesystems and clean remaining metadata
dgc> iterate_supers(freeze_super, NULL);
dgc> This guarantees that filesystems are still working when memory
dgc> reclaim comes along to free memory for the hibernate image, and that
dgc> once it is allocated that filesystems will not be changed until
dgc> thawed on the hibernate wakeup.
> @Dave: IIRC your position is that we need to freeze XFS during hibernation
> to prevent the data on disk from being modified after we've created the
> image. Can you please remind me what the exact corruption scenario is in
> that case?
The filesystem can still be active while you are writing the hibernate image. There are background operations (like memory shrinkers run from kswapd) that run that issue IO or process IO completions asynchronously. If they are still actively doing stuff while you are writing the hibernate image (say background IO is still completing), then there is every chance that the work these threads are doing is modifying memory that you've already written into the hibernate image.
That means when you resume that memory image, the state of the filesystem in memory does not match what is on disk. We consider that memory/disk state incoherence to be a corrupted state, because sooner or later it will cause a real physical corruption of the filesystem structure to occur...
Frozen filesystems don't prevent background threads from doing work - they simply stop new modifications from being made. If no modifications are being made, then the in-memory state will not get out-of-sync with the on disk state while the hibernate image is being written....
(In reply to comment #21)
> Frozen filesystems don't prevent background threads from doing work - they
> simply stop new modifications from being made. If no modifications are being
> made, then the in-memory state will not get out-of-sync with the on disk state
> while the hibernate image is being written....
This basically means that the only way to prevent corruption is to freeze
those background threads.
@Dave: Does the freezing of filesystems guarantee that the on disk state of
the filesystems will not change until the are thawed?
(In reply to comment #22)
> (In reply to comment #21)
> > Frozen filesystems don't prevent background threads from doing work - they
> > simply stop new modifications from being made. If no modifications are being
> > made, then the in-memory state will not get out-of-sync with the on disk state
> > while the hibernate image is being written....
> This basically means that the only way to prevent corruption is to freeze
> those background threads.
You can't freeze shrinkers - they operate in a the memory allocation context. hence the only thing that can be done is prevent the shrinker form making new modifications, as has been successfully demonstrated by the deadlock you saw.
(In reply to comment #23)
> @Dave: Does the freezing of filesystems guarantee that the on disk state of
> the filesystems will not change until the are thawed?
That's the definition of a frozen filesystem - it cannot be changed on disk until it is thawed. That's exactly the guarantee needed by functionality like block level snapshots that require a consistent point-in-time image of the filesystem to work successfully.
By extension, this means filesystem objects cannot be dirtied in memory, either, while the filesytsem is frozen. IOWs, the on disk state won't change because operations that would cause new modifcations are blocked before they can be carried out. Hence after the freeze has written all the in-memory dirty state to disk, the on-disk state will not change until the filesystem is thawed.
I've tried to move memory preallocation before the freezing of kernel
threads, but that causes all sorts of other problems to happen (at least
lockdep appears to be very unhappy about that).
@Dave: I wonder who's supposed to complete the completion that's being
waited on forever in the trace from comment #6?
(In reply to comment #27)
> @Dave: I wonder who's supposed to complete the completion that's being
> waited on forever in the trace from comment #6?
It's not the completion that is the problem - the IO can't even be submitted because the xfsbufd thread that does delayed write buffer IO submission has already been frozen. Hence the IO that the shrinker needs to have completed can't even be started.
Part of what freezing an XFS filesystem does is make sure all the delwri buffers are written to disk. That's something that sys_sync() does *not* do because all the dirty metadata is already in the log hence safe against loss if we were to crash (that's what log replay is for). However, for a freeze they need to be written to move the filesystem into a clean, consistent state on disk that can be read without requiring log recovery because can't do log recovery when mounting read-only snapshots taken while the filesystem was frozen!
IOWs, freeze and sys_sync() behaviour differ in many ways that are not immediately obvious to people not familiar with filesystem plumbing, and this is just one of them.
(In reply to comment #26)
> OK, thanks.
> I've tried to move memory preallocation before the freezing of kernel
> threads, but that causes all sorts of other problems to happen (at least
> lockdep appears to be very unhappy about that).
I can't really comment without seeing the code, but I can't see why lockdep would get unhappy about freeze calls when you aren't already holding any specific locks...
Created attachment 70552 [details]
PM / Hibernate: Use GFP_NOFS for memory preallocation
@Ferenc: Please check if the original issue is reproducible with this patch
Created attachment 70862 [details]
blocked task warning on 3.1.0-rc3+ with the GFP_NOFS patch applied
The GFP_NOFS patch made a difference: there's a very low but continuous disk activity during memory preallocation. Hung task messages like this are still present, though.
Created attachment 70872 [details]
Preallocation statistics on 3.1.0-rc3+ with the GFP_NOFS patch applied
Eventually (after more than 15 minutes) memory preallocation succeeded, but the computer failed to enter S4, as this screenshot shows. This may be an unrelated problem, since I stopped using hibernation some time ago as it basically never got this far due to the issue on hand. I rebooted and tried hibernation right then, without any memory pressure. It succeeded quickly twice in a row.
Created attachment 70982 [details]
another blocked task warning during preallocation (GFP_NOFS)
Another round of testing resulting in a different stack trace. The state of affairs is similar: decent HDD activity for a couple of seconds, then very light continuous activity for more than 10 minutes with various blocked task warnings scrolling away occasionally.
Created attachment 70992 [details]
some other blocked task warnings later from the same preallocation phase
Created attachment 71002 [details]
strange pauses after the finally succeeding preallocation
After 781 seconds of preallocating, there was some progress, then a strange 32-second pause after "PM: thaw of devices complete after 1317 msecs", then an even longer wait before the computer actually switched off. So this time the machine suspended successfully at least.
Created attachment 71012 [details]
Unfortunately, resume halted indefinitely with this on screen
Created attachment 71022 [details]
SysRq-w showed lots of blocked tasks, but there were no automatic warnings
It's great that the kernel bugzilla is back.
Can you please verify if the problem still exists in the latest upstream
Created attachment 72212 [details]
screenshot of new hung task warnings
As this screenshot shows, I still occasionally get indefinite hung task warnings under 3.2, but now they seem to come from another phase of hibernation, namely the thaw phase after a successful preallocation.
(There is also a bug lurking somewhere, see attachment 72194 [details].)
However, the original issue, namely preallocation being unreasonably slow didn't change at all: my computer often just sits there for minutes without any sign of activity (besides occasional hung task warnings), and eventually manages to hibernate (or not, I'm a little hazy on the details, as I'm not running 3.2 since long and I seldom wait several minutes before finally killing it off).
Does this issue occur on latest upstream kernel?