Bug 26452

Summary: Image memory preallocation takes a long time - often up to 10 minutes
Product: Power Management Reporter: devsk (funtoos)
Component: Hibernation/Suspend Assignee: power-management_other
Status: CLOSED DOCUMENTED    
Severity: normal CC: lenb, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.37 vanilla
Regression: No
Bug Blocks: 7216    
Attachments: The instrumented snapshot.c

Description devsk 2011-01-10 04:26:58 UTC
Created attachment 43042 [details]
The instrumented snapshot.c

First the symptoms:

1. s2disk suspend takes a long time, sometimes up to 10 minutes depending on the uptime and activity of the system.
2. Resume from s2disk makes the desktop so sluggish that it's not usable for up to 5 minutes.

Most of the suspend time is being spent preallocating image memory. This is a 12GB RAM system with 12GB swap configured on a disk partition.

What's happening is that preallocation is aggressively throwing away pages and over-compensating. It almost always ends up allocating 10GB+ of memory for about 2GB of image (not sure why compression doesn't play a role). While preallocating, it seems to be swapping the pages out to the swap file; those very pages will need to be brought back in when the system resumes. So, essentially, suspend is saying "I am going to let demand paging manage the swapping out during suspend and swapping in during resume", whereas it should be saying "I am going to compress the pages and write them out to swap if enough swap is available, otherwise I am going to crib". One big advantage of the latter approach is that image writing is essentially a sequential operation, whereas swapping processes out on suspend, and particularly swapping them back in on resume when all of them get to run independently of each other, is a random-seek-heavy operation, moving the disk heads from one location to the next and bringing the bandwidth and the speed of the system to a halt.

So, just to give you some debug help, I instrumented the code to print various values at various places during preallocation. The instrumented file is attached. The results are as follows:

Jan  9 19:27:19 localhost kernel: [14005.735266] Syncing filesystems ... done.
Jan  9 19:27:19 localhost kernel: [14005.785266] Freezing user space processes ... (elapsed 0.02 seconds) done.
Jan  9 19:27:19 localhost kernel: [14005.811932] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
Jan  9 19:27:19 localhost kernel: [14005.825266] PM: Preallocating image memory...
Jan  9 19:27:19 localhost kernel: [14005.921932] PM: save_hm 0 saveable 2535602
Jan  9 19:27:19 localhost kernel: [14005.921932] PM: count 3143227 size 224 high_mem 0 totalreserve_pages 25786
Jan  9 19:27:19 localhost kernel: [14005.921932] PM: count 3117441 size 465096 max_size 1557584 image_size 1905032704
Jan  9 19:27:19 localhost kernel: [14005.921932] PM: min image pages 517623 avail_normal 3143227
Jan  9 19:27:19 localhost kernel: [14013.871931] PM: after preall of high mem pages_highmem 0 alloc 1559857
Jan  9 19:27:19 localhost kernel: [14014.905265] PM: after preall of image mem, pages 1559857 avail_normal 2625604
Jan  9 19:27:30 localhost kernel: [14014.905265] PM: else, before, preall of high mem, pages 1559857 avail_normal 2625604 alloc 1039961 count 3117441 size 517623
Jan  9 19:27:30 localhost kernel: [14134.528586] PM: else, before, preall of high mem, pages 2599818 avail_normal 2625604 alloc 1039961 count 3117441 size 1039961
Jan  9 19:27:30 localhost kernel: [14134.591919] PM: before, data_pages 521562
Jan  9 19:27:30 localhost kernel: [14134.591919] PM: after, save 0
Jan  9 19:27:30 localhost kernel: [14134.591919] PM: before, data_pages+high 0
Jan  9 19:27:30 localhost kernel: [14134.878586] PM: after free, data_pages 521888
Jan  9 19:27:30 localhost kernel: [14134.878586] PM: after free, highmem 0done (allocated 2599818 pages)
Jan  9 19:27:30 localhost kernel: [14134.878586] PM: Allocated 10399272 kbytes in 129.05 seconds (80.58 MB/s)

Is it possible to provide a tunable so that the preallocation is not as aggressive as it is now?
Comment 1 devsk 2011-01-10 04:56:03 UTC
Needless to say, when the system resumes, most of my system is on swap: zero swap space used before suspend and, typically, 5-6GB of swap used after resume.
Comment 2 Rafael J. Wysocki 2011-01-10 19:56:48 UTC
It seems that the preallocation algorithm doesn't work correctly on your machine.

What kind of a system is it?  Is the kernel 64-bit?
Comment 3 devsk 2011-01-10 23:21:35 UTC
I am not sure what "what kind of a system" means, so I will list as much as I think is relevant. Yes, the kernel and userland are 64-bit.

The system is an X58 motherboard with an Intel i7 920 CPU. It has 12GB of RAM, and the swap is on a separate drive in a partition of size 12GB. The root is on a different disk.
Comment 4 Rafael J. Wysocki 2011-01-12 22:46:14 UTC
The problem is that the value of image_size is way too small for your system.

Does your user space write to /sys/power/image_size at any point?
Comment 5 devsk 2011-01-12 23:55:03 UTC
I am using the pm-utils and s2disk combo to hibernate. So, if either of them is writing to /sys/power/image_size, then yes. But I am not explicitly writing to it.

There is a "image size" parameter in /etc/suspend.conf, which I have tried to set to various values but none of those make any difference. The current value set in that file is

image size = 6200000000

Other parameters in that file:

snapshot device = /dev/snapshot
resume device = /dev/sdb1
image size = 6200000000
suspend loglevel = 10
compute checksum = y
compress = y
#encrypt = y
early writeout = y
#splash = y

I have also tried 0 and 10GB as values.

Should I try writing directly to /sys/power/image_size before I suspend?
Comment 6 Rafael J. Wysocki 2011-01-13 00:12:34 UTC
Can you simply comment out the "image size = 6200000000" line in
/etc/suspend.conf and post the results from your annotated snapshot.c
after that (and after a fresh boot), please?
Comment 7 devsk 2011-01-13 00:23:30 UTC
I decided to swapoff the swap and then try echo'ing the size. I made the following observations:

# free
             total       used       free     shared    buffers     cached
Mem:      12299004    4053276    8245728          0      61636     550848
-/+ buffers/cache:    3440792    8858212
Swap:     12697596    4271456    8426140

# time swapoff /dev/disk/by-label/swap 

real    19m31.980s
user    0m0.000s
sys     18m12.552s

# free
             total       used       free     shared    buffers     cached
Mem:      12299004    7272088    5026916          0      63880    1979264
-/+ buffers/cache:    5228944    7070060
Swap:            0          0          0

1. It took swapoff 19.5 minutes to finish.
2. The CPU was pegged to 100% (one full core used by swapoff).

This is exactly what I meant in the original post. Depending on the VM to allocate the pages for the image slows the hell out of the suspend process. For obvious reasons, swap IO is random-seek heavy.

Anyway, I will now try to suspend with the image size set to the amount of memory in use right now. Swap is fully free, so writing out 7GB of data (which should compress to an even smaller size) to a 12GB swap device should be no issue and should be IO-bound only for the image writing.

Let's see how it goes.
Comment 8 devsk 2011-01-13 00:25:26 UTC
(In reply to comment #6)
> Can you simply comment out the "image size = 6200000000" line in
> /etc/suspend.conf and post the results from your annotated snapshot.c
> after that (and after a fresh boot), please?

Yes, I have tried that as well, with the same result. A reboot will of course speed things up, because there is very little to write to the image (about 400-500MB of desktop processes).

But let me try this and we will see how it goes.
Comment 9 devsk 2011-01-13 00:42:38 UTC
Ok, I am back after resume. And I think there is a big gotcha somewhere. The image size is not honored. This is what I did:

1. Remove the image size setting from /etc/suspend.conf.

2. Set the image size:

# echo "7400000000" >  /sys/power/image_size

# cat /sys/power/image_size
7400000000

That's about 7GB.

3. Hibernate with pm-hibernate:

# pm-hibernate

BEFORE hibernate:

# free
             total       used       free     shared    buffers     cached
Mem:      12299004    7272088    5026916          0      63880    1979264
-/+ buffers/cache:    5228944    7070060
Swap:            0          0          0

Even if it doesn't compress much, all of my used memory should fit in the image. Now, someone please explain to me why the image size was reduced to 512MB. And of course, it evicted all that used memory and led to slowness.

AFTER hibernate:

# free
             total       used       free     shared    buffers     cached
Mem:      12299004    3229676    9069328          0      17472     362940
-/+ buffers/cache:    2849264    9449740
Swap:     12697596    5450756    7246840

The kernel has all the space in the world to write the image to the swap device, but it decides to evict pages to swap first (using random IO) instead of just writing the image to the swap device (sequential IO) and reading it back on resume. Refer to my previous comment to see how bad swap-bound random IO is.

Here is the debug:

Jan 12 16:32:28 localhost kernel: [229792.840354] PM: Preallocating image memory...
Jan 12 16:32:28 localhost kernel: [229792.943687] PM: save_hm 0 saveable 1916915
Jan 12 16:32:28 localhost kernel: [229792.943687] PM: count 3143266 size 224 high_mem 0 totalreserve_pages 25786
Jan 12 16:32:28 localhost kernel: [229792.943687] PM: count 3117480 size 128000 max_size 1557604 image_size 524288000
Jan 12 16:32:28 localhost kernel: [229792.943687] PM: min image pages 183938 avail_normal 3143266
Jan 12 16:32:28 localhost kernel: [229816.523685] PM: after preall of high mem pages_highmem 0 alloc 1559876
Jan 12 16:32:28 localhost kernel: [229816.673684] PM: after preall of image mem, pages 1559876 avail_normal 2959328
Jan 12 16:32:33 localhost kernel: [229816.673684] PM: else, before, preall of high mem, pages 1559876 avail_normal 2959328 alloc 1373666 count 3117480 size 183938
Jan 12 16:32:33 localhost kernel: [229984.190334] PM: else, before, preall of high mem, pages 2933542 avail_normal 2959328 alloc 1373666 count 3117480 size 1373666
Jan 12 16:32:33 localhost kernel: [229984.247001] PM: before, data_pages 187946
Jan 12 16:32:33 localhost kernel: [229984.247001] PM: after, save 0
Jan 12 16:32:33 localhost kernel: [229984.247001] PM: before, data_pages+high 0
Jan 12 16:32:33 localhost kernel: [229984.593668] PM: after free, data_pages 188244
Jan 12 16:32:33 localhost kernel: [229984.593668] PM: after free, highmem 0done (allocated 2933542 pages)
Jan 12 16:32:33 localhost kernel: [229984.593668] PM: Allocated 11734168 kbytes in 191.75 seconds (61.19 MB/s)
Comment 10 devsk 2011-01-13 00:47:25 UTC
It allocated 11GB of memory for writing what it thinks is a 512MB image. I don't know what's going on.

What about compression? When does that happen? And how is it accounted for?

But before that, we need to explain all that's happening here.
Comment 11 Rafael J. Wysocki 2011-01-13 19:01:09 UTC
Compression only happens when the image itself is being saved, which doesn't
help you at all.

We have to figure out what happens to your image_size first.

Can you do a fresh boot and then try to run a hibernate test with:

# echo core > /sys/power/pm_test
# cat /sys/power/image_size
# echo disk > /sys/power/state

and post the output of your annotated snapshot.c after that (please also post
the number returned by the second instruction).
Comment 12 devsk 2011-01-13 21:35:30 UTC
The gpm process hung and got into 'D' state during this test. Basically, this test code path differs from a normal real hibernation. I have never seen gpm go into 'D' state before. Unfortunately, that means the X desktop is gone too. I was able to ssh in and extract the info:

[  167.819983] PM: Preallocating image memory...
[  167.923316] PM: save_hm 0 saveable 1435189
[  167.923316] PM: count 3143287 size 224 high_mem 0 totalreserve_pages 25786
[  167.923316] PM: count 3117501 size 1229496 max_size 1557614 image_size 5036015616
[  167.923316] PM: min image pages 171027 avail_normal 3143287
[  167.973316] PM: after preall of high mem pages_highmem 0 alloc 1559887
[  168.109983] PM: after preall of image mem, pages 1559887 avail_normal 2972260
[  168.109983] PM: else, before, preall of high mem, pages 1559887 avail_normal 2972260 alloc 328118 count 3117501 size 1229496
[  168.143316] PM: else, before, preall of high mem, pages 1888005 avail_normal 2972260 alloc 328118 count 3117501 size 328118
[  168.213316] PM: before, data_pages 1227234
[  168.213316] PM: after, save 0
[  168.213316] PM: before, data_pages+high 0
[  168.349983] PM: after free, data_pages 1227395
[  168.349983] PM: after free, highmem 0done (allocated 1888005 pages)
[  168.349983] PM: Allocated 7552020 kbytes in 0.53 seconds (14249.09 MB/s)

The image size echoed by the second instruction was the same as above (5036015616).

PS: what's with bugzilla becoming inaccessible for long periods?
Comment 13 Rafael J. Wysocki 2011-01-13 22:25:56 UTC
OK, so next time please switch to a text console before the test.

As seen from the log, the value of image_size (5036015616) is much saner now.

Please try to do:

# echo core > /sys/power/pm_test
# s2disk

and post the analogous data.

> PS: what's with bugzilla becoming inaccessible for long times.

That's kernel.org having infrastructure problems.
Comment 14 devsk 2011-01-14 00:02:06 UTC
We knew that image_size would be sane on a fresh boot (because the problem doesn't happen on a fresh boot). What's the point of doing the experiment when we know it won't fail?

It fails when the system has been up for some time and a lot of memory has been used and reused.
Comment 15 Rafael J. Wysocki 2011-01-14 00:15:44 UTC
This isn't the problem.  The problem is that in all of your previous runs the
value of image_size was _much_ lower than that in comment #12, and there has
to be a reason for that.

Moreover, image_size _does_ _not_ depend on how much memory is used.  It only
tells the hibernate code how much memory to preallocate.  The rule is pretty
much to preallocate (total RAM in pages - image_size / PAGE_SIZE) pages.  So,
if image_size is too low, you end up preallocating so much memory that the
system becomes unusable afterwards, and that is what happens to you, doesn't it?
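
To make this concrete, here is a back-of-the-envelope version of the rule
(an illustrative sketch, assuming 4 KiB pages; the numbers come from the
comment #9 log, and this is not actual kernel code):

/* Sketch of the preallocation rule above, assuming 4 KiB pages.
 * Numbers from the comment #9 log: image_size 524288000 bytes,
 * count 3117480 pages. Not kernel code. */
#include <stdio.h>

int main(void)
{
        unsigned long long image_size = 524288000ULL; /* bytes, from the log */
        unsigned long page_size = 4096;
        unsigned long count = 3117480;                /* pages, from the log */

        unsigned long keep = image_size / page_size;  /* pages left in RAM */
        unsigned long prealloc = count - keep;        /* pages taken away  */

        /* Prints 2989480 pages (~11677 MB), the same ballpark as the
         * 11734168 kbytes the log reports as actually allocated. */
        printf("preallocate %lu pages (~%llu MB)\n",
               prealloc, (unsigned long long)prealloc * page_size >> 20);
        return 0;
}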

So _please_ run the test I asked for, because I want to know what's fiddling
with image_size on your system.  Specifically, I'd like to rule out s2disk.

BTW, the problem you reported is _not_ a kernel issue.  It's an issue with
the configuration of your user space, so I could simply close it as invalid
right now.  Still, I'd like to help you fix the problem, OK?
Comment 16 devsk 2011-01-14 01:51:51 UTC
OK, I got what you are trying to do here: you are comparing the image_size from a plain 'echo disk' vs. s2disk. I will reboot and do the test for you.

BTW, here is the user space algorithm. Since you are helpful, may be you can spot a bug here:

        avail_swap = check_free_swap(snapshot_fd);
        if (avail_swap > pref_image_size)
                image_size = pref_image_size;
        else
                image_size = avail_swap;

        if (set_image_size(snapshot_fd, image_size)) {
                error = errno;
                break;
        }

static int set_image_size(int dev, unsigned int size)
{
        int error;

        error = ioctl(dev, SNAPSHOT_PREF_IMAGE_SIZE, size);
        if (error && errno == ENOTTY) {
                report_unsupported_ioctl("SNAPSHOT_PREF_IMAGE_SIZE");
                error = ioctl(dev, SNAPSHOT_SET_IMAGE_SIZE, size);
        }
        return error;
}

Pretty simple, really. But I will come back with the result of the test you asked for.
Comment 17 devsk 2011-01-14 03:36:27 UTC
Same thing with a text console as well. This time I had removed gpm before doing s2disk, so udisks-daemon decided to go into 'D' state instead. I could not get back into X by VT switching on the fake resume (real resume works, though; there is a call or two missing in the test code path). Basically, there is a bug to be caught here.

But apart from that, the news is that on a fresh boot, s2disk sets the image size to 512MB. It looks like suspend.c statically sets it to 512MB if the option is not set in /etc/suspend.conf. So that we could have known from the word go... :( Because, as I surprisingly find out, you wrote that code too... :-)

So, not setting the image size in /etc/suspend.conf will always kill performance, because s2disk will set the image size to 512MB and the kernel will try to preallocate ~11GB of memory. So, you knew this already!

[  337.949966] PM: Preallocating image memory... 
[  338.056632] PM: save_hm 0 saveable 1421198
[  338.056632] PM: count 3143419 size 224 high_mem 0 totalreserve_pages 25786
[  338.056632] PM: count 3117633 size 128000 max_size 1557680 image_size 524288000
[  338.056632] PM: min image pages 172169 avail_normal 3143419
[  347.493298] PM: after preall of high mem pages_highmem 0 alloc 1559953
[  347.629965] PM: after preall of image mem, pages 1559953 avail_normal 2971250
[  347.629965] PM: else, before, preall of high mem, pages 1559953 avail_normal 2971250 alloc 1385511 count 3117633 size 172169
[  451.793288] PM: else, before, preall of high mem, pages 2945464 avail_normal 2971250 alloc 1385511 count 3117633 size 1385511
[  451.849954] PM: before, data_pages 176035
[  451.849954] PM: after, save 0
[  451.849954] PM: before, data_pages+high 0
[  452.193288] PM: after free, data_pages 176389
[  452.193288] PM: after free, highmem 0done (allocated 2945464 pages)
[  452.193288] PM: Allocated 11781856 kbytes in 114.24 seconds (103.13 MB/s)
Comment 18 devsk 2011-01-14 05:03:57 UTC
OK, I set the image size in /etc/suspend.conf to 6GB, and the image size in the kernel is sane (1.9GB), just like above where I echo disk instead of using s2disk. So, we haven't proven anything.
Comment 19 devsk 2011-01-14 05:34:37 UTC
OK, found the bug. Luckily it's in user space.

(gdb) p (unsigned int)6200000000
$1 = 1905032704
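
A minimal standalone program reproduces the truncation (illustrative sketch;
the variable name follows the snippet quoted in comment #16, this is not
s2disk source):

/* 6200000000 does not fit in 32 bits, so casting it to unsigned int
 * keeps only the low 32 bits: 6200000000 - 2^32 = 1905032704, which is
 * exactly the image_size printed in the first log of this report. */
#include <stdio.h>

int main(void)
{
        unsigned long long pref_image_size = 6200000000ULL; /* suspend.conf */
        unsigned int truncated = (unsigned int)pref_image_size;

        printf("%llu -> %u\n", pref_image_size, truncated); /* 1905032704 */
        return 0;
}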
Comment 20 devsk 2011-01-14 05:44:19 UTC
But it doesn't explain why I see the large preallocation when I explicitly set the image size to 0. It only explains why larger values get truncated to smaller values. When set to 0, user space just passes 0 to the kernel, and the kernel should handle it well.
Comment 21 devsk 2011-01-14 06:25:05 UTC
So, after the unsigned int fix in user space, here is the scenario:

1. If I set the image size to 0, the kernel preallocates ~12GB and goes to lunch.
2. If I set the image size to a large number like 8GB, things work sort of fine, although it still wants to swap before writing the image.
3. If I don't set the image size, it is hard-coded to 512MB by s2disk and the kernel preallocates ~11GB and goes to lunch.

Looking at it further, it seems the following logic limits the image size to half of RAM at best, so it will preallocate at least 6GB, no matter what:

        /* Compute the maximum number of saveable pages to leave in memory. */
        max_size = (count - (size + PAGES_FOR_IO)) / 2 - 2 * SPARE_PAGES;

'size' here is puny. count is all the pages (12GB).

        size = DIV_ROUND_UP(image_size, PAGE_SIZE);

'size' here is 10GB (2.7m pages).

        if (size > max_size)
                size = max_size;

'size' gets limited to around 6GB.
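
Plugging the numbers from the first log into this expression reproduces
max_size exactly (a sanity check; PAGES_FOR_IO = 1024 and SPARE_PAGES = 256
are assumed values for 2.6.37 with 4 KiB pages, not taken from the log):

/* Sanity check of the max_size cap, using count 3117441 and size 224
 * from the first log. PAGES_FOR_IO (1024) and SPARE_PAGES (256) are
 * assumptions for 2.6.37 with 4 KiB pages. Not kernel code. */
#include <stdio.h>

#define PAGES_FOR_IO 1024
#define SPARE_PAGES   256

int main(void)
{
        unsigned long count = 3117441; /* saveable + free pages */
        unsigned long size = 224;      /* pages of image metadata */

        unsigned long max_size = (count - (size + PAGES_FOR_IO)) / 2
                                 - 2 * SPARE_PAGES;

        /* Prints 1557584 pages (~6084 MB) -- the max_size in the log,
         * just under half of the 12GB of RAM. */
        printf("max_size = %lu pages (~%lu MB)\n", max_size, max_size / 256);
        return 0;
}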

preallocation logic looks like crap to me.

What's worse, it throws these pages onto the swap device instead of just throwing them away (it's mostly page cache I created with 'cat /largefile > /dev/null'; I don't even have X running because I am trying to test without a real suspend). swappiness is 1, so I don't know why it still loves to swap instead of tossing pages.
Comment 22 Rafael J. Wysocki 2011-01-14 21:15:39 UTC
(In reply to comment #19)
> OK, found the bug. Luckily it's in user space.
> 
> (gdb) p (unsigned int)6200000000
> $1 = 1905032704

Yes, I was afraid that this was the problem.
Comment 23 Rafael J. Wysocki 2011-01-14 21:25:09 UTC
(In reply to comment #21)
> So, after the unsigned int fix in user space, here is the scenario:
> 
> 1. If I set the image size to 0, the kernel preallocates ~12GB and goes to
> lunch.
> 2. If I set the image size to a large number like 8GB, things work sort of
> fine, although it still wants to swap before writing the image.

That's because the image has to fit into memory.

> 3. If I don't set the image size, it is hard-coded to 512MB by s2disk and
> the kernel preallocates ~11GB and goes to lunch.
> 
> Looking at it further, it seems the following logic limits the image size
> to half of RAM at best, so it will preallocate at least 6GB, no matter what:
> 
>         /* Compute the maximum number of saveable pages to leave in memory. */
>         max_size = (count - (size + PAGES_FOR_IO)) / 2 - 2 * SPARE_PAGES;
> 
> 'size' here is puny. count is all the pages (12GB).
> 
>         size = DIV_ROUND_UP(image_size, PAGE_SIZE);
> 
> 'size' here is 10GB (2.7m pages).
> 
>         if (size > max_size)
>                 size = max_size;
> 
> 'size' gets limited to around 6GB.
> 
> preallocation logic looks like crap to me.

Well, not really.  The maximum image size is 50% of RAM, because the image
is created by making an (atomic) copy of all the pages in use at that time,
and both the image and its source have to be present in RAM at the same
time.
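
Schematically, each saveable page kept in memory needs a second,
preallocated page to hold its atomic copy, so:

        image_pages + image_pages <= total_pages  =>  image_pages <= total_pages / 2

which on a 12GB machine is the ~1.55M-page (~6GB) max_size visible in the
logs above.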

This actually is a well known limitation of our current hibernation subsystem
and I kind of wonder why you haven't heard of it already.  To overcome it,
we'd need to make some serious modifications to the kernel's MM subsystem,
which is a fragile area.  It certainly would be possible, but the resulting
code would be much more complicated than what we have at the moment.

> What's worse, it throws these pages on swap device instead of just throwing
> them away (its mostly page cache I created with 'cat /largefile > /dev/null',
> don't even have the X running because I am trying to test without real
> suspend). swappiness is 1, so I don't know why it still loves to swap instead
> of tossing pages.

That's how the MM subsystem of the kernel works.  It can probably be
improved, but that's something to be discussed with the memory management
developers.

I'll need to fix s2disk, but apart from this everything works as expected.