Bug 97201

Summary: s2disk may fail when NR_FILE_MAPPED is high
Product: Power Management Reporter: Rainer Fiebig (jrf)
Component: Hibernation/Suspend    Assignee: Chen Yu (yu.c.chen)
Status: RESOLVED CODE_FIX    
Severity: normal CC: aaron.lu, lenb, manuelkrause, rjw, yu.c.chen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.0 and earlier and later Subsystem:
Regression: No Bisected commit-id:
Attachments: Tracing-Log
Proposed fix for failure of s2disk in case of high NR_FILE_MAPPED

Description Rainer Fiebig 2015-04-24 10:41:46 UTC
If NR_FILE_MAPPED is high (e.g. when using VirtualBox), s2disk/s2both may fail.

The characteristics are similar to Bug 47931 and have been discussed there at length (see Comment 50 and following).

For a possible starting point to a solution see my Comment 72.
Comment 1 Rainer Fiebig 2015-10-06 17:10:14 UTC
Just for the record: 

My solution (i.e. excluding NR_FILE_MAPPED from the calculation of "size" in minimum_image_size()) has been working flawlessly for many months now, even in high-load cases like the following where Mapped was almost 4 GB:

> free
             total       used       free     shared    buffers     cached
Mem:      11997692   11272020     725672     317572    1404316    1673400
-/+ buffers/cache:    8194304    3803388
Swap:     10713084     780376    9932708

The impeccable reliability that the modification brought to s2disk (and the problems without it) is proof enough for me that the current way of calculating minimum_image_size is wrong - or at least suboptimal:

static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size;

	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON)
		+ global_page_state(NR_ACTIVE_FILE)
		+ global_page_state(NR_INACTIVE_FILE)
		- global_page_state(NR_FILE_MAPPED);

	return saveable <= size ? 0 : saveable - size;
}

According to the function-comment, size is the "number of pages that can be freed in theory".

So the reasoning for subtracting NR_FILE_MAPPED seems to be that there is a certain amount of non-freeable pages in the other summands that must be weeded out this way. Otherwise subtracting NR_FILE_MAPPED wouldn't make sense.

As long as NR_FILE_MAPPED is relatively small, subtracting it doesn't have much impact anyway. But it becomes a problem and leads to unnecessary failure if NR_FILE_MAPPED is high - like when using VirtualBox.

Is the bug relevant? Let's assume there are 1 million people using Linux as their desktop OS, 10 % of them use s2disk, and of those 10 % use an app that causes a high NR_FILE_MAPPED.

That would be 10,000 users having a suboptimal experience with s2disk in Linux.


Jay
Comment 2 Rainer Fiebig 2015-12-30 11:17:01 UTC
Over a year now of reliable s2disk/s2both for me.

The following values illustrate the problem with the original function. 
They were taken shortly after resuming from hibernation.
It was just a normal use-case, only a few apps open plus a 2-GB-VirtualBox-VM. 
But as you can see "size" would be negative and s2disk would probably fail:

			kb			Pages
nr_slab_reclaimable	 68.368			 17.092
nr_active_anon		904.904			226.226
nr_inactive_anon	436.112			109.028
nr_active_file		351.320			 87.830
nr_inactive_file	163.340			 40.835

freeable:	      1.924.044			481.011

nr_mapped	      2.724.140			681.035

freeable – mapped:     -800.096		       -200.024

--------------------------------------------------------

~> cat /proc/meminfo
MemTotal:       11998028 kB
MemFree:         7592344 kB
MemAvailable:    7972260 kB
Buffers:          229960 kB
Cached:           730140 kB
SwapCached:       133868 kB
Active:          1256224 kB
Inactive:         599452 kB
Active(anon):     904904 kB
Inactive(anon):   436112 kB
Active(file):     351320 kB
Inactive(file):   163340 kB
Unevictable:          60 kB
Mlocked:              60 kB
SwapTotal:      10713084 kB
SwapFree:        9850232 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        847876 kB
Mapped:          2724140 kB
Shmem:            445440 kB
Slab:             129984 kB
SReclaimable:      68368 kB
SUnreclaim:        61616 kB
KernelStack:        8128 kB
PageTables:        53692 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    16712096 kB
Committed_AS:    6735376 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      578084 kB
VmallocChunk:   34359117432 kB
HardwareCorrupted:     0 kB
AnonHugePages:    276480 kB
HugePages_Total:       0
HugePages_Free:        0
Comment 3 Chen Yu 2016-01-04 11:32:56 UTC
Hi Jay,
I looked at your proposal in Bug 47931. If I understand correctly, you suggested removing NR_FILE_MAPPED from minimum_image_size(), and relying on shrink_all_memory(saveable) to reclaim as much page cache as possible in order to get rid of it?
shrink_all_memory() might not touch dirty page cache; it only asks the flusher threads to write that data to disk. So after shrink_all_memory() has finished, there might still be many NR_FILE_MAPPED pages in the system, which is why I think we should invoke sys_sync() first.
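
In code terms, the ordering I have in mind is roughly this (just a sketch, not a tested patch):

	/* sketch only: flush dirty page cache first, then reclaim */
	sys_sync();
	shrink_all_memory(saveable - size);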

Yu
Comment 4 Rainer Fiebig 2016-01-04 16:45:29 UTC
(In reply to Chen Yu from comment #3)
> Hi Jay,
> I looked at your proposal at Bug 47931, if I understand correctly, you
> suggested removing the NR_FILE_MAPPED in minimum_image_size, and in order to
> get rid of page cache, shrink_all_memory(saveable) try to reclaim as much as
> possible page cache?
> shrink_all_memory might not touch dirty page caches and it only asks flusher
> thread to write these datas to disk, so after the shrink_all_memory 
> finished, there might be still many NR_FILE_MAPPED  in the system, so I
> think we should firstly invoke sys_sync IMO.
> 
> Yu

Hi and thanks for taking a look!

I found out that excluding NR_FILE_MAPPED in minimum_image_size() is all that's necessary for fixing this bug.
So I left shrink_all_memory(saveable - size) in hibernate_preallocate_memory() unchanged.

In essence, my reasoning for removing NR_FILE_MAPPED from minimum_image_size() goes like this: 
- "saveable" is all that's in memory at that time
- subtracting those items from saveable that can be freed, yields the minimum image size
- but NR_FILE_MAPPED is part of that fraction of saveable that can *not* be freed (according to what I'm observing here)
- so subtracting it from those that can be freed leads to wrong results and unnecessary failure

The assumption is of course that NR_SLAB_RECLAIMABLE etc. are truly free-able.

If I understand you correctly, you are suggesting doing a sys_sync() before shrink_all_memory(saveable - size)? Would that be for added stability or as a way to reduce NR_FILE_MAPPED?

Jay
Comment 5 Chen Yu 2016-01-06 09:36:22 UTC
Hi Jay,
(In reply to Jay from comment #4)
> In essence, my reasoning for removing NR_FILE_MAPPED from
> minimum_image_size() goes like this: 
> - "saveable" is all that's in memory at that time
Got
> - subtracting those items from saveable that can be freed, yields the
> minimum image size
Got
> - but NR_FILE_MAPPED is part of that fraction of saveable that can *not* be
> freed (according to what I'm observing here)
Got
> - so subtracting it from those that can be freed leads to wrong results and
> unnecessary failure
Mapped may have a 'common set' (overlap) with Active(file) and Inactive(file),
so the original code wants to exclude it from the reclaimable total.
In most cases, Mapped should be a subset of the latter.
However, if more than one task has mapped the same file, the Mapped count will grow bigger and bigger. In your case, I guess more than one task has executed code like the following:

fd = open("my_file.dat");
addr = mmap(fd, offset, size);
memcpy(dst, addr, size);
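
For illustration, a self-contained user-space sketch of that pattern ("my_file.dat" is a placeholder; while the mapping is touched and kept alive, Mapped in /proc/meminfo grows by roughly the file size):

/* sketch: map a file and fault every page in, which raises NR_FILE_MAPPED */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "my_file.dat";
	int fd = open(path, O_RDONLY);
	struct stat st;
	char *addr;
	size_t i;
	volatile char sum = 0;

	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(path);
		return 1;
	}
	addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (i = 0; i < (size_t)st.st_size; i += 4096)
		sum += addr[i];	/* each fault goes through page_add_file_rmap() */

	pause();		/* keep the mapping alive so Mapped stays high */
	munmap(addr, st.st_size);
	close(fd);
	return sum;
}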


How about something like this:

static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size, nr_map;

	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON)
		+ global_page_state(NR_ACTIVE_FILE)
		+ global_page_state(NR_INACTIVE_FILE);
	nr_map = global_page_state(NR_FILE_MAPPED);
	size = size > nr_map ? (size - nr_map) : 0;

	return saveable <= size ? 0 : saveable - size;
}
Comment 6 Rainer Fiebig 2016-01-06 23:39:12 UTC
(In reply to Chen Yu from comment #5)
> Hi Jay,
> (In reply to Jay from comment #4)
> > In essence, my reasoning for removing NR_FILE_MAPPED from
> > minimum_image_size() goes like this: 
> > - "saveable" is all that's in memory at that time
> Got
> > - subtracting those items from saveable that can be freed, yields the
> > minimum image size
> Got
> > - but NR_FILE_MAPPED is part of that fraction of saveable that can *not* be
> > freed (according to what I'm observing here)
> Got
> > - so subtracting it from those that can be freed leads to wrong results and
> > unnecessary failure
> Mapped might has 'common set' with Active(file) and Inactive(file),
> so the original code wants to exclude it from reclaimable.
> In most cases, Mapped might be a subset of the latters.
> However, if more than one tasks have mapped the same file, the count of
> Mapped will grow bigger and bigger. In your case, I guess there are more
> than one tasks have executed the following code:
> 
> fd = open("my_file.dat");
> addr = mmap(fd, offset, size);
> memcpy(dst, addr, size);
> 

Thanks, that's really helpful information. Maybe just excluding nr_mapped leads to a too optimistic value for minimum_image_size - although it works nicely here.

> 
> How about the something like this:
> 
> static unsigned long minimum_image_size(unsigned long saveable)
> {
>       unsigned long size, nr_map;
> 
>       size = global_page_state(NR_SLAB_RECLAIMABLE)
>               + global_page_state(NR_ACTIVE_ANON)
>               + global_page_state(NR_INACTIVE_ANON)
>               + global_page_state(NR_ACTIVE_FILE)
>               + global_page_state(NR_INACTIVE_FILE);
>       nr_map = global_page_state(NR_FILE_MAPPED);
>       size = size > nr_map ? (size - nr_map) : 0;
> 
>       return saveable <= size ? 0 : saveable - size;
> }

That's a good way to prevent the worst case and it would definitely increase reliability. At least a safeguard like this should be implemented.

But I'm not completely happy yet, because s2disk would still fail in higher-load cases - which would feel like a step back to me.

Your info led me to think about excluding not only nr_mapped but also nr_active_file and nr_inactive_file. "Size" would always be positive, so no safeguard necessary. 
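
Roughly what I have in mind (sketch only):

	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON);
	/* nothing is subtracted, so "size" cannot go "negative" */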

A first look at the data shows me that it would work more often than the original + safeguard. But I'm not yet sure. 
What do you think?
Comment 7 Rainer Fiebig 2016-01-07 22:34:57 UTC
Hi Yu,

for reasons mentioned above, I've been thinking about an alternative to your suggestion.

Taking your info on nr_mapped etc. into consideration, I found the following one which has the advantage that it also works in higher-load-cases - albeit then with a too optimistic estimate for minimum_image_size. So here it is:
	

static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size, sum_1, sum_2, nr_mapped;
	
	nr_mapped = global_page_state(NR_FILE_MAPPED);

	/* no common set with NR_FILE_MAPPED */
	sum_1 = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON);
	
	/* possible common set with NR_FILE_MAPPED */
	sum_2 = + global_page_state(NR_ACTIVE_FILE)
		+ global_page_state(NR_INACTIVE_FILE);
	
	if (nr_mapped > sum_2)	/* NR_FILE_MAPPED bigger than common set */
		size = sum_1	/* no point in subtracting it */
		     + sum_2;		
	else
		size = sum_1
		     + sum_2
                     - nr_mapped;
		
	return saveable <= size ? 0 : saveable - size;
}


In case nr_mapped is relatively low, nothing changes - it is subtracted like in the original code.
If nr_mapped is high, we would now prefer the "risk" of a too optimistic estimate for minimum_image_size over an unwarranted failure.
Like in your suggestion, "size" cannot become negative anymore, so we have a safeguard, too.


For VirtualBox there is a close correlation between the size of the VMs and nr_mapped: a 1-GB-VM leads to nr_mapped of 1 GB plus the nr_mapped from other processes, a 2-GB-VM leads to 2 GB + and so on. 

So the "true" value for nr_mapped would be nr_mapped minus the size of the VirtualBox-VMs. For my system that's always a few hundred thousand kB. Unfortunately, I don't see a way to derive that from the values available. But doing the calculations with this "true" nr_mapped gets a minimum_image_size that is clearly within the system's limit of max_size = 5900 MB - even in high-load-cases. 

A way out of this predicament could be to calculate minimum_image_size bottom-up, i. e. building the sum of all those items that can *not* be freed. Would that be possible?


Here are the test-results of a case with low nr_mapped and a case with high nr_mapped. 
(minimum_pages = minimum_image_size.)

gitk + several other apps:

~> free
             total       used       free     shared    buffers     cached
Mem:      11998028    8417232    3580796     343756    2212040    1514996
-/+ buffers/cache:    4690196    7307832
Swap:     10713084          0   10713084

~> cat /proc/meminfo
MemTotal:       11998028 kB
MemFree:         3579344 kB
MemAvailable:    6709252 kB
Buffers:         2212048 kB
Cached:          1285388 kB
SwapCached:            0 kB
Active:          5380248 kB
Inactive:        2604112 kB
Active(anon):    4280708 kB
Inactive(anon):   551372 kB
Active(file):    1099540 kB
Inactive(file):  2052740 kB
Unevictable:          32 kB
Mlocked:              32 kB
SwapTotal:      10713084 kB
SwapFree:       10713084 kB
Dirty:               152 kB
Writeback:             0 kB
AnonPages:       4486972 kB
Mapped:           418004 kB
Shmem:            345160 kB
Slab:             270268 kB
SReclaimable:     231020 kB
SUnreclaim:        39248 kB
KernelStack:        6608 kB
PageTables:        48780 kB

s2both/resume: OK.

PM: Preallocating image memory... 
save_highmem = 0 pages, 0 MB
saveable = 2209910 pages, 8632 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061389 pages, 11958 MB
count = 3023188 pages, 11809 MB
max_size = 1510460 pages, 5900 MB
user_specified_image_size = 1349778 pages, 5272 MB
adjusted_image_size = 1349779 pages, 5272 MB
minimum_pages = 236928 pages, 925 MB
target_image_size = 1349779 pages, 5272 MB
nr_should_reclaim = 860131 pages, 3359 MB
nr_reclaimed = 586655 pages, 2291 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512728 pages, 5909 MB
to_alloc_adjusted = 1512728 pages, 5909 MB
pages_allocated = 1512728 pages, 5909 MB
debug 26: alloc = 160681
debug 27: size = 0
debug 28: pages_highmem = 0
debug 29: alloc -= size; = 160681
debug 30: size = 160681
debug 31: pages_highmem = 0
debug 32: pages = 1673409
debug 33: pagecount before freeing unnecessary p. = 1348960
debug 34: pagecount after freeing unnecessary p. = 1348126
done (allocated 1673409 pages)
PM: Allocated 6693636 kbytes in 2.02 seconds (3313.68 MB/s)

-------------------------------------------------------------------------

2-GB-VM + gitk + several other apps:

~> free
             total       used       free     shared    buffers     cached
Mem:      11998028    9783252    2214776     293556    2169568    1081812
-/+ buffers/cache:    6531872    5466156
Swap:     10713084     588760   10124324

~> cat /proc/meminfo
MemTotal:       11998028 kB
MemFree:         2202012 kB
MemAvailable:    4475772 kB
Buffers:         2174972 kB
Cached:           490356 kB
SwapCached:       442936 kB
Active:          3764520 kB
Inactive:        3399284 kB
Active(anon):    3571804 kB
Inactive(anon):  1226852 kB
Active(file):     192716 kB
Inactive(file):  2172432 kB
Unevictable:          40 kB
Mlocked:              40 kB
SwapTotal:      10713084 kB
SwapFree:       10124620 kB
Dirty:              8740 kB
Writeback:             0 kB
AnonPages:       4095144 kB
Mapped:          2551032 kB
Shmem:            300128 kB
Slab:             210888 kB
SReclaimable:     155076 kB
SUnreclaim:        55812 kB
KernelStack:        7264 kB
PageTables:        54148 kB

s2both/resume: OK.

PM: Preallocating image memory... 
save_highmem = 0 pages, 0 MB
saveable = 2553652 pages, 9975 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061432 pages, 11958 MB
count = 3023231 pages, 11809 MB
max_size = 1510481 pages, 5900 MB
user_specified_image_size = 1349778 pages, 5272 MB
adjusted_image_size = 1349779 pages, 5272 MB
minimum_pages = 685190 pages, 2676 MB
target_image_size = 1349779 pages, 5272 MB
nr_should_reclaim = 1203873 pages, 4702 MB
nr_reclaimed = 652695 pages, 2549 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512750 pages, 5909 MB
to_alloc_adjusted = 1512750 pages, 5909 MB
pages_allocated = 1512750 pages, 5909 MB
debug 26: alloc = 160702
debug 27: size = 0
debug 28: pages_highmem = 0
debug 29: alloc -= size; = 160702
debug 30: size = 160702
debug 31: pages_highmem = 0
debug 32: pages = 1673452
debug 33: pagecount before freeing unnecessary p. = 1356279
debug 34: pagecount after freeing unnecessary p. = 1343240
done (allocated 1673452 pages)
PM: Allocated 6693808 kbytes in 11.95 seconds (560.15 MB/s)
Comment 8 Chen Yu 2016-01-08 06:56:49 UTC
(In reply to Jay from comment #2) 
> But as you can see "size" would be negative and s2disk would probably fail:
> 
> freeable – mapped:     -800.096                      -200.024
> 
BTW, why would s2disk fail if it is negative? Since size is unsigned, the "negative" value will become very big, so minimum_image_size will return 0 and shrink_all_memory will reclaim as many pages as possible. So?
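
For illustration, that arithmetic as a tiny stand-alone C program (page counts taken from comment #2, saveable from the first log in comment #7):

#include <stdio.h>

int main(void)
{
	unsigned long freeable = 481011;	/* pages, from comment #2 */
	unsigned long mapped   = 681035;	/* pages, from comment #2 */
	unsigned long saveable = 2209910;	/* pages, from the first log in comment #7 */
	unsigned long size = freeable - mapped;	/* "negative": wraps to a huge value */

	printf("size = %lu\n", size);
	printf("minimum_image_size = %lu\n",
	       saveable <= size ? 0UL : saveable - size);
	return 0;
}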
Comment 9 Rainer Fiebig 2016-01-08 19:18:03 UTC
(In reply to Chen Yu from comment #8)
> (In reply to Jay from comment #2) 
> > But as you can see "size" would be negative and s2disk would probably fail:
> > 
> > freeable – mapped:     -800.096                    -200.024
> > 
> BTW, why the s2disk fail if it is negative?  since size is a unsigned
> int,the negative value will become very big, thus minimum_image_size will
> return 0, thus shrink_all_memory will reclaim as much as possible pages. so?


True, but also see https://bugzilla.kernel.org/show_bug.cgi?id=47931#c63. 
In that case s2disk apparently failed although minimum_image_size returned 0.

But the main reason for failure is that the value returned by minimum_image_size() is > max_size. 
And if nr_mapped is high, this happens more often than not, because the function can't handle that situation: the return value is then way too high.

On the other hand: As long as minimum_image_size is <= max_size, s2disk can handle even very high-load cases.
It almost seems to me that one might use an arbitrary value for minimum_image_size: As long as it is <= max_size (and as long as the actual memory-load is also <= max_size), s2disk works.
Why else would it succeed if the return-value is 0?

Take a look at how that value propagates down to 

pages = preallocate_image_memory(alloc, avail_normal);
	if (pages < alloc) {...}	/* fewer pages than requested: hibernation fails */

Once pages < alloc, it's "game over".

With respect to success or not, shrink_all_memory() is IMO not so important. The key-issue is the value returned by minimum_image_size(). This should of course be as accurate as possible. But as long as the actual memory-load is <= max_size, almost any value <= max_size seems better than one > max_size.

For that reason I can live well with a more optimistic value for minimum_image_size. 

Also take a look at the test-result with gitk where nr_mapped is low but anon-pages is high: We have a memory-load of 8632 MB but a minimum_image_size of just 925 MB. To me that seems optimistic, too. :)

If we want more accuracy, maybe building minimum_image_size from those items that can not be freed, is an approach to explore.

So long!
Comment 10 Rainer Fiebig 2016-01-08 19:24:41 UTC
snip
> 
> static unsigned long minimum_image_size(unsigned long saveable)
> {
>       unsigned long size, sum_1, sum_2, nr_mapped;
>       
>       nr_mapped = global_page_state(NR_FILE_MAPPED);
> 
>       /* no common set with NR_FILE_MAPPED */
>       sum_1 = global_page_state(NR_SLAB_RECLAIMABLE)
>               + global_page_state(NR_ACTIVE_ANON)
>               + global_page_state(NR_INACTIVE_ANON);
>       
>       /* possible common set with NR_FILE_MAPPED */
>       sum_2 = + global_page_state(NR_ACTIVE_FILE)
>               + global_page_state(NR_INACTIVE_FILE);
>       
>       if (nr_mapped > sum_2)  /* NR_FILE_MAPPED bigger than common set */
>               size = sum_1    /* no point in subtracting it */
>                    + sum_2;           
>       else
>               size = sum_1
>                    + sum_2
>                      - nr_mapped;
>               
>       return saveable <= size ? 0 : saveable - size;
> }

One unnecessary "+". Should of course be

/* possible common set with NR_FILE_MAPPED */
>       sum_2 = global_page_state(NR_ACTIVE_FILE)
>               + global_page_state(NR_INACTIVE_FILE);

Jay
Comment 11 Rainer Fiebig 2016-01-10 14:26:39 UTC
Turned out my suggestion from Comment 7 is not bullet-proof.

There may be cases where nr_mapped is high but Active(file) + Inactive(file) (= sum_2) is even higher and so it behaves just like the original version and fails (see example below).

So if we want s2disk to also work reliably when nr_mapped is high, I see two possibilities:
- exclude nr_mapped from the calculation
- build minimum_image_size from those items that can not be freed

For the latter I have constructed a minimum_image_size(), to get an idea of the value returned (see example below).
Values over 5900 MB mean failure.

--------

~> free
             total       used       free     shared    buffers     cached
Mem:      11998028    9927608    2070420     305720    4183136    1542068
-/+ buffers/cache:    4202404    7795624
Swap:     10713084     498536   10214548

~> cat /proc/meminfo
MemTotal:       11998028 kB
MemFree:         2053796 kB
MemAvailable:    6802532 kB
Buffers:         4183152 kB
Cached:           959476 kB
SwapCached:       417540 kB
Active:          1250348 kB
Inactive:        4849364 kB
Active(anon):     269652 kB
Inactive(anon):  1007892 kB
Active(file):     980696 kB
Inactive(file):  3841472 kB
Unevictable:          40 kB
Mlocked:              40 kB
SwapTotal:      10713084 kB
SwapFree:       10214680 kB
Dirty:               176 kB
Writeback:             0 kB
AnonPages:        605396 kB
Mapped:          3730084 kB
Shmem:            320460 kB
Slab:             247008 kB
SReclaimable:     179960 kB
SUnreclaim:        67048 kB
KernelStack:        7792 kB
PageTables:        50544 kB


PM: Preallocating image memory... 
save_highmem = 0 pages, 0 MB
saveable = 2586720 pages, 10104 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061361 pages, 11958 MB
count = 3023160 pages, 11809 MB
max_size = 1510446 pages, 5900 MB
user_specified_image_size = 1349778 pages, 5272 MB
adjusted_image_size = 1349779 pages, 5272 MB

minimum_pages (by exluding nr_mapped) = 981340 pages, 3833 MB
minimum_pages, by nonfreeable items = 1083551 pages, 4232 MB
minimum_pages, if (nr_mapped > sum_2) = 1927305 pages, 7528 MB
minimum_pages, orig. version = 1927305 pages, 7528 MB

target_image_size = 1349779 pages, 5272 MB
nr_should_reclaim = 1236941 pages, 4831 MB
nr_reclaimed = 1058877 pages, 4136 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512714 pages, 5909 MB
to_alloc_adjusted = 1512714 pages, 5909 MB
pages_allocated = 1512714 pages, 5909 MB
debug 26: alloc = 160667
debug 27: size = 0
debug 28: pages_highmem = 0
debug 29: alloc -= size; = 160667
debug 30: size = 160667
debug 31: pages_highmem = 0
debug 32: pages = 1673381
debug 33: pagecount before freeing unnecessary p. = 1351076
debug 34: pagecount after freeing unnecessary p. = 1350516
done (allocated 1673381 pages)
PM: Allocated 6693524 kbytes in 1.61 seconds (4157.46 MB/s)
Comment 12 Chen Yu 2016-01-12 13:21:24 UTC
(In reply to Jay from comment #11)
> Turned out my suggestion from Comment 7 is not bullet-proof.
> 
> There may be cases where nr_mapped is high but Active(file) + Inactive(file)
> (= sum_2) is even higher and so it behaves just like the original version
> and fails (see example below).
> 
> So if we want s2disk to also work reliably when nr_mapped is high, I see two
> possibilities:
> - exclude nr_mapped from the calculation
> - build minimum_image_size from those items that can not be freed
> 
> For the latter I have constructed a minimum_image_size(), to get an idea of
> the value returned (see example below).
> Values over 5900 MB mean failure.
> 
> --------
> 
> ~> free
>              total       used       free     shared    buffers     cached
> Mem:      11998028    9927608    2070420     305720    4183136    1542068
> -/+ buffers/cache:    4202404    7795624
> Swap:     10713084     498536   10214548
> 
> ~> cat /proc/meminfo
> MemTotal:       11998028 kB
> MemFree:         2053796 kB
> MemAvailable:    6802532 kB
> Buffers:         4183152 kB
> Cached:           959476 kB
> SwapCached:       417540 kB
> Active:          1250348 kB
> Inactive:        4849364 kB
> Active(anon):     269652 kB
> Inactive(anon):  1007892 kB
> Active(file):     980696 kB
> Inactive(file):  3841472 kB
> Unevictable:          40 kB
> Mlocked:              40 kB
> SwapTotal:      10713084 kB
> SwapFree:       10214680 kB
> Dirty:               176 kB
> Writeback:             0 kB
> AnonPages:        605396 kB
> Mapped:          3730084 kB
> Shmem:            320460 kB
> Slab:             247008 kB
> SReclaimable:     179960 kB
> SUnreclaim:        67048 kB
> KernelStack:        7792 kB
> PageTables:        50544 kB
> 
> 
> PM: Preallocating image memory... 
> save_highmem = 0 pages, 0 MB
> saveable = 2586720 pages, 10104 MB
> highmem = 0 pages, 0 MB
> additional_pages = 220 pages, 0 MB
> avail_normal = 3061361 pages, 11958 MB
> count = 3023160 pages, 11809 MB
> max_size = 1510446 pages, 5900 MB
> user_specified_image_size = 1349778 pages, 5272 MB
> adjusted_image_size = 1349779 pages, 5272 MB
> 
> minimum_pages (by exluding nr_mapped) = 981340 pages, 3833 MB
> minimum_pages, by nonfreeable items = 1083551 pages, 4232 MB
> minimum_pages, if (nr_mapped > sum_2) = 1927305 pages, 7528 MB
> minimum_pages, orig. version = 1927305 pages, 7528 MB

Interesting, I was almost persuaded by your previous solution at #Comment 7,
but for the nr_mapped > sum_2 case, why not:
if (nr_mapped > sum_2)
		size = sum_1;
since size should be the number of finally-reclaimable pages, and nr_mapped has occupied most of the active/inactive file pages.
Comment 13 Chen Yu 2016-01-12 13:23:41 UTC
> minimum_pages, by nonfreeable items = 1083551 pages, 4232 MB
how do you calculate it?
Comment 14 Rainer Fiebig 2016-01-12 19:02:42 UTC
snip
> > 
> > minimum_pages (by exluding nr_mapped) = 981340 pages, 3833 MB
> > minimum_pages, by nonfreeable items = 1083551 pages, 4232 MB
> > minimum_pages, if (nr_mapped > sum_2) = 1927305 pages, 7528 MB
> > minimum_pages, orig. version = 1927305 pages, 7528 MB
> 
> Interesting, I was almost persuaded by your previous solution at #Comment 7,

Me too.
But then I looked at the facts. And the facts looked back at me with their hard, cold eyes... ;) 

> but for the nr_mapped > sum_2 case, why not:
> if (nr_mapped > sum_2)
>               size = sum_1;
> since size should be the finally reclaimable page number, and nr-mapped has
> occupied most of the active/inactive file pages.

I think it would fail too often. Like in the following case where "size" then would be only 1.767.476 kB and minimum_image_size 7.226.316 kB:

~> cat /proc/meminfo
MemTotal:       11998028 kB
MemFree:         3429856 kB
MemAvailable:    5346516 kB
Buffers:          603656 kB
Cached:          1813444 kB
SwapCached:       207108 kB
Active:          1394516 kB
Inactive:        2213028 kB
Active(anon):     848996 kB
Inactive(anon):   754800 kB
Active(file):     545520 kB
Inactive(file):  1458228 kB
Unevictable:          80 kB
Mlocked:              80 kB
SwapTotal:      10713084 kB
SwapFree:        9886664 kB
Dirty:               156 kB
Writeback:             0 kB
AnonPages:       1097428 kB
Mapped:          4993336 kB
Shmem:            413352 kB
Slab:             243692 kB
SReclaimable:     163680 kB
SUnreclaim:        80012 kB
KernelStack:        8912 kB
PageTables:        59596 kB

PM: Preallocating image memory... 
save_highmem = 0 pages, 0 MB
saveable = 2248550 pages, 8783 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061493 pages, 11958 MB
count = 3023292 pages, 11809 MB
max_size = 1510512 pages, 5900 MB
user_specified_image_size = 1349778 pages, 5272 MB
adjusted_image_size = 1349779 pages, 5272 MB

minimum_pages (by exluding nr_mapped) = 1264899 pages, 4941 MB
minimum_pages, by nonfreeable items = 1436206 pages, 5610 MB
minimum_pages, if (nr_mapped > sum_2) = 1264899 pages, 4941 MB
minimum_pages, orig. version = 0 pages, 0 MB

target_image_size = 1349779 pages, 5272 MB
nr_should_reclaim = 898771 pages, 3510 MB
nr_reclaimed = 423743 pages, 1655 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512780 pages, 5909 MB
to_alloc_adjusted = 1512780 pages, 5909 MB
pages_allocated = 1512780 pages, 5909 MB
...
Comment 15 Rainer Fiebig 2016-01-12 19:12:12 UTC
(In reply to Chen Yu from comment #13)
> > minimum_pages, by nonfreeable items = 1083551 pages, 4232 MB
> how do you calculate it?

I don't know much about this field, so a good deal of guesswork was involved. 
I hope, not everything is BS.

/* For testing only. minimum_image_size built from (presumably) non-freeable items */
static unsigned long minimum_image_size_nonfreeable(void)
{
	unsigned long minsize;
	
	minsize = global_page_state(NR_UNEVICTABLE)
			+ global_page_state(NR_MLOCK)
			+ global_page_state(NR_FILE_DIRTY)
			+ global_page_state(NR_WRITEBACK)
			+ global_page_state(NR_FILE_MAPPED)
			+ global_page_state(NR_SHMEM)
			+ global_page_state(NR_SLAB_UNRECLAIMABLE)
			+ global_page_state(NR_KERNEL_STACK)
			+ global_page_state(NR_PAGETABLE);
			
	return minsize;
}


Here's a case with a low value for nr_mapped (no VM loaded):

~> cat /proc/meminfo
MemTotal:       11998028 kB
MemFree:         4380836 kB
MemAvailable:    6734148 kB
Buffers:         1319704 kB
Cached:          1387932 kB
SwapCached:            0 kB
Active:          5006612 kB
Inactive:        2210540 kB
Active(anon):    4511240 kB
Inactive(anon):   295600 kB
Active(file):     495372 kB
Inactive(file):  1914940 kB
Unevictable:          32 kB
Mlocked:              32 kB
SwapTotal:      10713084 kB
SwapFree:       10713084 kB
Dirty:               252 kB
Writeback:             0 kB
AnonPages:       4509608 kB
Mapped:           360992 kB
Shmem:            297316 kB
Slab:             235320 kB
SReclaimable:     196392 kB
SUnreclaim:        38928 kB
KernelStack:        8064 kB
PageTables:        49000 kB

s2both/resume OK. Resume took relatively long again (gitk).

PM: Preallocating image memory... 
save_highmem = 0 pages, 0 MB
saveable = 2012027 pages, 7859 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061353 pages, 11958 MB
count = 3023152 pages, 11809 MB
max_size = 1510442 pages, 5900 MB
user_specified_image_size = 1349778 pages, 5272 MB
adjusted_image_size = 1349779 pages, 5272 MB

minimum_pages (by exluding nr_mapped) = 116532 pages, 455 MB
minimum_pages, by nonfreeable items = 240289 pages, 938 MB
minimum_pages, if (nr_mapped > sum_2) = 224871 pages, 878 MB
minimum_pages, orig. version = 224871 pages, 878 MB

target_image_size = 1349779 pages, 5272 MB
nr_should_reclaim = 662248 pages, 2586 MB
nr_reclaimed = 446618 pages, 1744 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512710 pages, 5909 MB
to_alloc_adjusted = 1512710 pages, 5909 MB
pages_allocated = 1512710 pages, 5909 MB
...
Comment 16 Chen Yu 2016-01-13 08:40:55 UTC
(In reply to Jay from comment #14)
> snip
> > > 
> > > minimum_pages (by exluding nr_mapped) = 981340 pages, 3833 MB
> > > minimum_pages, by nonfreeable items = 1083551 pages, 4232 MB
> > > minimum_pages, if (nr_mapped > sum_2) = 1927305 pages, 7528 MB
> > > minimum_pages, orig. version = 1927305 pages, 7528 MB
> > 
> > Interesting, I was almost persuaded by your previous solution at #Comment
> 7,
> 
> Me too.
> But then I looked at the facts. And the facts looked back at me with their
> hard, cold eyes... ;) 
> 
> > but for the nr_mapped > sum_2 case, why not:
> > if (nr_mapped > sum_2)
> >             size = sum_1;
> > since size should be the finally reclaimable page number, and nr-mapped has
> > occupied most of the active/inactive file pages.
> 
> I think it would fail too often. Like in the following case where "size"
> then would be only 1.767.476 kB and minimum_image_size 7.226.316 kB:
> 
OK. 
I'm still curious why nr_mapped occupies so many pages, so I checked the code again. It seems that nr_mapped will not increase if another task has mmapped the same file (#Comment 5 might be wrong: if task A has mmapped the file, task B mmapping the same file at the same offset will not increase nr_mapped). So in theory nr_mapped should not exceed the total of active + inactive file pages, except when some device drivers have remapped kernel pages (vmalloc, for example), but that situation is not common.
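
The relevant code looks roughly like this (heavily simplified; the real page_add_file_rmap() in mm/rmap.c also takes memcg accounting locks):

void page_add_file_rmap(struct page *page)
{
	/* NR_FILE_MAPPED is only bumped when the page gets its first mapping */
	if (atomic_inc_and_test(&page->_mapcount))
		__inc_zone_page_state(page, NR_FILE_MAPPED);
}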

In order to find the root cause of this problem, we need to find out why nr_mapped is so high. Please help to test as follows:

1. Before the VM is started (when nr_mapped is low):

[root@localhost tracing]# pwd
/sys/kernel/debug/tracing

// track page_add_file_rmap, which will increase nr_mapped
[root@localhost tracing]# echo page_add_file_rmap > set_ftrace_filter

// enable function tracer
[root@localhost tracing]# echo function > current_tracer

// enable function callback
[root@localhost tracing]# echo 1 > options/func_stack_trace


2. Start your VM (VirtualBox?) and let nr_mapped grow to a high level, then:

//save the tracer data(might be very big)
[root@localhost tracing]# cat trace > /home/tracer_nr_mapped.log

[root@localhost tracing]# echo 0 > options/func_stack_trace
[root@localhost tracing]#  echo > set_ftrace_filter

//stop the tracer
[root@localhost tracing]# echo 0 > tracing_on


3. attach your tracer_nr_mapped.log
Comment 17 Rainer Fiebig 2016-01-13 22:15:09 UTC
Created attachment 199581 [details]
Tracing-Log
Comment 18 Rainer Fiebig 2016-01-13 22:18:23 UTC
Hi Yu, 

the log is attached, hope it helps.

It would be nice if you could give me feedback on minimum_image_size_nonfreeable() and tell me which items to delete or add.
Comment 19 Chen Yu 2016-01-15 00:59:50 UTC
(In reply to Jay from comment #18)
> Hi Yu, 
> 
> the log is attached, hope it helps.
> 
> It would be nice if you could give me a feedback on
> minimum_image_size_nonfreeable() and tell me which items to delete or  add.

thanks for your info.

Theoretically, minimum_image_size should be the number of saveable pages minus the number of reclaimable pages 'within' those saveable pages. Why? Because the snapshot image is only concerned with the saveable pages. You cannot simply add the unreclaimable counters together without considering the saveable pages, IMO; besides, NR_FILE_DIRTY and NR_WRITEBACK might actually be reclaimable.

With regard to NR_FILE_MAPPED,
I checked your log and it shows that most of the nr_mapped increase happened in the following path (yes, it is reading from an mmap(fd)):

      VirtualBox-3151  [000] ...1   523.775961: page_add_file_rmap <-do_set_pte
      VirtualBox-3151  [000] ...1   523.775963: <stack trace>
 => update_curr
 => page_add_file_rmap
 => put_prev_entity
 => page_add_file_rmap
 => do_set_pte
 => filemap_map_pages
 => do_read_fault.isra.61
 => handle_mm_fault
 => get_futex_key
 => hrtimer_wakeup
 => __do_page_fault
 => do_futex
 => do_page_fault
 => page_fault

However, before a page is added to NR_FILE_MAPPED, it has already been linked to the inactive file list; that is to say, Inactive(file) + Active(file) should be bigger than or equal to NR_FILE_MAPPED. I don't know why NR_FILE_MAPPED is so large; I'll send this question to the mm mailing list and ask for help. Also, which kernel version are you using, and how many CPUs does your platform have?
Comment 20 Rainer Fiebig 2016-01-15 20:24:00 UTC
(In reply to Chen Yu from comment #19)

> Theoretically, the minimum_image_size should be the number of pages minus
> the number of reclaimable 'within' savable_pages, why? because the snaphot
> image only concerns about the savable pages. You can not simply add the
> unreclaimable together w/o considering the savable pages IMO, besides,
> NR_FILE_DIRTY and NR_WRITEBACK might be actually reclaimable.

But in low-nr_mapped-cases the values are close to what the original delivers:

minimum_pages, by nonfreeable items = 224814 pages, 878 MB
minimum_pages, orig. version = 221699 pages, 866 MB

> large, I'll send this question to mm-mailist and ask for a help. And which
> kernel version are you using? how many CPUs does you platform have?

Right now it's kernel 3.18.xx. But the problem was there from at least 3.7 on - whether it was an untouched distribution-kernel or self-compiled. I haven't tried a kernel later than 4. 

One Intel 64-bit-CPU.
Comment 21 Chen Yu 2016-01-16 04:26:06 UTC
(In reply to Jay from comment #20)
> (In reply to Chen Yu from comment #19)
> 
> > Theoretically, the minimum_image_size should be the number of pages minus
> > the number of reclaimable 'within' savable_pages, why? because the snaphot
> > image only concerns about the savable pages. You can not simply add the
> > unreclaimable together w/o considering the savable pages IMO, besides,
> > NR_FILE_DIRTY and NR_WRITEBACK might be actually reclaimable.
> 
> But in low-nr_mapped-cases the values are close to what the original
> delivers:
> 
> minimum_pages, by nonfreeable items = 224814 pages, 878 MB
> minimum_pages, orig. version = 221699 pages, 866 MB
> 
Hmm, yes, it is suitable in your case, but I'm not sure whether it fits other users; we need more investigation on this. But before that, could you please try the latest 4.4 kernel? Since I cannot reproduce your problem on that version, I want to confirm whether the calculation of nr_mapped has already been fixed.
Comment 22 Rainer Fiebig 2016-01-16 13:35:06 UTC
(In reply to Chen Yu from comment #21)
> (In reply to Jay from comment #20)
> > (In reply to Chen Yu from comment #19)
> > 
> > > Theoretically, the minimum_image_size should be the number of pages minus
> > > the number of reclaimable 'within' savable_pages, why? because the
> snaphot
> > > image only concerns about the savable pages. You can not simply add the
> > > unreclaimable together w/o considering the savable pages IMO, besides,
> > > NR_FILE_DIRTY and NR_WRITEBACK might be actually reclaimable.
> > 
> > But in low-nr_mapped-cases the values are close to what the original
> > delivers:
> > 
> > minimum_pages, by nonfreeable items = 224814 pages, 878 MB
> > minimum_pages, orig. version = 221699 pages, 866 MB
> > 
> Humm, yes, it is suitable in your case, but I'm not sure if it fits for
> other users, we need more investigation on this. But before that, could you
> please have a try on latest 4.4 kernel?  Since I can not reproduce your
> problem on this version, I  want to confirm if the calculating of nr_mapped
> has been fixed already.

No, values are not different with 4.4.0 (see below). Which version of VirtualBox are you using? I'm using 4.3.34. The format of my VM-harddisks is native-VB (.vdi).

Kernel 4.4.0, a 2-GB-XP-VM running. "Baseline"-nr_mapped was 484368 kB:

~> cat /proc/meminfo
MemTotal:       11975164 kB
MemFree:         5432192 kB
MemAvailable:    8155428 kB
Buffers:         2114456 kB
Cached:          1056064 kB
SwapCached:            0 kB
Active:          2212676 kB
Inactive:        1788856 kB
Active(anon):     832660 kB
Inactive(anon):   511620 kB
Active(file):    1380016 kB
Inactive(file):  1277236 kB
Unevictable:          40 kB
Mlocked:              40 kB
SwapTotal:      10713084 kB
SwapFree:       10713084 kB
Dirty:                80 kB
Writeback:           288 kB
AnonPages:        831128 kB
Mapped:          2769240 kB
Shmem:            513192 kB
Slab:             175244 kB
SReclaimable:     117620 kB
SUnreclaim:        57624 kB
KernelStack:        6976 kB
PageTables:        45912 kB
Comment 23 Chen Yu 2016-01-16 15:57:18 UTC
(In reply to Jay from comment #22)
> (In reply to Chen Yu from comment #21)
> > (In reply to Jay from comment #20)
> > > (In reply to Chen Yu from comment #19)
> > > 
> > > > Theoretically, the minimum_image_size should be the number of pages
> minus
> > > > the number of reclaimable 'within' savable_pages, why? because the
> snaphot
> > > > image only concerns about the savable pages. You can not simply add the
> > > > unreclaimable together w/o considering the savable pages IMO, besides,
> > > > NR_FILE_DIRTY and NR_WRITEBACK might be actually reclaimable.
> > > 
> > > But in low-nr_mapped-cases the values are close to what the original
> > > delivers:
> > > 
> > > minimum_pages, by nonfreeable items = 224814 pages, 878 MB
> > > minimum_pages, orig. version = 221699 pages, 866 MB
> > > 
> > Humm, yes, it is suitable in your case, but I'm not sure if it fits for
> > other users, we need more investigation on this. But before that, could you
> > please have a try on latest 4.4 kernel?  Since I can not reproduce your
> > problem on this version, I  want to confirm if the calculating of nr_mapped
> > has been fixed already.
> 
> No, values are not different with 4.4.0 (see below). Which version of
> VirtualBox are you using? I'm using 4.3.34. The format of my VM-harddisks is
> native-VB (.vdi).
> 
> Kernel 4.4.0, a 2-GB-XP-VM running. "Baseline"-nr_mapped was 484368 kB:
> 
I did not use VirtualBox but wrote a small program that mmaps a 2 GB disk file to simulate it.
What does VM-harddisks (native-VB) mean - is it something like a ramdisk for VirtualBox?
Comment 24 Rainer Fiebig 2016-01-16 17:09:05 UTC
...
> I did not use Virtual Box but wrote a small mmap(2G disk file) to simulate.
> What does VM-harddisks(native-VB) mean, is it something like a ramdisk for
> Virtual Box?

Sorry.
.vdi is VB's standard-format for the VM's virtual harddisk, like .vmdk is for VMware.

I've also created a VB-VM with the .vmdk-format but nr_mapped is the same as with an otherwise identical .vdi-VM.

Don't know whether this is relevant but I've noticed that the nr_mapped for a 32-bit-Windows-VM is significantly higher than that for 64-bit-Linux-VMs. The close correlation of memory-value chosen for the VM and nr_mapped is only true for Windows-VMs: A 2-GB-XP-VM *always* leads to nr_mapped of 2 GB plus baseline-nr_mapped (as in the example in my last comment). But a 2- or 3-GB-Linux-VM increases nr_mapped only by about 500 MB, at least initially. 

With each additional Linux-VM nr_mapped goes up but not as much as with a 32-bit-Windows-VM. Nevertheless, nr_mapped is still relatively high.

Example

Baseline:

~> cat /proc/meminfo
MemTotal:       11997612 kB
MemFree:         8232552 kB
MemAvailable:   10381072 kB
Buffers:         1394968 kB
Cached:          1155064 kB
SwapCached:            0 kB
Active:          1406316 kB
Inactive:        2017316 kB
Active(anon):     875448 kB
Inactive(anon):   303740 kB
Active(file):     530868 kB
Inactive(file):  1713576 kB
Unevictable:          32 kB
Mlocked:              32 kB
SwapTotal:      10713084 kB
SwapFree:       10713084 kB
Dirty:                60 kB
Writeback:             0 kB
AnonPages:        873648 kB
Mapped:           398880 kB
Shmem:            305588 kB
Slab:             184752 kB
SReclaimable:     146024 kB
SUnreclaim:        38728 kB
KernelStack:        6832 kB
PageTables:        40300 kB

2-GB-Linux-VM, values at login:

~> cat /proc/meminfo
MemTotal:       11997612 kB
MemFree:         7610076 kB
MemAvailable:    9761128 kB
Buffers:         1397120 kB
Cached:          1230064 kB
SwapCached:            0 kB
Active:          1648540 kB
Inactive:        1901528 kB
Active(anon):     924724 kB
Inactive(anon):   378520 kB
Active(file):     723816 kB
Inactive(file):  1523008 kB
Unevictable:          40 kB
Mlocked:              40 kB
SwapTotal:      10713084 kB
SwapFree:       10713084 kB
Dirty:              2068 kB
Writeback:             0 kB
AnonPages:        922916 kB
Mapped:           935824 kB
Shmem:            380368 kB
Slab:             188808 kB
SReclaimable:     146324 kB
SUnreclaim:        42484 kB
KernelStack:        7200 kB
PageTables:        42348 kB

Another 2-GB-Linux-VM started (from saved state):

~> cat /proc/meminfo
MemTotal:       11997612 kB
MemFree:         6021372 kB
MemAvailable:    8572456 kB
Buffers:         1796464 kB
Cached:          1262508 kB
SwapCached:            0 kB
Active:          1702776 kB
Inactive:        2332824 kB
Active(anon):     978564 kB
Inactive(anon):   410864 kB
Active(file):     724212 kB
Inactive(file):  1921960 kB
Unevictable:          40 kB
Mlocked:              40 kB
SwapTotal:      10713084 kB
SwapFree:       10713084 kB
Dirty:               576 kB
Writeback:             0 kB
AnonPages:        976764 kB
Mapped:          2040000 kB
Shmem:            412712 kB
Slab:             199012 kB
SReclaimable:     147696 kB
SUnreclaim:        51316 kB
KernelStack:        7600 kB
PageTables:        45504 kB

Another 3-GB-Linux-VM started, values at login:

~> cat /proc/meminfo
MemTotal:       11997612 kB
MemFree:         5272052 kB
MemAvailable:    8006412 kB
Buffers:         1979004 kB
Cached:          1305652 kB
SwapCached:            0 kB
Active:          1777544 kB
Inactive:        2555572 kB
Active(anon):    1050316 kB
Inactive(anon):   453812 kB
Active(file):     727228 kB
Inactive(file):  2101760 kB
Unevictable:          40 kB
Mlocked:              40 kB
SwapTotal:      10713084 kB
SwapFree:       10713084 kB
Dirty:              1160 kB
Writeback:             0 kB
AnonPages:       1048544 kB
Mapped:          2509620 kB
Shmem:            455660 kB
Slab:             203416 kB
SReclaimable:     148616 kB
SUnreclaim:        54800 kB
KernelStack:        8048 kB
PageTables:        47456 kB
Comment 25 Rainer Fiebig 2016-01-17 23:17:58 UTC
Hi Yu,

I know what caused the high nr_mapped.

Two days ago I noticed that one of the VMs (which I almost never use) did not increase nr_mapped above the baseline-value. So I checked its settings and compared it to those of the other VMs. No relevant differences. Even with every possible option that the VirtualBox-GUI offers set to the same value, nr_mapped stayed low while it rose to the known high levels for the others.

Today I used VB's command-line-tool, VBoxManage, to compare the settings again. Turned out that the VM in fact does have a different setting: "Large Pages: on" instead of "off" like for the others.
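
For reference, checking and changing the setting goes roughly like this ("My-VM" is just a placeholder name; --largepages is the switch described in the manual passage quoted below):

~> VBoxManage showvminfo "My-VM" | grep -i "large pages"
~> VBoxManage modifyvm "My-VM" --largepages on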

The VirtualBox manual says:

"• --largepages on|off: If hardware virtualization and nested paging are enabled, for Intel VT-x only, an additional performance improvement of up to 5% can be obtained by enabling this setting. This causes the hypervisor to use large pages to reduce TLB use and overhead."

Up to this day, I didn't even know this option existed, let alone having set it to "on" for a VM. I use the VB-GUI to create and manage the VMs. And the GUI doesn't show this option.

I must have created this particular VM with a VB version that had "Large Pages: on" as the default, perhaps under Windows. I can't imagine another explanation.

Setting "Large Pages: on" for the other VMs had a dramatic effect on nr_mapped. For three 2-GB-VMs loaded (plus other apps running): Mapped: 621284 kB.

That's a value the original minimum_image_size() can handle:

PM: Preallocating image memory... 
save_highmem = 0 pages, 0 MB
saveable = 2155130 pages, 8418 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061409 pages, 11958 MB
count = 3023206 pages, 11809 MB
max_size = 1510469 pages, 5900 MB
user_specified_image_size = 1349731 pages, 5272 MB
adjusted_image_size = 1349732 pages, 5272 MB

minimum_pages (by excluding nr_mapped) = 1158750 pages, 4526 MB
minimum_pages, by nonfreeable items = 311775 pages, 1217 MB
minimum_pages, if (nr_mapped > sum_2) = 1328725 pages, 5190 MB
minimum_pages, orig. version = 1328725 pages, 5190 MB

target_image_size = 1349732 pages, 5272 MB
nr_should_reclaim = 805398 pages, 3146 MB
nr_reclaimed = 669510 pages, 2615 MB
...

To me it seems we may close this bug-report as "solved". Although I still think that a high nr_mapped is the Achilles' heel of minimum_image_size().

If this "Large Page: on|off"-info is in any way relevant for the Linux-memory-management, you will know what to do.
Would be nice to know if there's a kernel-setting that could prevent this nasty behaviour.
I shall inform a Linux-forum about this (I promised to do so) and perhaps the VirtualBox-people.

For now: Thanks for your time and effort!

Jay
Comment 26 Chen Yu 2016-01-18 02:20:12 UTC
(In reply to Jay from comment #25)
> Setting "Large Pages: on" for the other VMs had a dramatic effect on
> nr_mapped. For three 2-GB-VMs loaded (plus other apps running): Mapped:
> 621284 kB.
> 
> To me it seems we may close this bug-report as "solved". Although I still
> think that a high nr_mapped is the Achilles' heel of minimum_image_size().
Yes, there is still a problem in minimum_image_size, IMO.
> 
> If this "Large Page: on|off"-info is in any way relevant for the
> Linux-memory-management, you will know what to do.
> Would be nice to know if there's a kernel-setting that could prevent this
> nasty behaviour.
Can you upload your kernel config file? I wonder if this is related to transparent hugepages or hugetlb. Another question: does /proc/meminfo show any hugetlb info such as:

HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      243504 kB
DirectMap2M:    11210752 kB
DirectMap1G:    24117248 kB
Comment 27 Chen Yu 2016-01-18 02:22:01 UTC
(In reply to Jay from comment #25)
> To me it seems we may close this bug-report as "solved". Although I still
> think that a high nr_mapped is the Achilles' heel of minimum_image_size().
> 
> If this "Large Page: on|off"-info is in any way relevant for the
> Linux-memory-management, you will know what to do.
> Would be nice to know if there's a kernel-setting that could prevent this
> nasty behaviour.
> I shall inform a Linux-forum about this (I promised to do so) and perhaps
> the VirtualBox-people.
It would be nice if you could ask the VirtualBox people what 'Large Pages' does in their software.
> 
> For now: Thanks for your time and effort!
> 
> Jay
Comment 28 Rainer Fiebig 2016-01-18 17:25:42 UTC
(In reply to Chen Yu from comment #26)
> (In reply to Jay from comment #25)
> > Setting "Large Pages: on" for the other VMs had a dramatic effect on
> > nr_mapped. For three 2-GB-VMs loaded (plus other apps running): Mapped:
> > 621284 kB.
> > 
> > To me it seems we may close this bug-report as "solved". Although I still
> > think that a high nr_mapped is the Achilles' heel of minimum_image_size().
> Yes, there is still problem in minimum_image_size IMO.
> > 

And it seems a hard nut to crack.

If you can't find a way to stop nr_mapped going through the roof or can't warm to the idea of excluding nr_mapped, perhaps we should just use a value like "used - buffers/cache" in "free" for minimum_pages. This would have worked most of the time. But I couldn't find the source-code, so I don't know how it is calculated.
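
Judging from the numbers in comment #1 (11272020 - 1404316 - 1673400 = 8194304 kB), it is presumably just "used" minus Buffers minus Cached, i.e. roughly:

~> awk '/^(MemTotal|MemFree|Buffers|Cached):/ {v[$1]=$2}
   END {print v["MemTotal:"] - v["MemFree:"] - v["Buffers:"] - v["Cached:"], "kB"}' /proc/meminfo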

> > If this "Large Page: on|off"-info is in any way relevant for the
> > Linux-memory-management, you will know what to do.
> > Would be nice to know if there's a kernel-setting that could prevent this
> > nasty behaviour.
> Can you upload your kernel config file? I wonder if this is related to 
> transparent pages, or hugetlb. Another question is, is the /proc/meminfo
> shows any hugetlb info such as:
> 
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:      243504 kB
> DirectMap2M:    11210752 kB
> DirectMap1G:    24117248 kB

AnonHugePages:    821248 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7680 kB
DirectMap2M:    12238848 kB
Comment 29 Rainer Fiebig 2016-01-18 17:30:18 UTC
(In reply to Chen Yu from comment #27)

> > I shall inform a Linux-forum about this (I promised to do so) and perhaps
> > the VirtualBox-people.
> It would be nice if you can help ask the VirtualBox people what 'Large Page'
> does in their software.

Well, I got second thoughts about informing the VirtualBox-guys. 

I would have had to create an Oracle account to file a bug-report/forum-post. And for my taste they wanted way too much private information from me for just that. 

But perhaps this info from the manual may be useful for you:

"10.7 Nested paging and VPIDs
In addition to “plain” hardware virtualization, your processor may also support additional sophisticated techniques:2

• A newer feature called “nested paging” implements some memory management in hardware, which can greatly accelerate hardware virtualization since these tasks no longer need to be performed by the virtualization software.

With nested paging, the hardware provides another level of indirection when translating linear to physical addresses. Page tables function as before, but linear addresses are now translated to “guest physical” addresses first and not physical addresses directly. A new set of paging registers now exists under the traditional paging mechanism and translates from guest physical addresses to host physical addresses, which are used to access memory.

Nested paging eliminates the overhead caused by VM exits and page table accesses. In essence, with nested page tables the guest can handle paging without intervention from the hypervisor. Nested paging thus significantly improves virtualization performance.

On AMD processors, nested paging has been available starting with the Barcelona (K10) architecture – they call it now “rapid virtualization indexing” (RVI). 

Intel added support for nested paging, which they call “extended page tables” (EPT), with their Core i7 (Nehalem) processors.

If nested paging is enabled, the VirtualBox hypervisor can also use large pages to reduce TLB usage and overhead. This can yield a performance improvement of up to 5%. To enable this feature for a VM, you need to use the VBoxManage modifyvm --largepages command; see chapter 8.8, VBoxManage modifyvm, page 131.

• On Intel CPUs, another hardware feature called “Virtual Processor Identifiers” (VPIDs) can greatly accelerate context switching by reducing the need for expensive flushing of the processor’s Translation Lookaside Buffers (TLBs).

To enable these features for a VM, you need to use the VBoxManage modifyvm --vtxvpid and --largepages commands; see chapter 8.8, VBoxManage modifyvm, page 131.


2VirtualBox 2.0 added support for AMD’s nested paging; support for Intel’s EPT and VPIDs was added with version 2.1."
Comment 30 Rainer Fiebig 2016-01-19 21:27:29 UTC
Here's what I dug up from the current kernel:

grep -i hugepage .config
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set

grep -i hugetlb .config
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_CGROUP_HUGETLB=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y

The VirtualBox manual says that "large pages: on|off" is for Intel VT-x only. So maybe for AMD CPUs nr_mapped cannot be reduced to a "normal" range.
Comment 31 Rainer Fiebig 2016-01-20 19:55:22 UTC
Could it be that only part of nr_mapped is in Active(file)/Inactive(file), say 10 to 30 % or so?

And the problem comes from subtracting 100 %?

This is the status before and after a drop_caches. It's clear that not all of nr_mapped can be contained in the other two. And Active(file)/Inactive(file) may not belong 100 % to nr_mapped:

Before:

Active(file):     231984 kB
Inactive(file):   412508 kB
Mapped:          2802508 kB


After:

Active(file):     180420 kB
Inactive(file):    93280 kB
Mapped:          2802812 kB
Comment 32 Rainer Fiebig 2016-02-05 21:44:46 UTC
(In reply to Jay from comment #31)
> Could it be that only part of nr_mapped is in Active(file)/Inactive(file),
> say 10 to 30 % or so?
> 
> And the problem comes from subtracting 100 %?
> 
> This is the status before and after a drop_caches. It's clear that not all
> of nr_mapped can be contained in the other two. And
> Active(file)/Inactive(file) may not belong 100 % to nr_mapped:
> 
> Before:
> 
> Active(file):     231984 kB
> Inactive(file):   412508 kB
> Mapped:          2802508 kB
> 
> 
> After:
> 
> Active(file):     180420 kB
> Inactive(file):    93280 kB
> Mapped:          2802812 kB

I wanted to see what the values are in a real s2disk. So I changed int hibernate_preallocate_memory(void) a bit. Among other things I placed a shrink_all_memory() at the beginning of it, before saveable etc. are calculated.

What I get is like this:

PM: Preallocating image memory... 
nr_mapped = 711877 pages, 2780 MB
reclaimable_anon_slab = 428366 pages, 1673 MB
reclaim active_inactive(file) = 370812 pages, 1448 MB	/* shrink_all_memory() */
nr_reclaimed = 336461 pages, 1314 MB
active_inactive(file) = 92412 pages, 360 MB
reclaimable_anon_slab = 361144 pages, 1410 MB
save_highmem = 0 pages, 0 MB
saveable = 1419169 pages, 5543 MB
...

active_inactive(file) *after* shrink_all_memory() was just 360 MB, which means that only 25 % of it were *not* reclaimed. With respect to nr_mapped the ratio is 360/2780 which is just 13 %. So in this case only 13 % of nr_mapped should have been subtracted in minimum_image_size() but 100 % were. And that's too much.

When nr_mapped is in a normal range, the ratio of active_inactive(file)/nr_mapped after shrink_all_memory() is closer to 1, like here:

PM: Preallocating image memory... 
nr_mapped = 128712 pages, 502 MB
reclaimable_anon_slab = 366529 pages, 1431 MB
reclaim active_inactive(file) = 184209 pages, 719 MB
nr_reclaimed = 95499 pages, 373 MB
active_inactive(file) = 96187 pages, 375 MB
reclaimable_anon_slab = 349050 pages, 1363 MB
...

375/502 = 0.75. In such cases it seems acceptable if nr_mapped is fully subtracted.

But I think it's not possible to determine whether nr_mapped is in a "normal" range or too high. So a nr_mapped-free way to calculate minimum_image_size() would be preferable, IMO.

As an alternative to just excluding nr_mapped, I'm testing this now:

- shrink_all_memory("value of active_inactive(file)") at the beginning of hibernate_preallocate_memory()

- what remains of active_inactive(file) cannot be reclaimed and thus is excluded from minimum_image_size(), along with nr_mapped.

minimum_image_size() looks like this:

static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size;
	
	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON);

	return saveable - size;
}

(I guess it's not possible that saveable can be <= size, so I simplified the return-statement.)

So far it's OK.
Comment 33 Chen Yu 2016-02-09 04:52:43 UTC
(In reply to Jay from comment #31)
> Could it be that only part of nr_mapped is in Active(file)/Inactive(file),
> say 10 to 30 % or so?
> 
Indeed. Some mmapped file pages cannot be reclaimed, so they are not counted in Active/Inactive(file). For example, there is also an 'Unevictable' LRU besides the Active and Inactive LRUs, and some mmapped pages might be moved from the Active/Inactive(file) LRUs to the 'Unevictable' LRU. I also wrote to some people on the mm list for help, and they confirmed:
"There are also unreclaimable pages which wouldn't be accounted for in NR_ACTIVE_FILE/NR_INACTIVE_FILE.  I can see those getting accounted for in NR_FILE_MAPPED, though."
And he also suspects there is a bug in the VirtualBox binary driver that increases the number of nr_mapped.
Anyway, we should deal with the 'imprecise' nr_mapped.
Comment 34 Chen Yu 2016-02-09 05:35:02 UTC
(In reply to Jay from comment #32)
> (In reply to Jay from comment #31)
> 
> As an alternative to just excluding nr_mapped, I'm testing this now:
> 
> - shrink_all_memory("value of active_inactive(file") at the beginning of int
> hibernate_preallocate_memory()
> 
> - what remains of active_inactive(file) cannot be reclaimed and thus is
> excluded from minimum_image_size(), along with nr_mapped.
> 
This breaks the original intent of minimum_image_size:
minimum_image_size was introduced because the user doesn't want too many pages to be reclaimed. If too much page cache is reclaimed, it all has to be re-read after the system resumes, and resuming from hibernation then takes too long; see the code comment for minimum_image_size.
Comment 35 Rainer Fiebig 2016-02-09 22:11:29 UTC
(In reply to Chen Yu from comment #33)
> (In reply to Jay from comment #31)
> > Could it be that only part of nr_mapped is in Active(file)/Inactive(file),
> > say 10 to 30 % or so?
> > 
> Indeed.  some mmap file can not be reclaimed, so it is not counted in
> Active/Inactive(file), for example, there is also a 'Unevictable' lru besides
> Active LRU and InActive LRU, some mmap pages might be moved from
> Active/Inactive(file) lru to  'Unevictable' lru. I also wrote for help to 
> some people on mm list, they confirmed that:
> "There are also unreclaimable pages which wouldn't be accounted for in
> NR_ACTIVE_FILE/NR_INACTIVE_FILE.  I can see those getting accounted for in
> NR_FILE_MAPPED, though."
> And he also suspect there is a bug in VirtualBox binary driver that increse
> the number of nr_mapped.
> Anyway, we should deal with the 'imprecise' nr_mapped.

A "theoretical" shrink_all_memory(in_active(file)) would be nice, so that we could use the return-value as the "true" nr_mapped.

Or is there an alternative to nr_mapped? A (perhaps derived) value that can not be confused by VirtualBox?
Comment 36 Rainer Fiebig 2016-02-09 22:13:22 UTC
(In reply to Chen Yu from comment #34)
> (In reply to Jay from comment #32)
> > (In reply to Jay from comment #31)
> > 
> > As an alternative to just excluding nr_mapped, I'm testing this now:
> > 
> > - shrink_all_memory("value of active_inactive(file") at the beginning of
> int
> > hibernate_preallocate_memory()
> > 
> > - what remains of active_inactive(file) cannot be reclaimed and thus is
> > excluded from minimum_image_size(), along with nr_mapped.
> > 
> This breaks the orginal context of minimum_image_size:
> the reason why minimum_image_size is introduced is because user doesn't want
> to reclaim too much pages, if too many page cache are relaimed , after the
> system resumes, all the page cache will be re-read in, and it cost too much
> time to resume from hibernation, you can check the code comment for
> minimum_image_size.

Maybe the resumes take a bit longer, hard to say. No problem here but may depend on hardware. On the other hand the s2disk-process seems to be a bit faster. So, on balance...

But my primary intention was to find out about the "true" value of those bloated nr_mapped.
Comment 37 Rainer Fiebig 2016-02-11 19:56:04 UTC
... 
> But I think it's not possible to determine whether nr_mapped is in a
> "normal" range or too high. 
...

I may have to recant. 

Surprisingly, (falsely) high nr_mapped-values may be detected through the ratio of reclaimable anon_slab to nr_mapped. 

Values < 2 indicate unusually high nr_mapped and values < 1 signal that we may be close to failure of s2disk (see table at bottom).

This might be used for (yet) another workaround:

/* exclude nr_mapped if unusually and probably falsely high */
static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size, nr_mapped;
	
	nr_mapped = global_page_state(NR_FILE_MAPPED);

	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON);
	
	if (size / nr_mapped < 1)
		nr_mapped = 0;
		
	size += global_page_state(NR_ACTIVE_FILE)
		+ global_page_state(NR_INACTIVE_FILE)
		- nr_mapped;
	
	return saveable <= size ? 0 : saveable - size;
}

Only very high nr_mapped values are excluded; otherwise it operates like the original.



*************************************************

Normal nr_mapped:

PM: Preallocating image memory... 
nr_mapped = 185056 pages, 722 MB
reclaimable anon_slab = 454501 pages, 1775 MB
reclaimable active_inactive(file) = 720303 pages, 2813 MB
save_highmem = 0 pages, 0 MB
saveable = 1859611 pages, 7264 MB
...
minimum_pages, condit. excluding nr_mapped = 869863 pages, 3397 MB
minimum_pages, orig. version = 869863 pages, 3397 MB
minimum_pages, excl. nr_mapped = 684807 pages, 2675 MB
...

High nr_mapped:

PM: Preallocating image memory... 
nr_mapped = 851944 pages, 3327 MB
reclaimable anon_slab = 370605 pages, 1447 MB
reclaimable active_inactive(file) = 651405 pages, 2544 MB
save_highmem = 0 pages, 0 MB
saveable = 1903696 pages, 7436 MB
...
minimum_pages, condit. excluding nr_mapped = 881686 pages, 3444 MB
minimum_pages, orig. version = 1733630 pages, 6771 MB	/* Orig. fails */
minimum_pages, excl. nr_mapped = 881686 pages, 3444 MB
...

High nr_mapped and high anon_slab (gitk):

PM: Preallocating image memory... 
nr_mapped = 654991 pages, 2558 MB
reclaimable anon_slab = 1272056 pages, 4968 MB
reclaimable active_inactive(file) = 799541 pages, 3123 MB
save_highmem = 0 pages, 0 MB
saveable = 2758432 pages, 10775 MB
...
minimum_pages, condit. excluding nr_mapped = 1341913 pages, 5241 MB
minimum_pages, orig. version = 1341913 pages, 5241 MB
minimum_pages, excl. nr_mapped = 686906 pages, 2683 MB
...

*************************************************
A: nr_mapped; B: reclaimable_anon_slab;	ratio B/A   (A and B in MB; "." is a thousands separator, "," a decimal separator)

A	B	ratio
506	5.082	10,0
573	4.823	8,4
687	2.262	3,3
613	1.919	3,1
589	1.832	3,1
482	1.488	3,1
582	1.759	3,0
597	1.745	2,9
614	1.762	2,9
502	1.431	2,9
634	1.747	2,8
664	1.803	2,7
606	1.616	2,7
617	1.561	2,5
693	1.710	2,5
565	1.315	2,3
725	1.669	2,3
615	1.408	2,3
592	1.296	2,2
1.505	1.939	1,3
2.535	3.219	1,3
1.646	1.917	1,2
1.726	1.980	1,1
1.628	1.830	1,1
1.570	1.712	1,1
1.563	1.612	1,0
1.176	1.197	1,0
2.758	1.712	0,6
2.780	1.673	0,6
2.659	1.472	0,6
2.625	1.286	0,5
2.660	1.189	0,4
3.482	1.167	0,3
Comment 38 Rainer Fiebig 2016-02-12 10:03:32 UTC
(In reply to Jay from comment #37)
 
> /* exclude nr_mapped if unusually and probably falsely high */
> static unsigned long minimum_image_size(unsigned long saveable)
> {
>       unsigned long size, nr_mapped;
>       
>       nr_mapped = global_page_state(NR_FILE_MAPPED);
> 
>       size = global_page_state(NR_SLAB_RECLAIMABLE)
>               + global_page_state(NR_ACTIVE_ANON)
>               + global_page_state(NR_INACTIVE_ANON);
>       
>       if (size / nr_mapped < 1)
>               nr_mapped = 0;
>               
>       size += global_page_state(NR_ACTIVE_FILE)
>               + global_page_state(NR_INACTIVE_FILE)
>               - nr_mapped;
>       
>       return saveable <= size ? 0 : saveable - size;
> }
> 

At times...

This

	if (size / nr_mapped < 1)
		nr_mapped = 0;

would have been better written as follows (for nonzero nr_mapped the integer division gives the same result, but it divides by zero if nr_mapped happens to be 0):

	if (size < nr_mapped)
		nr_mapped = 0;
Comment 39 Rainer Fiebig 2016-02-15 11:12:25 UTC
(In reply to Chen Yu from comment #33)

> And he also suspect there is a bug in VirtualBox binary driver that increse
> the number of nr_mapped.

Perhaps another indication for that is the discrepancy between what the manual states about a performance-gain through "Large pages: on" and what I see here. The manual says:

"[...]If nested paging is enabled, the VirtualBox hypervisor can also use large pages to reduce
TLB usage and overhead. This can yield a performance improvement of up to 5%.[...]"

But what I see here is the opposite: The VMs are a tad quicker with large pages "off". 

So perhaps they simply mixed up "on" with "off" somewhere in their code.
Comment 40 Chen Yu 2016-02-18 02:48:40 UTC
(In reply to Jay from comment #39)
> (In reply to Chen Yu from comment #33)
> 
> > And he also suspect there is a bug in VirtualBox binary driver that increse
> > the number of nr_mapped.
> 
> Perhaps another indication for that is the discrepancy between what the
> manual states about a performance-gain through "Large pages: on" and what I
> see here. The manual says:
> 
> "[...]If nested paging is enabled, the VirtualBox hypervisor can also use
> large pages to reduce
> TLB usage and overhead. This can yield a performance improvement of up to
> 5%.[...]"
> 
> But what I see here is the opposite: The VMs are a tad quicker with large
> pages "off". 
> 
> So perhaps they simply mixed up "on" with "off" somewhere in their code.

'Nested paging' is a hardware feature that reduces the page-table work needed for a VM guest; if it is off, the Linux host has to use more page tables to translate guest physical addresses to actual physical addresses. So nr_mapped might be related to the host's page tables, but I don't know the exact place.
Comment 41 Chen Yu 2016-02-18 02:54:12 UTC
(In reply to Jay from comment #38)
> (In reply to Jay from comment #37)
>  
> > /* exclude nr_mapped if unusually and probably falsely high */
> > static unsigned long minimum_image_size(unsigned long saveable)
> > {
> >     unsigned long size, nr_mapped;
> >     
> >     nr_mapped = global_page_state(NR_FILE_MAPPED);
> > 
> >     size = global_page_state(NR_SLAB_RECLAIMABLE)
> >             + global_page_state(NR_ACTIVE_ANON)
> >             + global_page_state(NR_INACTIVE_ANON);
> >     
> >     if (size / nr_mapped < 1)
> >             nr_mapped = 0;
> >             
> >     size += global_page_state(NR_ACTIVE_FILE)
> >             + global_page_state(NR_INACTIVE_FILE)
> >             - nr_mapped;
> >     
> >     return saveable <= size ? 0 : saveable - size;
> > }
> > 
> 
> At times...
> 
> This
> 
>       if (size / nr_mapped < 1)
>               nr_mapped = 0;
> 
> should better have been this:
> 
>       if (size < nr_mapped)
>               nr_mapped = 0;

In order to sell this solution, we need to convince the maintainers why we did it like this - is it an empirical formula or is it based on concrete evidence? In most cases the latter is more persuasive :)

BTW, do you know whether the problem can also be reproduced with kvm?
Comment 42 Rainer Fiebig 2016-02-18 21:28:24 UTC
...
> 
> In order to sell this solution out, we need to convince maintainers why we
> did like this, is it a experience formula or based on a certain evidence, in
> most cases the latter would be more persuasive :)
> 

Experience and evidence, I think. That excluding nr_mapped works is a proven fact for me now. And that size < nr_mapped may be a good way to catch the cases of problematically high nr_mapped is based on the data I've analyzed so far (see the table in Comment 37).

What I can say is that under normal circumstances (i.e. no bloated nr_mapped due to a VB-VM) the sum of

   global_page_state(NR_SLAB_RECLAIMABLE)
   + global_page_state(NR_ACTIVE_ANON)
   + global_page_state(NR_INACTIVE_ANON)
	
is never < nr_mapped on my system. And I suppose this is also the case on other systems. But I do not know WHY it is so. I'm just observing and using it. Providing the explanation ("Normally, the sum of... is > nr_mapped because...") would be your part. ;)

But before submitting the idea we should try to make sure that we're not just dealing with a bug in VirtualBox. I would have already filed a bug report but for that they want your name, address, phone-no. etc. And I don't like that.

But perhaps you could give your colleague at Oracle a call and point him to this thread? Surely they must be interested in an opportunity to improve their product - as well as good relations with Intel? ;)


> BTW, do you know if using kvm could also reproduce this problem?

I haven't used kvm so far. But perhaps Aaron Lu has. He's on the sunny side of the street because he uses a virt.-app that makes use of anon-pages instead of nr_mapped (see https://bugzilla.kernel.org/show_bug.cgi?id=47931#c56).
Comment 43 Aaron Lu 2016-02-19 03:32:03 UTC
Right, qemu makes use of anonymous pages instead of mapped.
Comment 44 Chen Yu 2016-02-21 15:52:37 UTC
(In reply to Jay from comment #42)
> due to a VB-VM) the sum of
> 
>    global_page_state(NR_SLAB_RECLAIMABLE)
>    + global_page_state(NR_ACTIVE_ANON)
>    + global_page_state(NR_INACTIVE_ANON)
>       
> is never < nr_mapped on my system. And I suppose this is also the case on
> other systems. 
Unfortunately I have to say that ANON+SLAB does not have any relationship with NR_MAPPED in the source code... they are just two separate kinds of pages.
> But perhaps you could give your colleague at Oracle a call and point him to
> this thread? Surely they must be interested in an opportunity to improve
> their product - as well as good relations with Intel? ;)
I don't have a contact at Oracle, but I'll try VirtualBox on my side - hopefully it is not too hard.
Comment 45 Chen Yu 2016-02-21 15:55:22 UTC
(In reply to Aaron Lu from comment #43)
> Right, qemu makes use of anonymous pages instead of mapped.
Hi Aaron, I remember Jay mentioned that he is using VirtualBox's standard format for the VM's virtual hard disk - can kvm simulate a hard disk too?
Comment 46 Aaron Lu 2016-02-22 01:57:39 UTC
(In reply to Chen Yu from comment #45)
> (In reply to Aaron Lu from comment #43)
> > Right, qemu makes use of anonymous pages instead of mapped.
> Hi Aaron, I remember Jay has mentioned that he is using a VirtualBox's
> standard-format for the VM's virtual harddisk, can kvm simulate a harddisk
> too?

Of course, qemu is capable of using multiple formats of hard disks and I have been using qcow2.
Comment 47 Rainer Fiebig 2016-02-22 21:34:23 UTC
(In reply to Chen Yu from comment #44)
> (In reply to Jay from comment #42)
> > due to a VB-VM) the sum of
> > 
> >    global_page_state(NR_SLAB_RECLAIMABLE)
> >    + global_page_state(NR_ACTIVE_ANON)
> >    + global_page_state(NR_INACTIVE_ANON)
> >     
> > is never < nr_mapped on my system. And I suppose this is also the case on
> > other systems. 

> Unfortunately I have to say, ANON+SLAB does not have any relationship with
> NR_MAPPED in the source code... they are just two seperated kinds of pages.

Tough luck, then. But it's still the best indicator I have found. So I'll continue to use it.

> > But perhaps you could give your colleague at Oracle a call and point him to
> > this thread? Surely they must be interested in an opportunity to improve
> > their product - as well as good relations with Intel? ;)

> I don't have a stool pigeon in Oracle, but I'll try VirtualBox on my side,
> hope it is not too hard..

It's easy. Once you have created a VM, you can use the following commands to check and set "large pages: on/off":

~> VBoxManage showvminfo "name of VM" --details

~> VBoxManage modifyvm "name of VM" --largepages on
Comment 48 Chen Yu 2016-02-24 07:45:00 UTC
OK, I can reproduce it on VirtualBox now, interesting...
Comment 49 Rainer Fiebig 2016-03-06 15:15:20 UTC
(In reply to Chen Yu from comment #48)
> ok, I can reproduced it on virtualbox now, interesting..

Do you think it's a bug or just a quirk?
		
I've posted a few words about possible problems with s2disk and VirtualBox and the fix through "large pages: on" in two openSuse-forums. That info can be found by googling and should help people with Intel-CPUs.

However, "large pages: on" won't work for AMD-CPUs (acc. to the VB-manual). But are they affected by high nr_mapped at all?

And if they are - and if high nr_mapped is just a quirk of VirtualBox - would that warrant a change in the kernel-code?

In other words: how shall we proceed?
Comment 50 Chen Yu 2016-03-06 16:14:17 UTC
(In reply to Jay from comment #49)
> (In reply to Chen Yu from comment #48)
> > ok, I can reproduced it on virtualbox now, interesting..
> 
> Do you think it's a bug or just a quirk?
>               
> I've posted a few words about possible problems with s2disk and VirtualBox
> and the fix through "large pages: on" in two openSuse-forums. That info can
> be found by googling and should help people with Intel-CPUs.
> 
> However, "large pages: on" won't work for AMD-CPUs (acc. to the VB-manual).
> But are they affected by high nr_mapped at all?
> 
> And if they are - and if high nr_mapped is just a quirk of VirtualBox -
> would that warrant a change in the kernel-code?
> 
> In other words: how shall we proceed?

I've looked through some of the VirtualBox driver code, but unfortunately I haven't found any piece of code that would manipulate the nr_mapped counter, so I suspect there is some black magic inside Linux. I don't have much bandwidth for this at the moment, but I'll return to this thread later.
Comment 51 Rainer Fiebig 2016-03-31 11:20:21 UTC
(In reply to Chen Yu from comment #50)
snip
> 
> I've looked through some of the virtualbox driver code, but unfortunately I
> haven't found any piece of code would manipulate the counter of nr_mapped,
> so I suspect there should be some black magic inside the linux, although I
> do not have much bandwidth on this recently, I'll return to this thread
> later.

Maybe this is obvious to you - but just to mention it:

A while ago I noticed that "large pages: on" reduces the amount of memory added to "mapped" by a factor of 8.

So a 2048-MB-VM adds only around 256 MB to mapped with "large pages:on" 
but 2048 MB with "large pages:off".

Perhaps "on" sets the page size to 4096 bytes and "off" to 512.
Comment 52 Rainer Fiebig 2016-04-09 13:28:31 UTC
Maybe I have found a way to get rid of that nr_mapped-problem.

By reclaiming the cache at the beginning of the second part of hibernate_preallocate_memory() - which is entered only in high-mem-load cases - we can exclude the cache-part from minimum_image_size().

In low- to medium-load cases (probably the majority) everything stays as is. In high-load cases a shrink_all_memory() is done anyway in the original function - so no major difference. Only that now it is done earlier and in a slightly modified form.

It looks like this:

...
	if (size >= saveable) {
		pages = preallocate_image_highmem(save_highmem);
		pages += preallocate_image_memory(saveable - pages, avail_normal);
		goto out;
	}
	
	/* reclaim at least the cache and adjust saveable */
	to_reclaim = max(reclaimable_file, saveable - size);
	saveable -= shrink_all_memory(to_reclaim);
	
	/* Estimate the minimum size of the image. */
	pages = minimum_image_size(saveable);
...

And minimum_image_size() is now free of nr_mapped:

static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size;

	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON);
	
	return saveable <= size ? 0 : saveable - size;
}

Needs further testing but results so far look good. s2disks and resumes in higher-load cases are at least as fast as with the orig.

For example a high-load case (2-GB-VM + gitk + others):

PM: Preallocating image memory... 
nr_mapped = 631808 pages, 2468 MB
active_inactive(file) = 774156 pages, 3024 MB
reclaimable_anon_slab = 1150294 pages, 4493 MB
save_highmem = 0 pages, 0 MB
saveable = 2610116 pages, 10195 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061450 pages, 11958 MB
count = 3023247 pages, 11809 MB
max_size = 1510489 pages, 5900 MB
user_specified_image_size = 1349734 pages, 5272 MB
adjusted_image_size = 1349735 pages, 5272 MB

to_reclaim = 1260381 pages, 4923 MB
nr_reclaimed = 604510 pages, 2361 MB
saveable = 2005606 pages, 7834 MB
active_inactive(file) = 206154 pages, 805 MB
reclaimable_anon_slab = 1108223 pages, 4328 MB
minimum_pages = 897383 pages, 3505 MB
target_image_size = 1349735 pages, 5272 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512758 pages, 5909 MB
to_alloc_adjusted = 1512758 pages, 5909 MB
pages_allocated = 1512758 pages, 5909 MB
done (allocated 1673512 pages)
PM: Allocated 6694048 kbytes in 10.08 seconds (664.09 MB/s)


Normal-load:

PM: Preallocating image memory... 
nr_mapped = 627429 pages, 2450 MB
active_inactive(file) = 379814 pages, 1483 MB
reclaimable_anon_slab = 185542 pages, 724 MB
save_highmem = 0 pages, 0 MB
saveable = 1247975 pages, 4874 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061315 pages, 11958 MB
count = 3023112 pages, 11809 MB
max_size = 1510422 pages, 5900 MB
user_specified_image_size = 1349734 pages, 5272 MB
adjusted_image_size = 1349735 pages, 5272 MB
done (allocated 1247975 pages)
PM: Allocated 4991900 kbytes in 0.35 seconds (14262.57 MB/s)
Comment 53 Rainer Fiebig 2016-05-19 16:35:51 UTC
The approach outlined above works. I have modified it a bit, mainly because shrink_all_memory() does not always deliver at the first call what it is asked for. It is also not overly precise; deviations of +/- 30 % or more from the target are not unusual.

Omitting the debug-statements, it now looks like this:

...
	if (saveable > size) 
		saveable -= shrink_all_memory(saveable - size);
	
	/*
	 * If the desired number of image pages is at least as large as the
	 * current number of saveable pages in memory, allocate page frames for
	 * the image and we're done.
	 */
	if (saveable <= size) {
		pages = preallocate_image_highmem(save_highmem);
		pages += preallocate_image_memory(saveable - pages, avail_normal);
		goto out;
	}
	
	/* reclaim at least the cache and adjust saveable */
	reclaimable_file = sum_in_active_file();
	to_reclaim = max(reclaimable_file, saveable - size);
	saveable -= shrink_all_memory(to_reclaim);
	
	/* Estimate the minimum size of the image. */
	pages = minimum_image_size(saveable);
...

This works reliably and irrespective of nr_mapped-values. It also seems consistent to me. So far I haven't seen negative effects like prolonged resumes or so.

So I finally consider the problem solved for me.

As I have also published how to avoid high nr_mapped values for VirtualBox VMs in the first place, other users can find a simple solution too.

But perhaps I'm the only one anyway who uses s2disk while running VirtualBox-VMs. ;)

So long!

Jay


Examples

Baseline after reboot:

PM: Preallocating image memory... 
nr_mapped = 109655 pages, 428 MB
active_inactive(file) = 193346 pages, 755 MB
reclaimable_anon_slab = 379733 pages, 1483 MB
save_highmem = 0 pages, 0 MB
saveable = 688159 pages, 2688 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061318 pages, 11958 MB
count = 3023115 pages, 11809 MB
max_size = 1510423 pages, 5900 MB
user_specified_image_size = 1349734 pages, 5272 MB
adjusted_image_size = 1349735 pages, 5272 MB
done (allocated 688159 pages)
PM: Allocated 2752636 kbytes in 0.26 seconds (10587.06 MB/s)
************************

Medium load, high nr_mapped:

PM: Preallocating image memory... 
nr_mapped = 652131 pages, 2547 MB
active_inactive(file) = 526683 pages, 2057 MB
reclaimable_anon_slab = 403426 pages, 1575 MB
save_highmem = 0 pages, 0 MB
saveable = 1615454 pages, 6310 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061294 pages, 11958 MB
count = 3023091 pages, 11808 MB
max_size = 1510411 pages, 5900 MB
user_specified_image_size = 1349734 pages, 5272 MB
adjusted_image_size = 1349735 pages, 5272 MB
saveable = 1209334 pages, 4723 MB
done (allocated 1209334 pages)
PM: Allocated 4837336 kbytes in 1.05 seconds (4606.98 MB/s)
...
PM: Need to copy 1202273 pages
PM: Hibernation image created (1202273 pages copied)
***********************

Medium load, normal nr_mapped:

PM: Preallocating image memory... 
nr_mapped = 137698 pages, 537 MB
active_inactive(file) = 689275 pages, 2692 MB
reclaimable_anon_slab = 379854 pages, 1483 MB
save_highmem = 0 pages, 0 MB
saveable = 1752291 pages, 6844 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061418 pages, 11958 MB
count = 3023215 pages, 11809 MB
max_size = 1510473 pages, 5900 MB
user_specified_image_size = 1349734 pages, 5272 MB
adjusted_image_size = 1349735 pages, 5272 MB
saveable = 1347312 pages, 5262 MB
done (allocated 1347312 pages)
PM: Allocated 5389248 kbytes in 0.61 seconds (8834.83 MB/s)
...
PM: Need to copy 1341278 pages
************************

High load, high nr_mapped:

PM: Preallocating image memory... 
nr_mapped = 843359 pages, 3294 MB
active_inactive(file) = 695509 pages, 2716 MB
reclaimable_anon_slab = 416732 pages, 1627 MB
save_highmem = 0 pages, 0 MB
saveable = 1992456 pages, 7783 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061320 pages, 11958 MB
count = 3023117 pages, 11809 MB
max_size = 1510424 pages, 5900 MB
user_specified_image_size = 1349734 pages, 5272 MB
adjusted_image_size = 1349735 pages, 5272 MB
saveable = 1358617 pages, 5307 MB
to_reclaim = 100962 pages, 394 MB
nr_reclaimed = 107933 pages, 421 MB
saveable = 1250684 pages, 4885 MB
active_inactive(file) = 47934 pages, 187 MB
reclaimable_anon_slab = 313496 pages, 1224 MB
minimum_pages = 937188 pages, 3660 MB
target_image_size = 1349735 pages, 5272 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512693 pages, 5908 MB
to_alloc_adjusted = 1512693 pages, 5908 MB
pages_allocated = 1512693 pages, 5908 MB
done (allocated 1673382 pages)
PM: Allocated 6693528 kbytes in 2.21 seconds (3028.74 MB/s)
...
PM: Need to copy 1246806 pages
PM: Hibernation image created (1246806 pages copied)
*********************

High load, normal nr_mapped:

PM: Preallocating image memory... 
nr_mapped = 128972 pages, 503 MB
active_inactive(file) = 1018633 pages, 3979 MB
reclaimable_anon_slab = 443409 pages, 1732 MB
save_highmem = 0 pages, 0 MB
saveable = 2147765 pages, 8389 MB
highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061345 pages, 11958 MB
count = 3023142 pages, 11809 MB
max_size = 1510437 pages, 5900 MB
user_specified_image_size = 1349734 pages, 5272 MB
adjusted_image_size = 1349735 pages, 5272 MB
saveable = 1364570 pages, 5330 MB
to_reclaim = 271074 pages, 1058 MB
nr_reclaimed = 225778 pages, 881 MB
saveable = 1138792 pages, 4448 MB
active_inactive(file) = 78415 pages, 306 MB
reclaimable_anon_slab = 362997 pages, 1417 MB
minimum_pages = 775795 pages, 3030 MB
target_image_size = 1349735 pages, 5272 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512705 pages, 5909 MB
to_alloc_adjusted = 1512705 pages, 5909 MB
pages_allocated = 1512705 pages, 5909 MB
done (allocated 1673407 pages)
PM: Allocated 6693628 kbytes in 1.71 seconds (3914.40 MB/s)
...
PM: Need to copy 1127012 pages
PM: Hibernation image created (1127012 pages copied)
**********************
Comment 54 Chen Yu 2016-05-20 01:50:00 UTC
(In reply to Jay from comment #53)
> The approach outlined above works. I have modified it a bit, mainly because
> shrink_all_memory() does not always deliver at the first call what it is
> asked for. It is also not overly precise, +/- 30% or more off target are not
> unusual.
> 
> Omitting the debug-statements, it now looks like this:
> 
> ...
>       if (saveable > size) 
>               saveable -= shrink_all_memory(saveable - size);
>       
>       /*
>        * If the desired number of image pages is at least as large as the
>        * current number of saveable pages in memory, allocate page frames for
>        * the image and we're done.
>        */
>       if (saveable <= size) {
>               pages = preallocate_image_highmem(save_highmem);
>               pages += preallocate_image_memory(saveable - pages,
> avail_normal);
>               goto out;
>       }
>       
>       /* reclaim at least the cache and adjust saveable */
>       reclaimable_file = sum_in_active_file();
>       to_reclaim = max(reclaimable_file, saveable - size);
>       saveable -= shrink_all_memory(to_reclaim);
>       
>       /* Estimate the minimum size of the image. */
>       pages = minimum_image_size(saveable);
> ...
> 
> This works reliably and irrespective of nr_mapped-values. It also seems
> consistent to me. So far I haven't seen negative effects like prolonged
> resumes or so.
> 
> So I finally consider the problem solved for me.
> 
> As I have also published the way how to avoid high nr_mapped-values for
> VirtualBox-VMs at the outset, other users can find a simple solution too.
> 
> But perhaps I'm the only one anyway who uses s2disk while running
> VirtualBox-VMs. ;)
> 
> So long!
> 
Thanks for your continuous effort. I'm back, and while you are doing the testing I will try to figure out where nr_mapped exactly comes from.
Comment 55 Rainer Fiebig 2016-05-23 16:34:05 UTC
(In reply to Chen Yu from comment #54)
> (In reply to Jay from comment #53)

snip

> Thanks for your continuous effort, I'm back and as you are doing testing I
> will try to figure out where the nr_map exactly comes from.

That's probably not going to be easy, I think. Good luck, then!

I don't know if this offers you a clue, but 2 or 3 times after starting a second VM I have noticed that nr_mapped was higher than expected, even though both VMs were set to "large pages: on". But I cannot reproduce this, and there wasn't anything obviously special about the situation where it happened - perhaps (!) relatively high memory load.
Comment 56 Rainer Fiebig 2016-05-28 11:10:22 UTC
As I understand it, the idea behind subtracting nr_mapped from the file cache in minimum_image_size() is to weed out the overlap between the two (also see Comment 5), which implies that nr_mapped should not be greater than the file cache.

But take a look at the following numbers. They were obtained immediately after a "sync && sysctl vm.drop_caches=1". What's left of the cache now should be, at most, that overlap.

In the first case only the desktop-environment, 2 terminal-windows and an editor were loaded. Even there nr_mapped was greater than the file cache and subtracting it from the latter would have given a wrong result. More so in the second case.

So even with "normal" nr_mapped-values, using nr_mapped in the calculation of minimum_image_size() may produce wrong results.


PM: Preallocating image memory... 
nr_mapped = 71202 pages, 278 MB
active_inactive(file) = 47595 pages, 185 MB
reclaimable_anon_slab = 223598 pages, 873 MB
save_highmem = 0 pages, 0 MB
saveable = 381520 pages, 1490 MB
...

PM: Preallocating image memory... 
nr_mapped = 208641 pages, 815 MB
active_inactive(file) = 80216 pages, 313 MB
reclaimable_anon_slab = 446329 pages, 1743 MB
save_highmem = 0 pages, 0 MB
saveable = 1689922 pages, 6601 MB
...
Comment 57 Rainer Fiebig 2016-06-03 16:53:07 UTC
(In reply to Jay from comment #56)
> As I understand it, the idea behind subtracting nr_mapped from the file
> cache in minimum_image_size() is to weed out the common set of the two (also
> see Comment 5). Which implies that nr_mapped should not be greater than the
> file cache.
> 
> But take a look at the following numbers. They were obtained immediately
> after a "sync && sysctl vm.drop_caches=1". What's left of the cache now,
> should be the common set (maximally).
> 
> In the first case only the desktop-environment, 2 terminal-windows and an
> editor were loaded. Even there nr_mapped was greater than the file cache and
> subtracting it from the latter would have given a wrong result. More so in
> the second case.
> 
> So even with "normal" nr_mapped-values, using nr_mapped in the calculation
> of minimum_image_size() may produce wrong results.
> 
...


It was reproducible in KDE, xfce and after logging in to just an X-terminal-session: after drop_caches, nr_mapped was always greater than the filecache (47668 kB vs. 29364 kB in the last case).

That means that even in a basic environment nr_mapped can be greater than the true number of mapped filecache pages.

I think this speaks against subtracting nr_mapped from the filecache as a means to get the number of reclaimable pages of the cache (s. Comment 5).

In most cases the resulting error will be relatively small and veiled by the fact that filecache + anon pages is usually much greater than nr_mapped. But the more nr_mapped deviates from the true number of mapped filecache pages, the greater the error - up to unnecessary failure.

If it is not possible (or desirable) to calculate NR_FILE_MAPPED in a way that it truly represents the number of mapped filecache pages, I basically see three alternatives:

1) Leave everything as is
	accept a certain inaccuracy 
	accept possible failure in case of high nr_mapped-values

2) Exclude nr_mapped from the calculation of minimum_image_size()
	also accept a certain inaccuracy
	no failure due to high nr_mapped
	as simple as it gets
	thoroughly tested (by me)

3) Remove "cache - nr_mapped" from minimum_image_size(), let the system reclaim the cache instead
	probably the most accurate results
	no failure due to high nr_mapped
	rather simple
	no failure or negatives so far
Comment 58 Chen Yu 2016-06-07 12:22:08 UTC
(In reply to Jay from comment #57)
> It was reproducible in KDE, xfce and after logging in to just an
> X-terminal-session: after drop_caches, nr_mapped was always greater than the
> filecache (47668 kB vs. 29364 kB in the last case).
> 
> That means that even in a basic environment nr_mapped can be greater than
> the true number of mapped filecache pages.
> 
> I think this speaks against subtracting nr_mapped from the filecache as a
> means to get the number of reclaimable pages of the cache (s. Comment 5).
> 
So you mean this problem can be reproduced without VirtualBox? Then I'd rather treat this as a pure hibernation bug. But is there an easy way to reproduce it - could you attach your meminfo from before/after drop_caches? And did you echo 3 to drop_caches to reproduce it?
Anyway, my main concern is the root cause - what nr_mapped really means to Linux. Once that is figured out, I'd recommend you propose a patch to fix it in the code. What do you think?
Comment 59 Rainer Fiebig 2016-06-07 20:25:35 UTC
(In reply to Chen Yu from comment #58)
> (In reply to Jay from comment #57)
> > It was reproducible in KDE, xfce and after logging in to just an
> > X-terminal-session: after drop_caches, nr_mapped was always greater than
> the
> > filecache (47668 kB vs. 29364 kB in the last case).
> > 
> > That means that even in a basic environment nr_mapped can be greater than
> > the true number of mapped filecache pages.
> > 
> > I think this speaks against subtracting nr_mapped from the filecache as a
> > means to get the number of reclaimable pages of the cache (s. Comment 5).
> > 
> So you mean this problem can be reproduced without virtualbox? Then I'd
> rather treat this as a pure hibernation bug. But is there any easy way to
> reproduce, could you attach your meminfo before/after drop_cache? And did
> you echo 3 to drop_cache to reproduce it?

I have booted the kernel without all VirtualBox-modules and then logged in to a "failsafe" session (just an X-terminal) and found this before and after "sync && sysctl vm.drop_caches=3":

			before		after
active(file):		107376		14048 kB
inactive(file):		252440		17868 kB
Mapped:			47488		47460 kB

Even without anything of VirtualBox running, nr_mapped was greater than the cache after drop_caches. So I think that the problem may indeed not be VirtualBox-specific. VB may just be an extreme case because the VMs consume so much memory.

Would be interesting to know if you can reproduce it. Just check meminfo before and after 
	sync && sysctl vm.drop_caches=3 
(or 1 or 2) in a terminal. You have to be root for that.
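
If it helps, the same check can also be scripted. Here is a small C sketch (a hypothetical helper, nothing official; it assumes the usual /proc/meminfo labels and has to run as root so that /proc/sys/vm/drop_caches is writable):

/* dropcache-check.c - print Active(file)/Inactive(file)/Mapped before and
 * after "sync; echo 3 > /proc/sys/vm/drop_caches". Run as root.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* read one "<label>   <value> kB" line from /proc/meminfo */
static unsigned long meminfo(const char *label)
{
	char line[128];
	unsigned long kb = 0;
	size_t len = strlen(label);
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, label, len)) {
			sscanf(line + len, "%lu", &kb);
			break;
		}
	}
	fclose(f);
	return kb;
}

static void show(const char *when)
{
	printf("%s: Active(file) %lu kB, Inactive(file) %lu kB, Mapped %lu kB\n",
	       when, meminfo("Active(file):"),
	       meminfo("Inactive(file):"), meminfo("Mapped:"));
}

int main(void)
{
	FILE *f;

	show("before");
	sync();
	f = fopen("/proc/sys/vm/drop_caches", "w");
	if (!f) {
		perror("drop_caches (are you root?)");
		return 1;
	}
	fputs("3", f);
	fclose(f);
	show("after");
	return 0;
}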

> Anyway I concern about the root cause for what
> the nr_mapped means to linux, and once this is figured out, I'd recommend
> you propose a patch to fix it in  the code. What do you think?

I also think that nr_mapped is the root of the matter. Perhaps this is what needs to be fixed. We'll see what to do when you have solved that riddle.



**************

Before and after "sync && sysctl vm.drop_caches=3". KDE, just 2 terminal-windows and Kate running.

Before:
~> cat /proc/meminfo
MemTotal:       11997636 kB
MemFree:        10443668 kB
MemAvailable:   10894300 kB
Buffers:           89060 kB
Cached:           633744 kB
SwapCached:            0 kB
Active:           906172 kB
Inactive:         389528 kB
Active(anon):     573848 kB
Inactive(anon):   139544 kB
Active(file):     332324 kB
Inactive(file):   249984 kB
Unevictable:          32 kB
Mlocked:              32 kB
SwapTotal:      10713084 kB
SwapFree:       10713084 kB
Dirty:               112 kB
Writeback:             0 kB
AnonPages:        572920 kB
Mapped:           228636 kB
Shmem:            140504 kB
Slab:             108416 kB
SReclaimable:      74516 kB
SUnreclaim:        33900 kB
KernelStack:        5792 kB
PageTables:        35900 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    16711900 kB
Committed_AS:    2263376 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      576520 kB
VmallocChunk:   34359133740 kB
HardwareCorrupted:     0 kB
AnonHugePages:     79872 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0                                                               
Hugepagesize:       2048 kB                                                            
DirectMap4k:        7680 kB                                                            
DirectMap2M:    12238848 kB         


After:
~> cat /proc/meminfo                                                 
MemTotal:       11997636 kB                                                            
MemFree:        10903044 kB                                                            
MemAvailable:   10925228 kB                                                            
Buffers:            4480 kB                                                            
Cached:           320012 kB                                                            
SwapCached:            0 kB                                                            
Active:           667800 kB                                                            
Inactive:         226820 kB                                                            
Active(anon):     570508 kB                                                            
Inactive(anon):   141928 kB                                                            
Active(file):      97292 kB                                                            
Inactive(file):    84892 kB                                                            
Unevictable:          32 kB                                                            
Mlocked:              32 kB
SwapTotal:      10713084 kB
SwapFree:       10713084 kB
Dirty:               644 kB
Writeback:             0 kB
AnonPages:        570572 kB
Mapped:           234980 kB
Shmem:            141928 kB
Slab:              51624 kB
SReclaimable:      17868 kB
SUnreclaim:        33756 kB
KernelStack:        5776 kB
PageTables:        35748 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    16711900 kB
Committed_AS:    2257396 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      576520 kB
VmallocChunk:   34359133740 kB
HardwareCorrupted:     0 kB
AnonHugePages:     79872 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7680 kB
DirectMap2M:    12238848 kB
Comment 60 Rainer Fiebig 2016-06-08 20:23:01 UTC
I booted Debian 7 in "recovery mode", which leads to a command-line environment. This was the first time that nr_mapped was actually *not* greater than the cache after drop_caches:

		before		after
Active(file):	12152		2036 kB
Inactive(file):	37220		 776 kB
Mapped:		 2552		2504 kB
Comment 61 Chen Yu 2016-06-19 15:20:44 UTC
After some code walking: part of the mapped pages comes from shmem and ipc/shm, and the rest comes from the ordinary file page cache, i.e. active/inactive file.
So it is possible for mapped to get bigger than active+inactive file. I wonder if there is a potential problem in that mapped contains the number of shmem and ipc pages; I'll take a look.
Comment 62 Rainer Fiebig 2016-06-19 21:51:59 UTC
(In reply to Chen Yu from comment #61)
> After some code walking, some parts of the mapped pages comes from shmem and
> ipc/shm,
> and the rest comes from ordinary file page cache, thus active,inactive file.
> So it is possible that mapped got bigger than active+inactive file, I wonder
> if there is potential problem that mmaped contains the number of shmem and
> ipc, I'll take a look then.

Interesting info, I appreciate your effort!

The comment to NR_FILE_MAPPED in mmzone.h says

	/* pagecache pages mapped into pagetables.
			   only modified from process context */

I guess this was the original concept for nr_mapped and the reason why it is used in minimum_image_size().

According to what you have found out, nr_mapped is no longer reliable in that sense. One problem this causes is that minimum_image_size tends to be too big. And there may be others. So I think this should be fixed, if possible.

Or at least the above-mentioned comment needs to be modified. ;)
Comment 63 Rainer Fiebig 2016-06-20 17:56:49 UTC
(In reply to Chen Yu from comment #61)
> After some code walking, some parts of the mapped pages comes from shmem and
> ipc/shm,

Just to be sure: *all* of shmem is contained in nr_mapped? Or only part of it?
 
And do you know of a way to query the system about the amount of ipc/shm? (shmem or nr_shmem is of course no problem.)

> and the rest comes from ordinary file page cache, thus active,inactive file.
> So it is possible that mapped got bigger than active+inactive file, I wonder
> if there is potential problem that mmaped contains the number of shmem and
> ipc, I'll take a look then.
Comment 64 Chen Yu 2016-06-23 12:44:40 UTC
(In reply to Jay from comment #62)
> (In reply to Chen Yu from comment #61)
> > After some code walking, some parts of the mapped pages comes from shmem
> and
> > ipc/shm,
> > and the rest comes from ordinary file page cache, thus active,inactive
> file.
> > So it is possible that mapped got bigger than active+inactive file, I
> wonder
> > if there is potential problem that mmaped contains the number of shmem and
> > ipc, I'll take a look then.
> 
> Interesting info, I appreciate your effort!
> 
> The comment to NR_FILE_MAPPED in mmzone.h says
> 
>       /* pagecache pages mapped into pagetables.
>                          only modified from process context */
> 
> I guess this was the original concept for nr_mapped and the reason why it is
> used in minimum_image_size().
> 
> According to what you have found out, nr_mapped is no longer reliable in
> that sense. One problem this causes is that minimum_image_size tends to be
> too big. And there may be others. So I think this should be fixed, if
> possible.
> 
> Or at least the a/m comment needs to be modified. ;)

After some investigation I have to say: there is no problem with shm, the problem is in minimum_image_size.

First, shm pages are marked as file pages in shmem_mmap, so any mmap of shm memory increases the NR_FILE_MAPPED count.

Second, newly allocated shm pages are linked into the anon LRU (yes, this is a little weird, but the fact is that shm does not want any user to know its real file position in the system; users just leverage shmget and shmat to use the shm, and they do not need to know the exact place where the shm memory is). So the increment for shm goes to NR_ACTIVE_ANON or NR_INACTIVE_ANON, as well as NR_SHMEM.

So, generally speaking, each time you access a mmapped shm region:
NR_FILE_MAPPED++, NR_SHMEM++, (NR_ACTIVE_ANON or NR_INACTIVE_ANON)++

Third, NR_SHMEM is not only incremented when accessing an mmap but also in other contexts, such as fallocate or tmpfs writes; these also increase NR_SHMEM without touching the other counters.
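
For illustration, here is a minimal userspace sketch (a hypothetical test program, not from the kernel tree or from any patch in this report): it creates and attaches a SysV shm segment and touches every page, which according to the accounting above should bump nr_mapped, nr_shmem and the anon LRU counters; compare /proc/vmstat before starting it and while it sleeps.

/* shm-touch.c - attach a SysV shm segment and touch every page, then wait.
 * Build: gcc -o shm-touch shm-touch.c
 * While it sleeps, compare nr_mapped, nr_shmem, nr_active_anon and
 * nr_inactive_anon in /proc/vmstat with the values seen before starting it.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_SIZE (64UL * 1024 * 1024)	/* 64 MB, i.e. 16384 pages of 4 kB */

int main(void)
{
	int id = shmget(IPC_PRIVATE, SHM_SIZE, IPC_CREAT | 0600);
	char *p;

	if (id < 0) {
		perror("shmget");
		return 1;
	}
	p = shmat(id, NULL, 0);
	if (p == (char *)-1) {
		perror("shmat");
		return 1;
	}
	memset(p, 0x5a, SHM_SIZE);	/* fault every page into the page tables */
	printf("touched %lu MB of shm, check /proc/vmstat now...\n",
	       SHM_SIZE >> 20);
	sleep(60);
	shmdt(p);
	shmctl(id, IPC_RMID, NULL);	/* remove the segment again */
	return 0;
}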


please give a solution:)
Comment 65 Rainer Fiebig 2016-06-24 16:22:19 UTC
(In reply to Chen Yu from comment #64)
> (In reply to Jay from comment #62)
> > (In reply to Chen Yu from comment #61)
> > > After some code walking, some parts of the mapped pages comes from shmem
> and
> > > ipc/shm,
> > > and the rest comes from ordinary file page cache, thus active,inactive
> file.
> > > So it is possible that mapped got bigger than active+inactive file, I
> wonder
> > > if there is potential problem that mmaped contains the number of shmem
> and
> > > ipc, I'll take a look then.
> > 
> > Interesting info, I appreciate your effort!
> > 
> > The comment to NR_FILE_MAPPED in mmzone.h says
> > 
> >     /* pagecache pages mapped into pagetables.
> >                        only modified from process context */
> > 
> > I guess this was the original concept for nr_mapped and the reason why it
> is
> > used in minimum_image_size().
> > 
> > According to what you have found out, nr_mapped is no longer reliable in
> > that sense. One problem this causes is that minimum_image_size tends to be
> > too big. And there may be others. So I think this should be fixed, if
> > possible.
> > 
> > Or at least the a/m comment needs to be modified. ;)
> 
> After some investigation I have to say, there is no problem of shm, the
> problem is in minimum_image_size.
> 
> First, for shm pages, it is marked as file-pages in shmem_mmap, so
> any mmap to the shm memory would increase the number of NR_FILE_MAPPED.
> 
> Second, the new allocated shm pages are linked into anon lru(yes, this is a
> little weird, but the fact is that,shm does not want any user to know its
> real file position in the system, users just leverage shmget and shmat to
> use the shm, and they do not need to know the exact place where the shm
> memory is).So, the increment of shm goes to 'NR_ACTIVE_ANON or
> NR_INACTIVE_ANON', as well as NR_SHMEM.
> 
> so general speaking, each time you access the mmaped shm region, the number
> of
> NR_FILE_MAPPED++, NR_SHMEM++, (NR_ACTIVE_ANON or NR_INACTIVE_ANON)++
> 
> Third, the increment of NR_SHMEM is not only during accessing mmap, but also
> in other context, such as fallocate/tmpfs write, they also increase the
> number of  NR_SHMEM but without touching other counters.
> 
> 
> please give a solution:)

Mission impossible. :)

But first let me see if I understood your findings correctly (I assumed that shm and shmem are the same):

1) NR_FILE_MAPPED = nr_mapped_filecache_pages + nr_mapped_shmem_pages

This would explain why nr_mapped can be greater than the file cache after executing drop_caches.
And the comment to NR_FILE_MAPPED in mmzone.h should be changed. (Or is there a difference between "filecache pages" and "pagecache pages"?)

2) (NR_ACTIVE_ANON + NR_INACTIVE_ANON) includes nr_mapped_shmem_pages

3) nr_mapped_shmem_pages <= NR_SHMEM


If 1) and 2) are true, subtracting NR_FILE_MAPPED as in the original minimum_image_size() would be perfectly right. And approaches 2) and 3) from my Comment 57 would be wrong. Although both work where the orig. doesn't.

I have to think about it.
Comment 66 Chen Yu 2016-06-26 12:51:39 UTC
(In reply to Jay from comment #65)
> (In reply to Chen Yu from comment #64)
> > (In reply to Jay from comment #62)
> > > (In reply to Chen Yu from comment #61)
> > > > After some code walking, some parts of the mapped pages comes from
> shmem and
> > > > ipc/shm,
> > > > and the rest comes from ordinary file page cache, thus active,inactive
> file.
> > > > So it is possible that mapped got bigger than active+inactive file, I
> wonder
> > > > if there is potential problem that mmaped contains the number of shmem
> and
> > > > ipc, I'll take a look then.
> > > 
> > > Interesting info, I appreciate your effort!
> > > 
> > > The comment to NR_FILE_MAPPED in mmzone.h says
> > > 
> > >   /* pagecache pages mapped into pagetables.
> > >                      only modified from process context */
> > > 
> > > I guess this was the original concept for nr_mapped and the reason why it
> is
> > > used in minimum_image_size().
> > > 
> > > According to what you have found out, nr_mapped is no longer reliable in
> > > that sense. One problem this causes is that minimum_image_size tends to
> be
> > > too big. And there may be others. So I think this should be fixed, if
> > > possible.
> > > 
> > > Or at least the a/m comment needs to be modified. ;)
> > 
> > After some investigation I have to say, there is no problem of shm, the
> > problem is in minimum_image_size.
> > 
> > First, for shm pages, it is marked as file-pages in shmem_mmap, so
> > any mmap to the shm memory would increase the number of NR_FILE_MAPPED.
> > 
> > Second, the new allocated shm pages are linked into anon lru(yes, this is a
> > little weird, but the fact is that,shm does not want any user to know its
> > real file position in the system, users just leverage shmget and shmat to
> > use the shm, and they do not need to know the exact place where the shm
> > memory is).So, the increment of shm goes to 'NR_ACTIVE_ANON or
> > NR_INACTIVE_ANON', as well as NR_SHMEM.
> > 
> > so general speaking, each time you access the mmaped shm region, the number
> > of
> > NR_FILE_MAPPED++, NR_SHMEM++, (NR_ACTIVE_ANON or NR_INACTIVE_ANON)++
> > 
> > Third, the increment of NR_SHMEM is not only during accessing mmap, but
> also
> > in other context, such as fallocate/tmpfs write, they also increase the
> > number of  NR_SHMEM but without touching other counters.
> > 
> > 
> > please give a solution:)
> 
> Mission impossible. :)
> 
> But first let me see if I understood your findings correctly (I assumed that
> shm and shmem are the same):
> 
> 1) NR_FILE_MAPPED = nr_mapped_filecache_pages + nr_mapped_shmem_pages
yes
> 
> This would explain why nr_mapped can be greater than the file cache after
> executing drop_caches.
> And the comment to NR_FILE_MAPPED in mmzone.h should be changed. (Or is
> there a difference between "filecache pages" and "pagecache pages"?)
they are the same.
> 
> 2) (NR_ACTIVE_ANON + NR_INACTIVE_ANON) includes nr_mapped_shmem_pages
> 
yes
> 3) nr_mapped_shmem_pages <= NR_SHMEM
yes
> 
> 
> If 1) and 2) are true, subtracting NR_FILE_MAPPED as in the original
> minimum_image_size() would be perfectly right. And approaches 2) and 3) from
> my Comment 57 would be wrong. Although both work where the orig. doesn't.
Correct, the original minimum_image_size is doing the right thing. The reason hibernation fails on your system may be that the code does not want us to reclaim shmem; maybe it is just a strategy.
Comment 67 Rainer Fiebig 2016-06-26 20:20:49 UTC
(In reply to Chen Yu from comment #66)
> (In reply to Jay from comment #65)
> > (In reply to Chen Yu from comment #64)
> > > (In reply to Jay from comment #62)
> > > > (In reply to Chen Yu from comment #61)
> > > > > After some code walking, some parts of the mapped pages comes from
> shmem and
> > > > > ipc/shm,
> > > > > and the rest comes from ordinary file page cache, thus
> active,inactive file.
> > > > > So it is possible that mapped got bigger than active+inactive file, I
> wonder
> > > > > if there is potential problem that mmaped contains the number of
> shmem and
> > > > > ipc, I'll take a look then.
> > > > 
> > > > Interesting info, I appreciate your effort!
> > > > 
> > > > The comment to NR_FILE_MAPPED in mmzone.h says
> > > > 
> > > >         /* pagecache pages mapped into pagetables.
> > > >                            only modified from process context */
> > > > 
> > > > I guess this was the original concept for nr_mapped and the reason why
> it is
> > > > used in minimum_image_size().
> > > > 
> > > > According to what you have found out, nr_mapped is no longer reliable
> in
> > > > that sense. One problem this causes is that minimum_image_size tends to
> be
> > > > too big. And there may be others. So I think this should be fixed, if
> > > > possible.
> > > > 
> > > > Or at least the a/m comment needs to be modified. ;)
> > > 
> > > After some investigation I have to say, there is no problem of shm, the
> > > problem is in minimum_image_size.
> > > 
> > > First, for shm pages, it is marked as file-pages in shmem_mmap, so
> > > any mmap to the shm memory would increase the number of NR_FILE_MAPPED.
> > > 
> > > Second, the new allocated shm pages are linked into anon lru(yes, this is
> a
> > > little weird, but the fact is that,shm does not want any user to know its
> > > real file position in the system, users just leverage shmget and shmat to
> > > use the shm, and they do not need to know the exact place where the shm
> > > memory is).So, the increment of shm goes to 'NR_ACTIVE_ANON or
> > > NR_INACTIVE_ANON', as well as NR_SHMEM.
> > > 
> > > so general speaking, each time you access the mmaped shm region, the
> number
> > > of
> > > NR_FILE_MAPPED++, NR_SHMEM++, (NR_ACTIVE_ANON or NR_INACTIVE_ANON)++
> > > 
> > > Third, the increment of NR_SHMEM is not only during accessing mmap, but
> also
> > > in other context, such as fallocate/tmpfs write, they also increase the
> > > number of  NR_SHMEM but without touching other counters.
> > > 
> > > 
> > > please give a solution:)
> > 
> > Mission impossible. :)
> > 
> > But first let me see if I understood your findings correctly (I assumed
> that
> > shm and shmem are the same):
> > 
> > 1) NR_FILE_MAPPED = nr_mapped_filecache_pages + nr_mapped_shmem_pages
> yes
> > 
> > This would explain why nr_mapped can be greater than the file cache after
> > executing drop_caches.
> > And the comment to NR_FILE_MAPPED in mmzone.h should be changed. (Or is
> > there a difference between "filecache pages" and "pagecache pages"?)
> they are the same.
> > 
> > 2) (NR_ACTIVE_ANON + NR_INACTIVE_ANON) includes nr_mapped_shmem_pages
> > 
> yes
> > 3) nr_mapped_shmem_pages <= NR_SHMEM
> yes
> > 
> > 
> > If 1) and 2) are true, subtracting NR_FILE_MAPPED as in the original
> > minimum_image_size() would be perfectly right. And approaches 2) and 3) from
> > my Comment 57 would be wrong. Although both work where the orig. doesn't.
> Correct, the original minimum_image_size is doing the right thing.The reason
> why hibernation fail on your system may be that, the code does not want us
> to reclaim
> shmem, maybe it is just a strategy.

I still think that the high nr_mapped values cause the problem. 

But after your latest findings I also think that this is limited to VirtualBox. Nevertheless, even if it's just a bug in that software, some kernel-side provision against nr_mapped going through the roof would be good, imo. 

Anyway - I have adapted my personal safeguard according to your latest findings. The main differences to the original are that shrink_all_memory(saveable - size) is called earlier (which has the additional advantage that phase 2 is often bypassed) and that minimum_image_size() is modified to cap nr_mapped at its (theoretically) biggest value (see code below).

With "normal" nr_mapped values it works just like the original, except for the earlier shrink_all_memory().
In higher-load cases the values for minimum_image_size are now almost identical to the original's - provided nr_mapped is normal, that is (s. examples below).

As far as I am concerned, you may close this bug report. I think we've come to a happy end here, or not?

Thank you for your time and effort!

Jay

***

int hibernate_preallocate_memory(void)
...
	if (saveable > size) {
		saveable -= shrink_all_memory(saveable - size);
	}
	
	/*
	 * If the desired number of image pages is at least as large as the
	 * current number of saveable pages in memory, allocate page frames for
	 * the image and we're done.
	 */
	if (saveable <= size) {
		pages = preallocate_image_highmem(save_highmem);
		pages += preallocate_image_memory(saveable - pages, avail_normal);
		goto out;
	}
	
	/* Estimate the minimum size of the image. */
	pages = minimum_image_size(saveable);
...

/* 2016-06-26: modified to keep falsely high nr_mapped values in check */
static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size, nr_mapped, max_nr_mapped;

	nr_mapped = global_page_state(NR_FILE_MAPPED);
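	/* Theoretical maximum for nr_mapped: mapped pagecache pages are part of
	 * the file LRU and mapped shmem pages are bounded by NR_SHMEM (cf. the
	 * findings in Comments 64/66). */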
	max_nr_mapped = global_page_state(NR_ACTIVE_FILE) 
			+ global_page_state(NR_INACTIVE_FILE) 
			+ global_page_state(NR_SHMEM);
	
	if (nr_mapped > max_nr_mapped)
	    nr_mapped = max_nr_mapped;
	
	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON)
		+ global_page_state(NR_ACTIVE_FILE)
		+ global_page_state(NR_INACTIVE_FILE)
		- nr_mapped;
	
	return saveable <= size ? 0 : saveable - size;
}

***

Examples
Normal nr_mapped:

PM: Preallocating image memory... 
nr_mapped = 202515 pages, 791 MB
nr_shmem = 116517 pages, 455 MB
active_inactive(file) = 594637 pages, 2322 MB
reclaimable_anon_slab = 389834 pages, 1522 MB
save_highmem = 0 pages, 0 MB
saveable = 2112603 pages, 8252 MB

highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061342 pages, 11958 MB
count = 3023165 pages, 11809 MB
max_size = 1510448 pages, 5900 MB
user_specified_image_size = 1346986 pages, 5261 MB
adjusted_image_size = 1346987 pages, 5261 MB

saveable = 1657565 pages, 6474 MB
active_inactive(file) = 145201 pages, 567 MB
reclaimable_anon_slab = 371581 pages, 1451 MB
nr_mapped = 199885 pages, 780 MB
nr_shmem = 51507 pages, 201 MB
minimum_pages = 1337491 pages, 5224 MB

target_image_size = 1346987 pages, 5261 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512717 pages, 5909 MB
to_alloc_adjusted = 1512717 pages, 5909 MB
pages_allocated = 1512717 pages, 5909 MB
done (allocated 1676178 pages)
PM: Allocated 6704712 kbytes in 4.02 seconds (1667.83 MB/s)
...
PM: Need to copy 1348817 pages
PM: Hibernation image created (1348817 pages copied)

minimum_pages, orig.:		5199 MB
minimum_pages curr. version:	5224 MB

***

High nr_mapped:

PM: Preallocating image memory... 
nr_mapped = 871132 pages, 3402 MB
nr_shmem = 95182 pages, 371 MB
active_inactive(file) = 980056 pages, 3828 MB
reclaimable_anon_slab = 310479 pages, 1212 MB
save_highmem = 0 pages, 0 MB
saveable = 2415533 pages, 9435 MB

highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061315 pages, 11958 MB
count = 3023138 pages, 11809 MB
max_size = 1510435 pages, 5900 MB
user_specified_image_size = 1346986 pages, 5261 MB
adjusted_image_size = 1346987 pages, 5261 MB

saveable = 1566214 pages, 6118 MB
active_inactive(file) = 160351 pages, 626 MB
reclaimable_anon_slab = 277972 pages, 1085 MB
nr_mapped = 867213 pages, 3387 MB
nr_shmem = 49347 pages, 192 MB
minimum_pages = 1337589 pages, 5224 MB

target_image_size = 1346987 pages, 5261 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512703 pages, 5908 MB
to_alloc_adjusted = 1512703 pages, 5908 MB
pages_allocated = 1512703 pages, 5908 MB
done (allocated 1676151 pages)
PM: Allocated 6704604 kbytes in 2.63 seconds (2549.27 MB/s)
...
PM: Need to copy 1346249 pages
PM: Hibernation image created (1346249 pages copied)

minimum_pages, orig.:		7797 MB	  /* would fail */
minimum_pages curr. vers.:	5224 MB

***

Normal nr_mapped, high nr_anon:

PM: Preallocating image memory... 
nr_mapped = 71519 pages, 279 MB
nr_shmem = 72891 pages, 284 MB
active_inactive(file) = 939631 pages, 3670 MB
reclaimable_anon_slab = 1185771 pages, 4631 MB
save_highmem = 0 pages, 0 MB
saveable = 2242008 pages, 8757 MB

highmem = 0 pages, 0 MB
additional_pages = 220 pages, 0 MB
avail_normal = 3061320 pages, 11958 MB
count = 3023143 pages, 11809 MB
max_size = 1510437 pages, 5900 MB
user_specified_image_size = 1346986 pages, 5261 MB
adjusted_image_size = 1346987 pages, 5261 MB

saveable = 1371575 pages, 5357 MB
active_inactive(file) = 103980 pages, 406 MB
reclaimable_anon_slab = 1147489 pages, 4482 MB
nr_mapped = 69038 pages, 269 MB
nr_shmem = 30205 pages, 117 MB
minimum_pages = 189144 pages, 738 MB

target_image_size = 1346987 pages, 5261 MB
preallocated_high_mem = 0 pages, 0 MB
to_alloc = 1512706 pages, 5909 MB
to_alloc_adjusted = 1512706 pages, 5909 MB
pages_allocated = 1512706 pages, 5909 MB
done (allocated 1676156 pages)
PM: Allocated 6704624 kbytes in 1.15 seconds (5830.10 MB/s)
...
PM: Need to copy 1346554 pages
PM: Hibernation image created (1346554 pages copied)

minimum_pages, orig.:		735 MB
minimum_pages curr. vers.:	738 MB
Comment 68 Chen Yu 2016-06-27 13:39:54 UTC
As your proposal is to deal with insanely high nr_mapped (which may be caused by the VirtualBox driver, not generic shm), I'll close this thread for now and let's sync offline.
Comment 69 Rainer Fiebig 2016-08-13 12:55:38 UTC
No reopening, just FYI: 

since kernel 4.2 the workaround shown in Comment 25 doesn't work anymore - nr_mapped is high whether the VM is set to "large pages: on" or not.

It still works in 4.1.
Comment 70 Rainer Fiebig 2017-03-03 22:27:47 UTC
I think the bug *is* valid and I think I finally know what's behind it.

The explanation is simple: What was probably overlooked is that nr_mapped may be counted *twice* in "saveable" - once as NR_FILE_MAPPED and then as contributor to [in]active(file) and [in]active(anon) (s. Comment 64; s.1). 

If this is right, we subtract too little from saveable: by subtracting nr_mapped we end up subtracting [in]active(file) and [in]active(anon) *without* their nr_mapped component, so the nr_mapped component remains in saveable. What we then get as minimum_image_size is nr_mapped + nr_mapped. This is why the result for minimum_image_size is indeed always roughly twice the size of nr_mapped - which means: 100 % too high.

As long as nr_mapped is relatively low this goes unnoticed. But if nr_mapped is high enough, it comes to light.
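A small stand-alone illustration with invented page counts (not taken from any log here) makes the arithmetic visible; it simply plugs the observation from footnote 1) below into the original formula:

/* Illustrates the double counting described above. The page counts are
 * invented; footnote 1) below is the observation that
 * saveable ~ nr_mapped + [in]active(file) + [in]active(anon). */
#include <stdio.h>

int main(void)
{
	unsigned long mapped = 800000;	/* NR_FILE_MAPPED */
	unsigned long file   = 600000;	/* NR_ACTIVE_FILE + NR_INACTIVE_FILE */
	unsigned long anon   = 400000;	/* NR_ACTIVE_ANON + NR_INACTIVE_ANON */
	unsigned long slab   =  30000;	/* NR_SLAB_RECLAIMABLE */

	unsigned long saveable = mapped + file + anon;		/* footnote 1) */
	unsigned long size = slab + anon + file - mapped;	/* original formula */

	/* saveable - size = 2 * mapped - slab, i.e. roughly twice nr_mapped */
	printf("minimum image = %lu pages, 2 * nr_mapped = %lu pages\n",
	       saveable - size, 2 * mapped);
	return 0;
}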

So I think that *not* subtracting NR_FILE_MAPPED in minimum_image_size() is indeed the right thing to do. This simple solution has also proven very reliable in practice (s. Comment 1):


static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size;

	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON)
		+ global_page_state(NR_ACTIVE_FILE)
		+ global_page_state(NR_INACTIVE_FILE);
		/* - global_page_state(NR_FILE_MAPPED); */

	/* return saveable <= size ? 0 : saveable - size; */
	return saveable - size;
}


***

1) "saveable" always amounts to NR_FILE_MAPPED + [in]active(file) + [in]active(anon), give or take a few pages.
Comment 71 Rainer Fiebig 2017-03-09 17:21:20 UTC
I'm reopening this because

- the problem is still there
- it's not VirtualBox's fault (s. Comment 50)
- the cause is identified (s. Comment 70)
- the fix is simple and has been tested

If you don't agree with my findings or the fix, please briefly explain why.

I wouldn't mind if you then closed the bug again as it is my intention to solve this and not to start an endless debate.
Comment 72 Chen Yu 2017-03-12 17:47:33 UTC
OK, I'll look at this one this week, but it will take me some time to pick up the history (I forgot the details...)
Comment 73 Rainer Fiebig 2017-03-13 11:22:03 UTC
(In reply to Chen Yu from comment #72)
> OK I'll look at this one this week, but it will take me sometime to pick up
> the history(I forgot the detail..)

Thanks. Take your time, it's not urgent.
Comment 74 Rainer Fiebig 2017-04-01 12:32:33 UTC
(In reply to Chen Yu from comment #72)
> OK I'll look at this one this week, but it will take me sometime to pick up
> the history(I forgot the detail..)


Dozed off picking up the history? ;)

Understandable. So here's the story in a nutshell:

- s2disk/s2both may *unnecessarily* fail if NR_FILE_MAPPED is high
- the rather popular VirtualBox causes such high nr_mapped, roughly in proportion to the memory assigned to the VM, i.e. a Windows VM with 2 GB of system memory increases nr_mapped by about 2 GB
- according to your findings VB does not manipulate memory-counters in an illegal way
- the problem is how the minimum-image-size is calculated in minimum_image_size()
- I think subtracting nr_mapped from the other counters in minimum_image_size() is simply an error in reasoning and leads to a minimum-image-size that is roughly twice as high as it should be
- this is only veiled by the fact that in most cases nr_mapped is small compared to the other counters
- therefore, not subtracting nr_mapped in minimum_image_size() is the way to go

So let's come to a decision: either try to solve this in the foreseeable future or close this thing now.

I could live with both. Because to be frank: I more and more like the idea of being *the only guy on this planet* who enjoys a truly reliable s2disk in Linux, irrespective of what apps are running. :)
Comment 75 Rainer Fiebig 2017-04-10 15:38:34 UTC
The sound of silence was loud enough - so I'm out of here now.

I was a bit insistent in this matter because I think it is important - especially since there is no workaround.

And I'm afraid that with a shortcoming like this, "Linux on the desktop" is not going to happen.

Don't blame it on me. ;)
Comment 76 Chen Yu 2017-04-10 16:14:07 UTC
(In reply to Rainer Fiebig from comment #70)
> I think the bug *is* valid and I think I finally know what's behind it.
> 
> The explanation is simple: What was probably overlooked is that nr_mapped
> may be counted *twice* in "saveable" - once as NR_FILE_MAPPED and then as
> contributor to [in]active(file) and [in]active(anon) (s. Comment 64; s.1). 
> 
> If this is right, we subtract too little from saveable when we subtract
> [in]active(file) and [in]active(anon) *without* the nr_mapped-component (by
> subtracting nr_mapped). The nr_mapped-component then remains in saveable.
> What we then get as minimum_image_size is nr_mapped + nr_mapped. This is why
> the result for minimum_image_size is indeed always roughly twice the size of
> nr_mapped - which means: 100 % too high.
> 
> As long as nr_mapped is relatively low this goes unnoticed. But if nr_mapped
> is high enough, it comes to light.
> 
> So I think that *not* subtracting NR_FILE_MAPPED in minimum_image_size() is
> indeed the right thing to do. This simple solution has also proven very
> reliable in practice (s. Comment 1):
> 
> 
> static unsigned long minimum_image_size(unsigned long saveable)
> {
>       unsigned long size;
> 
>       size = global_page_state(NR_SLAB_RECLAIMABLE)
>               + global_page_state(NR_ACTIVE_ANON)
>               + global_page_state(NR_INACTIVE_ANON)
>               + global_page_state(NR_ACTIVE_FILE)
>               + global_page_state(NR_INACTIVE_FILE);
>               /* - global_page_state(NR_FILE_MAPPED); */
> 
>       /* return saveable <= size ? 0 : saveable - size; */
>       return saveable - size;
> }
> 
> 
> ***
> 
> 1) "saveable" always amounts to NR_FILE_MAPPED + [in]active(file) +
> [in]active(anon), give or take a few pages.
If we don't subtract NR_FILE_MAPPED from the size, we will get a smaller value for the minimum image size, which means we might over-reclaim some pages compared to the current code flow?
Comment 77 Rainer Fiebig 2017-04-11 20:13:52 UTC
(In reply to Chen Yu from comment #76)
> (In reply to Rainer Fiebig from comment #70)
> > I think the bug *is* valid and I think I finally know what's behind it.
> > 
> > The explanation is simple: What was probably overlooked is that nr_mapped
> > may be counted *twice* in "saveable" - once as NR_FILE_MAPPED and then as
> > contributor to [in]active(file) and [in]active(anon) (s. Comment 64; s.1). 
> > 
> > If this is right, we subtract too little from saveable when we subtract
> > [in]active(file) and [in]active(anon) *without* the nr_mapped-component (by
> > subtracting nr_mapped). The nr_mapped-component then remains in saveable.
> > What we then get as minimum_image_size is nr_mapped + nr_mapped. This is why
> > the result for minimum_image_size is indeed always roughly twice the size of
> > nr_mapped - which means: 100 % too high.
> > 
> > As long as nr_mapped is relatively low this goes unnoticed. But if nr_mapped
> > is high enough, it comes to light.
> > 
> > So I think that *not* subtracting NR_FILE_MAPPED in minimum_image_size() is
> > indeed the right thing to do. This simple solution has also proven very
> > reliable in practice (s. Comment 1):
> > 
> > 
> > static unsigned long minimum_image_size(unsigned long saveable)
> > {
> >       unsigned long size;
> > 
> >       size = global_page_state(NR_SLAB_RECLAIMABLE)
> >               + global_page_state(NR_ACTIVE_ANON)
> >               + global_page_state(NR_INACTIVE_ANON)
> >               + global_page_state(NR_ACTIVE_FILE)
> >               + global_page_state(NR_INACTIVE_FILE);
> >               /* - global_page_state(NR_FILE_MAPPED); */
> > 
> >       /* return saveable <= size ? 0 : saveable - size; */
> >       return saveable - size;
> > }
> > 
> > 
> > ***
> > 
> > 1) "saveable" always amounts to NR_FILE_MAPPED + [in]active(file) +
> > [in]active(anon), give or take a few pages.
> If we don't substract the NR_FILE_MAPPED from the size, it means we will
> get a smaller value of minimum image size, which means, we might over-relaim
> some pages compared to current code flow?

Yes, minimum_image_size() would return a smaller value - that is the intended effect to prevent unnecessary failure.

And you're right, we might indeed at times over-reclaim. Like in this high-load case:

PM: Preallocating image memory... 
...
saveable = 3011927 pages, 11765 MB
nr_mapped = 1479297 pages, 5778 MB
active_inactive(file) = 1012822 pages, 3956 MB
nr_sreclaimable = 33485 pages, 130 MB
active_inactive(anon) = 555363 pages, 2169 MB
...
minimum_pages = 1410257 pages, 5508 MB	/* 11288 MB and failure with orig. */
nr_mapped = 1477778 pages, 5772 MB	/* After shrink_all_memory() */
...
PM: Hibernation image created (1416711 pages copied)

The value returned by minimum_image_size() is < nr_mapped. This makes avail_normal appear greater than it actually is:

	...
	if (avail_normal > pages)
		avail_normal -= pages;
	...

I'm still trying to figure out the consequences if, due to an overly optimistic value for avail_normal, the else-block of the following sequence were entered erroneously instead of the first block:

	...
	pages = preallocate_image_memory(alloc, avail_normal);
	if (pages < alloc) {
		...leads to failure
	} else {
		...success
	}

Would s2disk still succeed? Or would it simply fail and we get back to the desktop? Or would the system crash?
So far I haven't had a crash, only successes or (on intentional overload) failure and return to the desktop.

What's the expert's view?
Comment 78 Rainer Fiebig 2017-04-15 09:31:57 UTC
(In reply to Rainer Fiebig from comment #77)
snip

> 
> I'm still trying to figure out the consequences, if instead of the first
> block of the following sequence the else-block would be entered erroneously
> due to a too optimistic value for avail_normal:
> 
>       ...
>       pages = preallocate_image_memory(alloc, avail_normal);
>       if (pages < alloc) {
>               ...leads to failure
>       } else {
>               ...success
>       }
> 
> Would s2disk still succeed? Or would it simply fail and we get back to the
> desktop? Or would the system crash?
> So far I haven't had a crash, only successes or (on intentional overload)
> failure and return to the desktop.
> 
> What's the expert's view?


Took a closer look. And I think an avail_normal that is falsely too high is not a problem (one that is falsely too low is):

If avail_normal is greater than alloc (falsely or not), preallocate_image_memory() returns either "alloc" or less, depending on how many free pages can be found by preallocate_image_pages(). So it's either success or failure. But no crash. Agrees with what I've been seeing here.

However, if one doesn't like the notion that minimum_pages can at times be less than nr_mapped, one could do something like this:

	/* Estimate the minimum size of the image. */
	pages = minimum_image_size(saveable);
	if (pages < global_node_page_state(NR_FILE_MAPPED))
		pages = global_node_page_state(NR_FILE_MAPPED);
Comment 79 Rainer Fiebig 2017-04-19 17:41:21 UTC
Created attachment 255931 [details]
Proposed fix for failure of s2disk in case of high NR_FILE_MAPPED

Make a copy of the original snapshot.c

Place the patch in the directory above the kernel sources. Open a terminal there and move into the sources directory (e.g. cd linux-4.9.20). 

Then test by entering
patch -p1 < ../patch-s2disk_without_debug_k4920_or_later --dry-run

If no warnings or error-messages, patch by entering 
patch -p1 < ../patch-s2disk_without_debug_k4920_or_later

Compare your snapshot.orig to the patched snapshot.c, to see what has changed.
If that's OK with you, recompile/install the kernel.

If you later want to update the kernel-sources, you first have to remove the patch: 
patch -R -p1 < ../patch-s2disk_without_debug_k4920_or_later
Comment 80 Rainer Fiebig 2017-04-19 17:44:30 UTC
This on/off communication is unproductive, so let's come to an end now.

As a final contribution I have attached a patch for kernel 4.9.20 or later(*). Perhaps for testing by experienced fellow sufferers. Provided with good intent but without guarantees.

For me, one question was left to be answered: why did s2disk fail in a case with an overflow in minimum_image_size() (https://bugzilla.kernel.org/show_bug.cgi?id=47931#c63) - while overflow always led to success on my system?

In hindsight the answer is pretty simple: in the said case memory was *really* used up:
...
[ 110.091830] to_alloc_adjusted=495253 pages, 1934MB	/* wanted */
[ 111.824189] pages_allocated=474182 pages, 1852MB	/* actually allocated */
...

But in the case of high nr_mapped, more often than not memory only *seems* to be used up - due to the wrong calculation in minimum_image_size(). And in those cases an overflow in minimum_image_size() can actually be *necessary* for s2disk to succeed (see 3) and 4) below).
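What "overflow" means here: size is an unsigned long, so when nr_mapped exceeds the sum of the other counters the subtraction wraps around to a huge value and the original function then returns 0 (no minimum constraint). A small stand-alone illustration with invented numbers:

/* Unsigned wrap-around ("overflow") in the original formula: with
 * mapped > slab + anon + file, size wraps to a huge value, so the
 * check "saveable <= size ? 0 : saveable - size" yields 0.
 * The numbers are invented for the illustration. */
#include <stdio.h>

int main(void)
{
	unsigned long saveable = 2400000;
	unsigned long slab = 30000, anon = 500000, file = 600000;
	unsigned long mapped = 1400000;

	unsigned long size = slab + anon + file - mapped;	/* wraps around */

	printf("size = %lu\n", size);				/* huge value */
	printf("minimum image = %lu\n",
	       saveable <= size ? 0UL : saveable - size);	/* 0 */
	return 0;
}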

That's all. Thanks for your time and effort!


---

Summary of situation with the original code, total memory of 12 GB:

1) s2disk succeeds in this case:
saveable = 1815966 pages, 7093 MB
nr_mapped = 702215 pages, 2743 MB
active_inactive(file) = 660676 pages, 2580 MB
nr_sreclaimable = 29769 pages, 116 MB
active_inactive(anon) = 456063 pages, 1781 MB

2) Fails in this:
saveable = 2033746 pages, 7944 MB
nr_mapped = 912466 pages, 3564 MB
active_inactive(file) = 673444 pages, 2630 MB
nr_sreclaimable = 26927 pages, 105 MB
active_inactive(anon) = 475323 pages, 1856 MB

3) May succeed if 2) is followed immediately by a second s2disk.

4) Succeeds at the *first* attempt in a case where memory-usage is even *higher* than in case 2):
saveable = 2362834 pages, 9229 MB
nr_mapped = 1423210 pages, 5559 MB
active_inactive(file) = 598559 pages, 2338 MB
nr_sreclaimable = 26542 pages, 103 MB
active_inactive(anon) = 324172 pages, 1266 MB

Who wouldn't be confused by that?

---

*) Instructions: see comment to patch
Comment 81 Chen Yu 2017-04-20 14:53:39 UTC
Thanks for your effort, previously I was a little busy with some critical internal bugs, but I promise to return to this one once they are done.
Comment 82 Manuel Krause 2017-06-26 20:48:46 UTC
To give a little feedback for you: 
I have had Jay/Rainer Fiebig's most recent patches from BUG 47931 and from here in use for 1 1/4 days now on kernel 4.11.7, reaching a kind of milestone with 10 successful (i.e. resuming) hibernations atm.
Normally, with in-kernel hibernation, problems begin after attempt No. 3 with segfaults in random processes. Most times they affect needed KDE subprocesses by No. 10 at the latest, often earlier, requiring a reboot. Also negative: the resume speed until the desktop, incl. Firefox, gets interactive again.

This time, patched, I'm astonished by resume speed and reliability. Approx. 2.5min to resume to an all-responsive desktop state is very good and near to former TuxOnIce timings. So far only one segfault (FF's plugin-container, meaning flash-plugin). And only one really slow desktop recovery at No.9 -- where I had filled the /dev/shm ramdisk to ~1/2 (but not before and later). Main problem with No.9 was heavy swapping.

My most relevant setup info: notebook with Core2Duo CPU, 8G RAM, integrated GFX (up to 2G shared), 13G swap on a 2nd internal standard HDD, 3G /dev/shm as ramdisk (often in use).
The tests were done under my normal use patterns, with varying /dev/shm and Firefox loads.

I'd keep up testing until milestone 20(?) if all goes well :-)
Comment 83 Manuel Krause 2017-07-03 17:34:43 UTC
New findings, as more testing time went by:
* removed "kmozillahelper": This package is most likely to have infected RAM and parent- and child-processes of Firefox, involving KDE, leading to page/ protection faults, and invoking nasty coredumps via systemd-coredump. (Only, as these coredumps likely occurred after resumes from s2disk, I've mentioned it on here. In fact the fautls were unpredictable.)
* I've also thoroughly tested the patches from Jay/ Rainer Fiebig from BUG 47931,  https://bugzilla.kernel.org/attachment.cgi?id=152151 vs. https://bugzilla.kernel.org/attachment.cgi?id=155521, and have the experience, that the former patch makes resume-timings shorter and more predictable. In corner cases the second patch can leave the KDE & Firefox unresponsive for over ~5min, means: reboot & restarting applications instead would be faster. 
The first patch, on the other hand, always remains between 1 - 2.5min resume time, until desktop and FF get responsive. And that's fine.
And BTW, I'm now at 4d 2h, 17th resume, with this kernel. Only changes made to the FF setup.
Comment 84 Manuel Krause 2017-07-03 17:41:10 UTC
I've forgotten to add another comparison:
In normal -- unpatched -- mainline 4.11.8, KDE & Firefox can remain unresponsive for more than 7 minutes in corner cases.
Comment 85 Rainer Fiebig 2017-08-28 10:15:16 UTC
(In reply to Manuel Krause from comment #82)
> To give a little feedback for you: 
> I have Jay Rainer Fiebig's most recent patches from BUG 47931 and from here
> in use for 1 1/4 days now at kernel 4.11.7, reaching a kind of milestone
> with 10 successful ==resuming hibernations atm.
> Normally with in-kernel hibernation problems begin after attempt No.3 with
> segfaults in random processes. Most times they affect needed KDE
> subprocesses by max. No.10, often earlier, requiring a reboot. Also
> negative: the resume speed until the desktop incl. firefox gets interactive
> again.
> 
> This time, patched, I'm astonished by resume speed and reliability. Approx.
> 2.5min to resume to an all-responsive desktop state is very good and near to
> former TuxOnIce timings. So far only one segfault (FF's plugin-container,
> meaning flash-plugin). And only one really slow desktop recovery at No.9 --
> where I had filled the /dev/shm ramdisk to ~1/2 (but not before and later).
> Main problem with No.9 was heavy swapping.
> 
> My most relevant setup info: Notebook with Core2duo CPU, 8G RAM, integrated
> GFX (upto 2G shared), 13G swap on 2nd internal standard HDD, 3G /dev/shm as
> ramdisk (often in use).
> The testings were done under my normal use patterns, with having changed
> /dev/shm and firefox loads.
> 
> I'd keep up testing til milestone 20(?) if all goes well :-)

It's not a good idea to mix patches. If your problems are caused by high NR_FILE_MAPPED, the patch provided here is the only one you need.

Otherwise stick with the original code and a long-term-kernel (perhaps 4.9) and try to track down the real issue (graphics?) - for instance by systematically excluding possible troublemakers while keeping everything else unchanged.

BTW: 10 successful hibernations may be a "milestone" for a polar bear. But for a Linux-system? ;)
Comment 86 Rainer Fiebig 2017-10-03 16:26:28 UTC
On grounds of theoretical musings (see Comment 70), lack of convincing counter-arguments and the practical evidence of now thousands of successful s2disks I consider this solved.

In case minimum_image_size() was deliberately designed to return a "generous" value in order to limit pressure on the normal zone in 32bit-machines (and get additional pages from highmem), a conditional compilation could take care of this.
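A minimal sketch of what such a conditional compilation could look like - keying on CONFIG_HIGHMEM is only an assumption here, not a tested patch:

/* Sketch only: keep the old, "generous" estimate on highmem (32-bit)
 * configurations and drop the NR_FILE_MAPPED subtraction everywhere else.
 * Using CONFIG_HIGHMEM as the switch is an assumption. */
static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size;

	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON)
		+ global_page_state(NR_ACTIVE_FILE)
		+ global_page_state(NR_INACTIVE_FILE);
#ifdef CONFIG_HIGHMEM
	/* Old behaviour: larger minimum, pressure can be taken up by highmem. */
	size -= global_page_state(NR_FILE_MAPPED);
#endif

	return saveable <= size ? 0 : saveable - size;
}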

So long!
Comment 87 Rainer Fiebig 2018-04-02 18:20:18 UTC
Fixed in 4.16