Bug 11437 - Burst swap-in in case of lots of free RAM
Summary: Burst swap-in in case of lots of free RAM
Status: RESOLVED OBSOLETE
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Page Allocator
Hardware: All Linux
Importance: P1 enhancement
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-08-27 14:17 UTC by Xu
Modified: 2012-10-30 15:04 UTC
CC List: 1 user

See Also:
Kernel Version: 2.6.26
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Xu 2008-08-27 14:17:53 UTC
Consider a system with the following setup:
* 3 GiB RAM
* 3 GiB swap

The 3 GiB of RAM are nearly fully used by lots of running processes.
The 3 GiB of swap are nearly free.

Now a memory hog process is started. The virtual memory of all the processes is going to be swapped out. The memory hog process finishes (for example because it is too hungry and gets an "out of memory" error).

Then, the state is the following:
* 3 GiB RAM is nearly free
* 3 GiB swap is nearly full

The normal Linux swap-in strategy is to swap in page by page (which usually means 4 KiB at a time). Transferring 3 GiB back into RAM this way takes a very long time. Thus, in practice, the system is not only unusable during the execution of the memory hog process, but also for about 2-3 minutes afterwards. And even then, switching to an application that has not been swapped in yet causes further delays. The accumulated swap-in time may well be on the order of 4 minutes.

This can be optimized: Just read all the swap pages linearly in the same order as they are laid out on disk. Then, the system may become completely usable in about 30 seconds.
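As a rough illustration of what a linear pass costs (purely hypothetical example; /dev/sda2 stands in for the swap partition, and this only measures the raw sequential read time over the swap area, it does not actually swap anything back in):

# dd if=/dev/sda2 of=/dev/null bs=1M

On a disk delivering roughly 60 MiB/s, this finishes a 3 GiB swap area in well under a minute, which is the kind of time frame meant above.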
Comment 1 Francois Cartegnie 2009-02-16 20:17:10 UTC
(In reply to comment #0)

A few comments on your bug, just my opinion.

> Now a memory hog process is started. The virtual memory of all the processes
> is
> going to be swapped out. The memory hog process finishes (for example because

It is the *oldest inactive* virtual memory (LRU) of the processes that is going to be swapped out.

> This can be optimized: Just read all the swap pages linearly in the same
> order
> as they are laid out on disk. Then, the system may become completely usable
> in
> about 30 seconds.

They're not laid out on disk in a linear order, or in any particular order, since page activity differs over the course of the swap-out process, which is not atomic. Swap size can be equal to the physical memory size... or not. You're pointing out the fact that swap I/O works at the same page-size granularity as the VM. Your linear read optimization would work if swap-out operated on extents containing whole processes, or some segments of them, but in that case it would also keep unused memory areas in physical memory.

Reading 4K pages linearly won't mean that you're reading all the pages belonging to the process requiring swap-in, and then you'll need to swap out pages again. Then, if you think about reading/filtering the needed pages while reading linearly, remember that swap is not written linearly and the data can be spread over 10 times the physical memory size.

To me, the swap behaviour seems correct. The system is not usable with the hog process because of the 'dance' of pages in and out of swap. Reading linearly? Why would the system restore all the pages linearly (when they might have been spread everywhere)? It only restores pages when required, the same way it swaps out pages, minimizing disk accesses. Your system can't guess the required memory usage at T+1, so you can't ask it to restore the T-1 state.
Comment 2 Xu 2009-02-17 03:08:33 UTC
Okay, let's be more precise. We care about the case where the ratio between the current swap usage and the current free memory is below 1 (or maybe slightly above 1).

> Reading linearly 4K pages won't mean that you're reading all the pages
> belonging to the process requiring swapin, and then you'll need to swapout
> pages.

No. Because we have more free memory than the current swap usage, we can swap in all pages without the penalty of requiring a swapout of other pages.

> Then if you think about reading/filtering needed pages while linearly
> reading, remember that the swap is not wrote linearly and data can be spread
> on
> 10 times the phy mem size.

While this may be true, and while it may only apply if there is less free memory than the current swap usage, linear reading may still be faster in this case: consider modern hard disks. Their sequential burst throughput may be about 60 MiB/s, while their worst-case random-access throughput is still only about 0.5 MiB/s (seek time = 8 ms, which means 125 seeks per second, with a 4 KiB read per seek). Thus, even if the data is spread not just over an area 10 times the size of the data, but 120 times its size, it may still be faster to do one big sequential read than many small random accesses.
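To put rough numbers on this, using the figures above:

  3 GiB random access at 0.5 MiB/s:    3072 MiB / 0.5 MiB/s  = 6144 s (~100 minutes)
  3 GiB sequential at 60 MiB/s:        3072 MiB / 60 MiB/s   = ~51 s
  120 x 3 GiB sequential at 60 MiB/s:  368640 MiB / 60 MiB/s = 6144 s

So a sequential pass over the whole area roughly breaks even at a spread of about 120 times the data size, and wins for anything below that.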

> For me, the swap behaviour seems correct. The system is not usable with the
> hog
> process, because of the 'dance' of pages in/out swap. Reading linearly ? Why
> would the system restore all linear pages (which might have been spread
> everywhere) ? It only restores pages when required, the same way it swaps out
> pages, minimizing disk accesses.

Well, this is exactly the point. The system minimizes the total number of 4 KiB read accesses. However, Linux should instead minimize the total swap-in time. Minimizing the total swap-in time can be modeled by splitting a random-access 4 KiB read into
(1) a random-access seek
(2) a sequential 4 KiB read.
What is really costly is the random-access seek (8 ms per seek), not the 4 KiB read (0.066 ms per 4 KiB read).
Thus, the system should minimize the total number of seeks far more than the total number of sequential 4 KiB reads.

> Your system can't guess the required new
> memory usage at T+1, so you can't ask it to restore the T-1 state.

Well, that's what heuristics are for. One simple example: If there have been 50 swap-ins within 1 second and there is currently more free memory than current swap usage, then go into sequential read mode. (There will be more sophisticated heuristics which take into account measured device speeds, other device usage and other data).
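To make this concrete, here is a minimal userspace sketch of such a heuristic (illustrative only; the thresholds and the final action are placeholders taken from the example above, and a real implementation would of course live inside the kernel):

#!/bin/sh
# Count swap-ins over one second via /proc/vmstat (pswpin is in pages).
pswpin_before=$(awk '/^pswpin/ {print $2}' /proc/vmstat)
sleep 1
pswpin_after=$(awk '/^pswpin/ {print $2}' /proc/vmstat)
swapins=$((pswpin_after - pswpin_before))

# Compare free RAM against currently used swap (both in KiB).
free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
swap_total_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free_kb=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
swap_used_kb=$((swap_total_kb - swap_free_kb))

if [ "$swapins" -ge 50 ] && [ "$free_kb" -gt "$swap_used_kb" ]; then
    echo "switch to sequential swap-in mode here"
fi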
Comment 3 Xu 2009-02-19 04:04:03 UTC
Ah, by the way, the same kind of linear burst swap-in should also apply when a swap device is deactivated using

# swapoff /dev/someswapdevice

Currently, this does not seem to be the case.


Maybe it is easier to implement linear burst swap-in just for swapoff. Then, a userspace application could trigger a

# swapoff /dev/someswapdevice
# swapon /dev/someswapdevice

sequence, leading to roughly the same effect as described above.
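The heuristic sketch in comment #2 could simply run this swapoff/swapon pair once its condition triggers. One caveat: if the swapped-out contents do not fit into free RAM, swapoff(2) can fail with ENOMEM, so checking free memory against current swap usage beforehand (as in that sketch) is still needed.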
Comment 4 Francois Cartegnie 2009-02-19 18:01:19 UTC
(In reply to comments #2 and #3)

> No. Because we have more free memory than the current swap usage, we can swap
> in all pages without the penalty of requiring a swapout of other pages.

Well, it's not a question of free memory. The system will swap out pages, or keep pages in swap, even if you have free memory.
This swap behaviour is controlled by /proc/sys/vm/swappiness.
By default, the system will always start to swap before the memory is full.
So you'll need something like 20% more physical memory than disk swap.
What you describe is only for swappiness 0.
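For reference, the value can be inspected and changed at runtime (60 is the usual default; the 10 below is just an example):

# cat /proc/sys/vm/swappiness
60
# echo 10 > /proc/sys/vm/swappiness
(or: # sysctl -w vm.swappiness=10)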

> Well, that's what heuristics are for. One simple example: If there have been
> 50
> swap-ins within 1 second and there is currently more free memory than current
> swap usage, then go into sequential read mode. (There will be more
> sophisticated heuristics which take into account measured device speeds,
> other
> device usage and other data).

Well, a heuristic already exists: that's what's behind swappiness.

If I were to write my own heuristic to keep swap usage minimal, this is what I would do for your case: a process has claimed a lot of memory until T-1 (bloat), and all those pages are now freed at time T. The system should wait a bit from T+1 and see what is required by processes at T+2, T+3, ... (not only 1 second later), because the system's recent activity was to allocate lots of RAM until T. If no memory is required by processes, then gradually increase the quantity of pages fetched back from swap.
But this wouldn't solve your problem, as there is still a delay for restoring the content into physical memory.

Anyway, the Linux swap control is really basic, and maybe we could have a few more parameters for tuning it, as can already be done for controlling locked pages.
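For what it's worth, one tunable that already exists in this area is /proc/sys/vm/page-cluster, which controls how many consecutive pages are read from swap in a single attempt (2^page-cluster pages, so the default of 3 means 8 pages). It only tunes swap read-ahead granularity, not the burst swap-in behaviour asked for here, but it shows where such a knob could live:

# cat /proc/sys/vm/page-cluster
3
# echo 5 > /proc/sys/vm/page-cluster   (read up to 32 pages per attempt)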

># swapoff /dev/someswapdevice
>Currently, this seems to be not the case.

I agree on this. In that case, if the pages can fit in RAM, swapoff should restore them as fast as possible.
Comment 5 Alan 2012-10-30 15:04:58 UTC
If this is still seen on modern kernels, then please re-open/update.
