Bug 64121 - [BISECTED] "mm" performance regression updating from 3.2 to 3.3
Summary: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
Status: NEW
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-31 10:53 UTC by Thomas Jarosch
Modified: 2016-07-29 17:00 UTC
CC List: 0 users

See Also:
Kernel Version: 3.3
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Dmesg output (15.10 KB, application/octet-stream)
2013-10-31 10:53 UTC, Thomas Jarosch
Details
Kernel config (69.69 KB, text/plain)
2013-10-31 10:54 UTC, Thomas Jarosch
Details

Description Thomas Jarosch 2013-10-31 10:53:47 UTC
Created attachment 112881 [details]
Dmesg output

Hi,

I've updated a production box from kernel 3.0.x to 3.4.67.
This caused a severe I/O performance regression.

After some hours I've bisected it down to this commit:

---------------------------
# git bisect good
ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d is the first bad commit
commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
Author: Johannes Weiner <jweiner@redhat.com>
Date:   Tue Jan 10 15:07:42 2012 -0800

    mm: exclude reserved pages from dirtyable memory

    Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    ...
---------------------------

With the "problematic" patch:
# dd_rescue -A /dev/zero img.disk
dd_rescue: (info): ipos:     15296.0k, opos:     15296.0k, xferd:     15296.0k
                   errs:      0, errxfer:         0.0k, succxfer:     15296.0k
             +curr.rate:      681kB/s, avg.rate:      681kB/s, avg.load:  0.3%


Without the patch (using 25bd91bd27820d5971258cecd1c0e64b0e485144):
# dd_rescue -A /dev/zero img.disk
dd_rescue: (info): ipos:    293888.0k, opos:    293888.0k, xferd:    293888.0k
                   errs:      0, errxfer:         0.0k, succxfer:    293888.0k
             +curr.rate:    99935kB/s, avg.rate:    51625kB/s, avg.load:  3.3%



The kernel is 32bit using PAE mode. The system has 32GB of RAM.
(compiled with "gcc (GCC) 4.4.4 20100630 (Red Hat 4.4.4-10)")

Interestingly: If I limit the amount of RAM to roughly 20GB
via the "mem=20000m" boot parameter, the performance is fine.
When I increase it to e.g. "mem=23000m", performance is bad.

Also tested kernel 3.10.17 in 32bit + PAE mode;
it was fine out of the box.


So basically we need a fix for the LTS kernel 3.4; I can work around
this issue with "mem=20000m" until I upgrade to 3.10.

I'll probably have access to the hardware for one more week
to test patches, it was lent to me to debug this specific problem.

The same issue appeared on a completely different machine in July
using the same 3.4.x kernel. The box had 16GB of RAM.
I didn't get a chance to access the hardware back then.

Attached is the dmesg output and my kernel config.

HTH,
Thomas
Comment 1 Thomas Jarosch 2013-10-31 10:54:13 UTC
Created attachment 112891 [details]
Kernel config
Comment 2 Andrew Morton 2013-10-31 20:46:15 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 31 Oct 2013 10:53:47 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=64121
> 
>             Bug ID: 64121
>            Summary: [BISECTED] "mm" performance regression updating from
>                     3.2 to 3.3
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 3.3
>           Hardware: i386
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: akpm@linux-foundation.org
>           Reporter: thomas.jarosch@intra2net.com
>         Regression: No
> 
> Created attachment 112881 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=112881&action=edit
> Dmesg output
> 
> Hi,
> 
> I've updated a productive box running kernel 3.0.x to 3.4.67.
> This caused a severe I/O performance regression.
> 
> After some hours I've bisected it down to this commit:
> 
> ---------------------------
> # git bisect good
> ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d is the first bad commit
> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> Author: Johannes Weiner <jweiner@redhat.com>
> Date:   Tue Jan 10 15:07:42 2012 -0800
> 
>     mm: exclude reserved pages from dirtyable memory
> 
>     Per-zone dirty limits try to distribute page cache pages allocated for
>     writing across zones in proportion to the individual zone sizes, to
>     reduce
>     the likelihood of reclaim having to write back individual pages from the
>     LRU lists in order to make progress.
> 
>     ...
> ---------------------------
> 
> With the "problematic" patch:
> # dd_rescue -A /dev/zero img.disk
> dd_rescue: (info): ipos:     15296.0k, opos:     15296.0k, xferd:    
> 15296.0k
>                    errs:      0, errxfer:         0.0k, succxfer:    
>                    15296.0k
>              +curr.rate:      681kB/s, avg.rate:      681kB/s, avg.load: 
>              0.3%
> 
> 
> Without the patch (using 25bd91bd27820d5971258cecd1c0e64b0e485144):
> # dd_rescue -A /dev/zero img.disk
> dd_rescue: (info): ipos:    293888.0k, opos:    293888.0k, xferd:   
> 293888.0k
>                    errs:      0, errxfer:         0.0k, succxfer:   
>                    293888.0k
>              +curr.rate:    99935kB/s, avg.rate:    51625kB/s, avg.load: 
>              3.3%
> 
> 
> 
> The kernel is 32bit using PAE mode. The system has 32GB of RAM.
> (compiled with "gcc (GCC) 4.4.4 20100630 (Red Hat 4.4.4-10)")
> 
> Interestingly: If I limit the amount of RAM to roughly 20GB
> via the "mem=20000m" boot parameter, the performance is fine.
> When I increase it to f.e. "mem=23000m", performance is bad.
> 
> Also tested kernel 3.10.17 in 32bit + PAE mode,
> it was fine out of the box.
> 
> 
> So basically we need a fix for the LTS kernel 3.4, I can work around
> this issue with "mem=20000m" until I upgrade to 3.10.
> 
> I'll probably have access to the hardware for one more week
> to test patches, it was lent to me to debug this specific problem.
> 
> The same issue appeared on a complete different machine in July
> using the same 3.4.x kernel. The box had 16GB of RAM.
> I didn't get a chance to access the hardware back then.
> 
> Attached is the dmesg output and my kernel config.

32GB of memory on a highmem machine just isn't going to work well,
sorry.  Our rule of thumb is that 16G is the max.  If it was previously
working OK with 32G then you were very lucky!

That being said, we should try to work out exactly why that commit
caused the big slowdown - perhaps there is something we can do to
restore things.  It appears that the (small?) decrease in the per-zone
dirty limit is what kicked things over - perhaps we can permit that to
be tuned back again.  Or something.  Johannes, could you please have a
think about it?
Comment 3 Johannes Weiner 2013-11-01 18:43:43 UTC
On Thu, Oct 31, 2013 at 01:46:10PM -0700, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Thu, 31 Oct 2013 10:53:47 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=64121
> > 
> >             Bug ID: 64121
> >            Summary: [BISECTED] "mm" performance regression updating from
> >                     3.2 to 3.3
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 3.3
> >           Hardware: i386
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Other
> >           Assignee: akpm@linux-foundation.org
> >           Reporter: thomas.jarosch@intra2net.com
> >         Regression: No
> > 
> > Created attachment 112881 [details]
> >   --> https://bugzilla.kernel.org/attachment.cgi?id=112881&action=edit
> > Dmesg output
> > 
> > Hi,
> > 
> > I've updated a productive box running kernel 3.0.x to 3.4.67.
> > This caused a severe I/O performance regression.
> > 
> > After some hours I've bisected it down to this commit:
> > 
> > ---------------------------
> > # git bisect good
> > ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d is the first bad commit
> > commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> > Author: Johannes Weiner <jweiner@redhat.com>
> > Date:   Tue Jan 10 15:07:42 2012 -0800
> > 
> >     mm: exclude reserved pages from dirtyable memory
> > 
> >     Per-zone dirty limits try to distribute page cache pages allocated for
> >     writing across zones in proportion to the individual zone sizes, to
> reduce
> >     the likelihood of reclaim having to write back individual pages from
> the
> >     LRU lists in order to make progress.
> > 
> >     ...
> > ---------------------------
> > 
> > With the "problematic" patch:
> > # dd_rescue -A /dev/zero img.disk
> > dd_rescue: (info): ipos:     15296.0k, opos:     15296.0k, xferd:    
> 15296.0k
> >                    errs:      0, errxfer:         0.0k, succxfer:    
> 15296.0k
> >              +curr.rate:      681kB/s, avg.rate:      681kB/s, avg.load: 
> 0.3%
> > 
> > 
> > Without the patch (using 25bd91bd27820d5971258cecd1c0e64b0e485144):
> > # dd_rescue -A /dev/zero img.disk
> > dd_rescue: (info): ipos:    293888.0k, opos:    293888.0k, xferd:   
> 293888.0k
> >                    errs:      0, errxfer:         0.0k, succxfer:   
> 293888.0k
> >              +curr.rate:    99935kB/s, avg.rate:    51625kB/s, avg.load: 
> 3.3%
> > 
> > 
> > 
> > The kernel is 32bit using PAE mode. The system has 32GB of RAM.
> > (compiled with "gcc (GCC) 4.4.4 20100630 (Red Hat 4.4.4-10)")
> > 
> > Interestingly: If I limit the amount of RAM to roughly 20GB
> > via the "mem=20000m" boot parameter, the performance is fine.
> > When I increase it to f.e. "mem=23000m", performance is bad.
> > 
> > Also tested kernel 3.10.17 in 32bit + PAE mode,
> > it was fine out of the box.
> > 
> > 
> > So basically we need a fix for the LTS kernel 3.4, I can work around
> > this issue with "mem=20000m" until I upgrade to 3.10.
> > 
> > I'll probably have access to the hardware for one more week
> > to test patches, it was lent to me to debug this specific problem.
> > 
> > The same issue appeared on a complete different machine in July
> > using the same 3.4.x kernel. The box had 16GB of RAM.
> > I didn't get a chance to access the hardware back then.
> > 
> > Attached is the dmesg output and my kernel config.
> 
> 32GB of memory on a highmem machine just isn't going to work well,
> sorry.  Our rule of thumb is that 16G is the max.  If it was previously
> working OK with 32G then you were very lucky!
> 
> That being said, we should try to work out exactly why that commit
> caused the big slowdown - perhaps there is something we can do to
> restore things.  It appears that the (small?) decrease in the per-zone
> dirty limit is what kicked things over - perhaps we can permit that to
> be tuned back again.  Or something.  Johannes, could you please have a
> think about it?

It is a combination of two separate things on these setups.

Traditionally, only lowmem is considered dirtyable so that dirty pages
don't scale with highmem and the kernel doesn't overburden itself with
lowmem pressure from buffers etc.  This is purely about accounting.

My patches on the other hand were about dirty page placement and
avoiding writeback from page reclaim: by subtracting the watermark and
the lowmem reserve (memory not available for user memory / cache) from
each zone's dirtyable memory, we make sure that the zone can always be
rebalanced without writeback.

The problem now is that the lowmem reserves scale with highmem and
there is a point where they entirely overshadow the Normal zone.  This
means that no page cache at all is allowed in lowmem.  Combine this
with how dirtyable memory excludes highmem, and the sum of all
dirtyable memory is nil.  This effectively disables the writeback
cache.
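
Roughly, the per-zone arithmetic looks like this (a simplified sketch
of the logic described above, not the actual mm/page-writeback.c code;
the names are made up):

static unsigned long zone_dirtyable_sketch(unsigned long zone_pages,
                                           unsigned long high_wmark,
                                           unsigned long lowmem_reserve)
{
        unsigned long unavailable = high_wmark + lowmem_reserve;

        /* The lowmem reserve scales with highmem; once it (plus the
         * watermark) covers the whole Normal zone, nothing in that
         * zone counts as dirtyable anymore. */
        if (zone_pages <= unavailable)
                return 0;

        return zone_pages - unavailable;
}

/* Global dirtyable memory is the sum of this over the lowmem zones
 * only (highmem is excluded), so on these boxes it ends up ~zero. */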

I figure that if anything should be fixed, it's the full exclusion
of highmem from dirtyable memory, and that we should find a better
way to calculate a minimum.

HOWEVER,

the lowmem reserve is highmem/32 by default.  With a Normal zone of
around 900M, this requires 28G+ worth of HighMem to eclipse lowmem
entirely.  This is almost double what you consider still okay...
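
(Back-of-the-envelope: 900M x 32 = 28.8G, so it takes roughly that
much highmem before the reserve alone covers the entire Normal zone.)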

So how would we even pick a sane minimum of dirtyable memory on these
machines?  It's impossible to pick something and say this should work
for most people; those setups are barely working to begin with.  Plus,
people can always set the vm.highmem_is_dirtyable sysctl to 1 or just
set dirty memory limits with dirty_bytes and dirty_background_bytes to
something that gets their crazy setups limping again.
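
(For example, something along the lines of "sysctl -w
vm.highmem_is_dirtyable=1", or explicit byte limits like "sysctl -w
vm.dirty_background_bytes=67108864" and "sysctl -w
vm.dirty_bytes=134217728" -- the exact values would depend on the box.)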

Maybe we should just ignore everything above 16G on 32 bit, but that
would mean actively breaking setups that _individually_ worked before
and never actually hit problems due to their specific circumstances.

On the other hand, I don't think it's reasonable to support this
anymore and it should be more clear that people doing these things are
on their own.

What makes it worse is that all of these reports have been from modern
64 bit machines, with modern amounts of memory, running 32 bit kernels.
I'd be more inclined to seriously look into this if it were hardware
that couldn't just run a 64 bit kernel...
Comment 4 Thomas Jarosch 2013-11-04 11:54:35 UTC
On Friday, 1. November 2013 14:43:32 Johannes Weiner wrote:
> Maybe we should just ignore everything above 16G on 32 bit, but that
> would mean actively breaking setups that _individually_ worked before
> and never actually hit problems due to their specific circumstances.
> 
> On the other hand, I don't think it's reasonable to support this
> anymore and it should be more clear that people doing these things are
> on their own.
> 
> What makes it worse is that all of these reports have been modern 64
> bit machines, with modern amounts of memory, running 32 bit kernels.
> I'd be more inclined to seriously look into this if it were hardware
> that couldn't just run a 64 bit kernel...

Thanks for your detailed analysis!

It's good to know the exact cause of this. Other people with
the same symptoms can now stumble upon this problem report.

We run the same distribution on 32 bit and 64 bit CPUs, which is why we
haven't upgraded to 64 bit yet. For our purposes, 16 GB of RAM is more than
enough. So I've implemented a small hack to limit the memory to 16 GB.
That gives way better performance than e.g. a memory limit of 20 GB.


Limit to 20 GB (for comparison):
# dd_rescue /dev/zero disk.img
dd_rescue: (info): ipos:    293888.0k, opos:    293888.0k, xferd:    293888.0k
                   errs:      0, errxfer:         0.0k, succxfer:    293888.0k
             +curr.rate:    99935kB/s, avg.rate:    51625kB/s, avg.load:  3.3%


With the new 16GB limit:
dd_rescue: (info): ipos:   1638400.0k, opos:   1638400.0k, xferd:   1638400.0k
                   errs:      0, errxfer:         0.0k, succxfer:   1638400.0k
             +curr.rate:    83685kB/s, avg.rate:    81205kB/s, avg.load:  6.1%


-> Limiting to 16GB with an "override" boot parameter for people
who really need more RAM might be a good idea even for mainline.


---hackish patch----------------------------------------------------------
Limit memory to 16 GB. See kernel bugzilla #64121.

diff -u -r -p linux.orig/arch/x86/mm/init_32.c linux.i2n/arch/x86/mm/init_32.c
--- linux.orig/arch/x86/mm/init_32.c	2013-11-04 11:52:55.881152576 +0100
+++ linux.i2n/arch/x86/mm/init_32.c	2013-11-04 11:52:01.309151985 +0100
@@ -621,6 +621,13 @@ void __init highmem_pfn_init(void)
 	}
 #endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
+#ifdef CONFIG_HIGHMEM64G
+	/* Intra2net: Limit memory to 16GB */
+	if (max_pfn > MAX_NONPAE_PFN * 4) {
+		max_pfn = MAX_NONPAE_PFN * 4;
+		printk(KERN_WARNING "Limited memory to 16GB. See kernel bugzilla #64121\n");
+	}
+#endif
 }
 
 /*
--------------------------------------------------------------------------
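
(For reference: if I read the headers right, MAX_NONPAE_PFN is the
number of 4K page frames addressable without PAE, i.e. 1 << 20 for
4GB, so MAX_NONPAE_PFN * 4 caps max_pfn at 16GB.)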

Thanks again for your help,
Thomas
Comment 5 Thomas Jarosch 2016-07-18 22:32:14 UTC
Hi Johannes,

referring to an old kernel bugzilla issue:
https://bugzilla.kernel.org/show_bug.cgi?id=64121

On 01.11.2013 at 19:43, Johannes Weiner wrote:
> It is a combination of two separate things on these setups.
> 
> Traditionally, only lowmem is considered dirtyable so that dirty pages
> don't scale with highmem and the kernel doesn't overburden itself with
> lowmem pressure from buffers etc.  This is purely about accounting.
> 
> My patches on the other hand were about dirty page placement and
> avoiding writeback from page reclaim: by subtracting the watermark and
> the lowmem reserve (memory not available for user memory / cache) from
> each zone's dirtyable memory, we make sure that the zone can always be
> rebalanced without writeback.
> 
> The problem now is that the lowmem reserves scale with highmem and
> there is a point where they entirely overshadow the Normal zone.  This
> means that no page cache at all is allowed in lowmem.  Combine this
> with how dirtyable memory excludes highmem, and the sum of all
> dirtyable memory is nil.  This effectively disables the writeback
> cache.
> 
> I figure if anything should be fixed it should be the full exclusion
> of highmem from dirtyable memory and find a better way to calculate a
> minimum.

Recently we've updated our production mail server from 3.14.69
to 3.14.73 and it worked fine for a few days. When the box is really
busy (= incoming malware via email), the I/O speed drops to a crawl;
write speed is about 5 MB/s on Intel SSDs. Yikes.

The box has 16GB RAM, so it should be a safe HIGHMEM configuration.

Downgrading to 3.14.69 or booting with "mem=15000M" works. I've tested
both approaches and the box was stable. Booting 3.14.73 again triggered
the problem within minutes.

Clearly something with the automatic calculation of the lowmem reserve
crossed a tipping point again, even with the previously considered safe
amount of 16GB RAM for HIGHMEM configs. I don't see anything obvious in
the changelogs from 3.14.69 to 3.14.73, but I might have missed it.

> HOWEVER,
> 
> the lowmem reserve is highmem/32 per default.  With a Normal zone of
> around 900M, this requires 28G+ worth of HighMem to eclipse lowmem
> entirely.  This is almost double of what you consider still okay...

Is there a way to read out the calculated lowmem reserve via /proc?

It might be interesting to see the lowmem reserve
when booted with mem=15000M or kernel 3.14.69 for comparison.

Do you think it might be worth tinkering with "lowmem_reserve_ratio"?
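
(If I read Documentation/sysctl/vm.txt correctly, the reserve is
roughly "zone size / ratio", so raising the values in
/proc/sys/vm/lowmem_reserve_ratio should shrink the reserve.)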


/proc/meminfo from the box using "mem=15000M" + kernel 3.14.73:

MemTotal:       15001512 kB
HighTotal:      14219160 kB
HighFree:        9468936 kB
LowTotal:         782352 kB
LowFree:          117696 kB
Slab:             430612 kB
SReclaimable:     416752 kB
SUnreclaim:        13860 kB


/proc/meminfo from a similar machine with 16GB RAM + kernel 3.14.73:
(though that machine is just a firewall, so no real disk I/O)

MemTotal:       16407652 kB
HighTotal:      15636376 kB
HighFree:       14415472 kB
LowTotal:         771276 kB
LowFree:          562852 kB
Slab:              34712 kB
SReclaimable:      20888 kB
SUnreclaim:        13824 kB


Any help is appreciated,
Thomas
Comment 6 Vlastimil Babka 2016-07-21 14:02:17 UTC
On 07/19/2016 12:23 AM, Thomas Jarosch wrote:
> Hi Johannes,
>
> referring to an old kernel bugzilla issue:
> https://bugzilla.kernel.org/show_bug.cgi?id=64121
>
> On 01.11.2013 at 19:43, Johannes Weiner wrote:
>> It is a combination of two separate things on these setups.
>>
>> Traditionally, only lowmem is considered dirtyable so that dirty pages
>> don't scale with highmem and the kernel doesn't overburden itself with
>> lowmem pressure from buffers etc.  This is purely about accounting.
>>
>> My patches on the other hand were about dirty page placement and
>> avoiding writeback from page reclaim: by subtracting the watermark and
>> the lowmem reserve (memory not available for user memory / cache) from
>> each zone's dirtyable memory, we make sure that the zone can always be
>> rebalanced without writeback.
>>
>> The problem now is that the lowmem reserves scale with highmem and
>> there is a point where they entirely overshadow the Normal zone.  This
>> means that no page cache at all is allowed in lowmem.  Combine this
>> with how dirtyable memory excludes highmem, and the sum of all
>> dirtyable memory is nil.  This effectively disables the writeback
>> cache.
>>
>> I figure if anything should be fixed it should be the full exclusion
>> of highmem from dirtyable memory and find a better way to calculate a
>> minimum.
>
> recently we've updated our production mail server from 3.14.69
> to 3.14.73 and it worked fine for a few days. When the box is really
> busy (=incoming malware via email), the I/O speed drops to crawl,
> write speed is about 5 MB/s on Intel SSDs. Yikes.
>
> The box has 16GB RAM, so it should be a safe HIGHMEM configuration.
>
> Downgrading to 3.14.69 or booting with "mem=15000M" works. I've tested
> both approaches and the box was stable. Booting 3.14.73 again triggered
> the problem within minutes.
>
> Clearly something with the automatic calculation of the lowmem reserve
> crossed a tipping point again, even with the previously considered safe
> amount of 16GB RAM for HIGHMEM configs. I don't see anything obvious in
> the changelogs from 3.14.69 to 3.14.73, but I might have missed it.

I don't see anything either; it might be some change under fs/, for
example. How about a git bisect?

>> HOWEVER,
>>
>> the lowmem reserve is highmem/32 per default.  With a Normal zone of
>> around 900M, this requires 28G+ worth of HighMem to eclipse lowmem
>> entirely.  This is almost double of what you consider still okay...
>
> is there a way to read out the calculated lowmem reserve via /proc?

Probably not, but it might be possible with a live crash session.

> It might be interesting the see the lowmem reserve
> when booted with mem=15000M or kernel 3.14.69 for comparison.
>
> Do you think it might be worth tinkering with "lowmem_reserve_ratio"?
>
>
> /proc/meminfo from the box using "mem=15000M" + kernel 3.14.73:
>
> MemTotal:       15001512 kB
> HighTotal:      14219160 kB
> HighFree:        9468936 kB
> LowTotal:         782352 kB
> LowFree:          117696 kB
> Slab:             430612 kB
> SReclaimable:     416752 kB
> SUnreclaim:        13860 kB
>
>
> /proc/meminfo from a similar machine with 16GB RAM + kernel 3.14.73:
> (though that machine is just a firewall, so no real disk I/O)
>
> MemTotal:       16407652 kB
> HighTotal:      15636376 kB
> HighFree:       14415472 kB
> LowTotal:         771276 kB
> LowFree:          562852 kB
> Slab:              34712 kB
> SReclaimable:      20888 kB
> SUnreclaim:        13824 kB
>
>
> Any help is appreciated,
> Thomas
>
Comment 7 Thomas Jarosch 2016-07-27 09:18:43 UTC
On Thursday, 21. July 2016 16:02:06 Vlastimil Babka wrote:
> > recently we've updated our production mail server from 3.14.69
> > to 3.14.73 and it worked fine for a few days. When the box is really
> > busy (=incoming malware via email), the I/O speed drops to crawl,
> 
> I don't see anything either, might be some change e.g. under fs/ though.
> How about git bisect?

One day later I failed to trigger it, so no easy git bisect.

Yesterday another busy mail server showed the same problem during backup 
creation. This time I knew about slabtop and could see that the 
ext4_inode_cache occupied about 393MB of the 776MB total low memory.
Write speed was down to 25 MB/s.

"sysctl -w vm.drop_caches=3" cleared the inode cache
and the write speed was back to 300 MB/s.
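
(For the record: drop_caches=1 frees the page cache, =2 the
reclaimable slab caches such as dentries and inodes, =3 both.)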

It might be related to memory fragmentation of low memory due to the
inode cache; the mail server has over 1.400.000 millions files.

I suspect the problem is unrelated to 3.14.73 per se; it seems to trigger
depending on how busy the machine is and on the memory layout.

A 64 bit kernel (even with a 32 bit userspace) is the proper solution here.
Still, that would mean deprecating working 32 bit only boxes.

Cheers,
Thomas
Comment 8 Thomas Jarosch 2016-07-27 09:30:41 UTC
On Wednesday, 27. July 2016 11:18:36 Thomas Jarosch wrote:
> It might be related to memory fragmentation of low memory due to the
> inode cache, the mail server has over 1.400.000 millions files.

1.400.000 files of course. Millions would be a bit much :)

Thomas
Comment 9 Linus Torvalds 2016-07-27 16:44:03 UTC
On Wed, Jul 27, 2016 at 2:18 AM, Thomas Jarosch
<thomas.jarosch@intra2net.com> wrote:
>
> Yesterday another busy mail server showed the same problem during backup
> creation. This time I knew about slabtop and could see that the
> ext4_inode_cache occupied about 393MB of the 776MB total low memory.

Honestly, we're never going to really fix the problem with low memory
on 32-bit kernels. PAE is a horrible hardware hack, and it was always
very fragile. It's only going to get more fragile as fewer and fewer
people are running 32-bit environments in any big way.

Quite frankly, 32GB of RAM on a 32-bit kernel is so crazy as to be
ludicrous, and nobody sane will support that. Run 32-bit user space by
all means, but the kernel needs to be 64-bit if you have more than 8GB
of RAM.

Realistically, PAE is "workable" up to approximately 8GB of physical
RAM, where the exact limit depends on your workload.

So if the bulk of your memory use is just user-space processes, then
you can more comfortably run with more memory (so 8GB or even 16GB of
RAM might work quite well).

And as mentioned, things are getting worse, and not better. We cared
much more deeply about PAE back in the 2.x timeframe. Back then, it
was a primary target, and you would find people who cared. These days,
it simply isn't. These days, the technical solution to PAE literally
is "just run a 64-bit kernel".

                   Linus
Comment 10 Thomas Jarosch 2016-07-29 17:00:58 UTC
On Wednesday, 27. July 2016 09:44:00 Linus Torvalds wrote:
> Quite frankly, 32GB of RAM on a 32-bit kernel is so crazy as to be
> ludicrous, and nobody sane will support that. Run 32-bit user space by
> all means, but the kernel needs to be 64-bit if you have more than 8GB
> of RAM.

Thanks for the detailed explanation.

Upgrading to a 64-bit kernel with a 32-bit userspace is the mid-term plan,
which might turn into a short-term plan given the occasional hiccup
with PAE / low memory pressure.

Something tells me there might be issues with mISDN using a 64-bit kernel 
with a 32-bit userspace since ISDN is a feature that's not used much 
nowadays either. But that should be more or less easy to solve.

-> I consider the issue "fixed" from my side.

Cheers,
Thomas
