Bug 11156 - Old kernels copy memory faster than new
Status: CLOSED CODE_FIX
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
Hardware: All
OS: Linux
Importance: P1 normal
Assigned To: Jens Axboe
Depends on:
Blocks:
Reported: 2008-07-24 10:57 UTC by smalcom
Modified: 2008-07-28 00:58 UTC

Kernel Version: 2.6.24, 2.6.25
Tree: Mainline
Regression: Yes


Description smalcom 2008-07-24 10:57:42 UTC
Latest working kernel version: 2.6.23.5
Earliest failing kernel version: 2.6.24
Distribution: Slackware 10-12

First machine:
CPU - AMD Athlon 3600+ (2 GHz)
Chipset - nForce 6150(MCP51)
RAM - 3G DDR2
Video - internal GeForce6150
Kernel - 2.6.25.4(own built)
Copy speed - 1.7GByte/s

on another kernel:
Kernel - 2.6.23.5(own built)
Copy speed - 43.5GByte/s
--------------------------------------------
Second machine:
CPU - PII-350
MB i440BX
RAM - 128M SDRAM
Video - 3DFX Voodoo3
Kernel - 2.6.21.5 (vanilla, from the Slackware distribution)
Copy speed - 11.3GByte/s

Steps to reproduce:
dd if=/dev/zero of=/dev/null bs=16M count=10000
Comment 1 Anonymous Emailer 2008-07-24 12:27:14 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 24 Jul 2008 10:57:42 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=11156
> 
> Kernel - 2.6.25.4(own built)
> Copy speed - 1.7GByte/s
> 
> Kernel - 2.6.23.5(own built)
> Copy speed - 43.5GByte/s
> 
> Steps to reproduce:
> dd if=/dev/zero of=/dev/null bs=16M count=10000

lol.  OK, who did that?

Perhaps ZERO_PAGE changes?

Comment 2 smalcom 2008-07-24 13:14:55 UTC
>>(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).

I don't like mailing lists; they are not very human-friendly.

>>lol.  OK, who did that?
court tester.

>>Perhaps ZERO_PAGE changes?
I am not a kernel developer. I will trace the program tomorrow.
Comment 3 Nick Piggin 2008-07-25 00:28:58 UTC
(In reply to comment #2)

It is probably the ZERO_PAGE changes. Thank you for noticing this and reporting it, but I think the issue is simply that we changed the way /dev/zero (and a few other things) works. It is known to be slower in benchmarks like this, but it should be fine for more useful work.

Rather than use /dev/zero as the input, maybe you could just use some malloc allocated memory or a tmpfs file for example. If the slowdown disappears, that would confirm it is the /dev/zero changes.
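Nick's suggestion can be tried with something like the following sketch (file names are illustrative, and it assumes /dev/shm is tmpfs-mounted, as on most distributions):

```shell
# Create a tmpfs-backed file once, then time copies from it instead of
# /dev/zero; reading a regular tmpfs file does not go through the
# /dev/zero (ZERO_PAGE) path.
dd if=/dev/zero of=/dev/shm/ddtest bs=1M count=64 2>/dev/null
dd if=/dev/shm/ddtest of=/dev/null bs=1M count=64
rm /dev/shm/ddtest
```

If the slowdown disappears with the tmpfs file as input, that points at the /dev/zero changes.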
Comment 4 smalcom 2008-07-27 16:05:57 UTC
Mmm
>>extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
Does every call do a division?

But for a 32-bit x86 CPU, unsigned long is 32-bit.

PS. I have no time to test right now. I want to try building a kernel with the old ZERO_PAGE and look at the result... I think tomorrow.
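For what it's worth, the division in that declaration is a constant expression: PAGE_SIZE / sizeof(unsigned long) is folded by the compiler into a fixed array length, so no division is executed per call. The arithmetic, assuming the usual 4096-byte x86 page:

```shell
# Array length of empty_zero_page for a 4096-byte page:
echo $((4096 / 4))   # 32-bit x86: sizeof(unsigned long) == 4 -> 1024 elements
echo $((4096 / 8))   # 64-bit x86: sizeof(unsigned long) == 8 -> 512 elements
```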
Comment 5 smalcom 2008-07-27 16:13:41 UTC
>>or a tmpfs file for example
On /dev/ram the faster PC shows a better result.
Comment 6 Nick Piggin 2008-07-27 20:33:33 UTC
The issue is that the old code to read from /dev/zero would do tricks to make it very fast, but it was a significant complication to the virtual memory manager, which was deemed not useful for real world applications.

Performance-critical code would not be reading swaths of zeroes like this (it is useless work; the result is already known to be zero). Of course there may be corner cases where some real workload suffers, but we have not run into one yet, and this does not look like one either.

But thanks for reporting. It is very good to know that people are keeping an eye on things like this; it is very helpful.

Can we close this bug?
Comment 7 Hugh Dickins 2008-07-28 10:13:44 UTC
On Thu, 24 Jul 2008, Andrew Morton wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=11156
> > 
> > Kernel - 2.6.25.4(own built)
> > Copy speed - 1.7GByte/s
> > 
> > Kernel - 2.6.23.5(own built)
> > Copy speed - 43.5GByte/s
> > 
> > Steps to reproduce:
> > dd if=/dev/zero of=/dev/null bs=16M count=10000
> 
> lol.  OK, who did that?
> 
> Perhaps ZERO_PAGE changes?

Yes, the ZERO_PAGE changes: readprofile clearly shows lots of time
in clear_user() on 2.6.24 onwards, clearing each page instead of
using the ZERO_PAGE.

I see Nick has already answered this, and the bug is now closed
(guess he's on 2.6.23 whereas I'm on later ;).  I agree with him,
copying from /dev/zero to /dev/null is not an operation which
deserves VM tricks to optimize; but I wanted to add one point.

The particular awfulness of those dd rates (on machines I've
tried I see new kernels as 10 to 30 times worse than old kernels
at that test) owes a lot to the large blocksize (16M) being used.

That blocksize will not fit in the processor's memory cache, so
repeatedly clearing the pages is very slow.  Bring the blocksize
down to something that easily fits in the L2 cache, perhaps 1M or
256k, and new kernels then appear only twice(ish) as bad as old.

Nothing to be proud of, but not nearly so bad as the bs=16M case.

Hugh
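Hugh's block-size point is easy to check by holding the total transfer constant and varying only bs (sizes here are illustrative, and smaller than the original test so it runs quickly):

```shell
# Same 1 GiB total, different block sizes. The 16M buffer cannot stay in
# L2 cache, so each page is cleared from main memory; the 256k buffer
# fits in a typical L2, and the gap versus old kernels shrinks sharply.
dd if=/dev/zero of=/dev/null bs=16M count=64
dd if=/dev/zero of=/dev/null bs=256k count=4096
```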

