Efficient memcpy()/memmove() for G2/G3 cores...

David Jander david.jander at protonic.nl
Thu Sep 4 22:05:16 EST 2008


On Thursday 04 September 2008 04:04:58 Paul Mackerras wrote:
> prodyut hazarika writes:
> > glibc memxxx for powerpc are horribly inefficient. For optimal
> > performance, we should use the dcbt instruction to establish the source
> > address in cache, and dcbz to establish the destination address in cache.
> > We should do dcbt and dcbz such that the touches happen a line ahead of
> > the actual copy.
> >
> > The problem which I see is that dcbt and dcbz instructions don't work on
> > non-cacheable memory (obviously!). But memxxx functions are used for both
> > cached and non-cached memory. Thus this optimized memcpy should be smart
> > enough to figure out that both source and destination addresses fall in
> > cacheable space, and only then use the optimized dcbt/dcbz instructions.
>
> I would be careful about adding overhead to memcpy.  I found that in
> the kernel, almost all calls to memcpy are for less than 128 bytes (1
> cache line on most 64-bit machines).  So, adding a lot of code to
> detect cacheability and do prefetching is just going to slow down the
> common case, which is short copies.  I don't have statistics for glibc
> but I wouldn't be surprised if most copies were short there also.

Then please explain the following. This is a memcpy() speed test for
different block sizes on an MPC5121e (DIU is turned on). The first case is
the glibc code without optimizations, and the second case uses 16-register
strides with dcbt/dcbz instructions, written in assembly language (see
attachment).

$ ./memcpyspeed
Fully aligned:
100000 chunks of 5 bytes   :  3.48 Mbyte/s ( throughput:  6.96 Mbytes/s)
50000 chunks of 16 bytes   :  14.3 Mbyte/s ( throughput:  28.6 Mbytes/s)
10000 chunks of 100 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)
5000 chunks of 256 bytes   :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
1000 chunks of 1000 bytes  :  14.4 Mbyte/s ( throughput:  28.7 Mbytes/s)
50 chunks of 16384 bytes   :  14.2 Mbyte/s ( throughput:  28.4 Mbytes/s)
1 chunks of 1048576 bytes  :  14.4 Mbyte/s ( throughput:  28.8 Mbytes/s)

$ LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
Fully aligned:
100000 chunks of 5 bytes   :  7.44 Mbyte/s ( throughput:  14.9 Mbytes/s)
50000 chunks of 16 bytes   :  13.1 Mbyte/s ( throughput:  26.2 Mbytes/s)
10000 chunks of 100 bytes  :  29.4 Mbyte/s ( throughput:  58.8 Mbytes/s)
5000 chunks of 256 bytes   :  90.2 Mbyte/s ( throughput:   180 Mbytes/s)
1000 chunks of 1000 bytes  :    77 Mbyte/s ( throughput:   154 Mbytes/s)
50 chunks of 16384 bytes   :  96.8 Mbyte/s ( throughput:   194 Mbytes/s)
1 chunks of 1048576 bytes  :  97.6 Mbyte/s ( throughput:   195 Mbytes/s)

(I have edited the output of this tool to fit into an e-mail without
wrapped lines, for readability.)
Please tell me how on earth there can be such a big difference???
Note that on an MPC5200B this is TOTALLY different, even though both
processors have an e300 core (different versions of it, though).
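
For illustration, the core of the optimized routine boils down to the
following idea (a minimal C sketch with inline assembly, not the attached
code; the names are mine, and it assumes cacheable, 32-byte-aligned buffers
and a length that is a multiple of the 32-byte e300 cache line):

#include <stddef.h>
#include <stdint.h>

#define LINE 32 /* e300 data cache line size */

/* Copy whole cache lines, touching the *next* source line with dcbt
 * and establishing each destination line with dcbz before filling it.
 * dcbt is only a hint, so touching one line past the end of the
 * source is harmless. */
static void copy_lines(uint32_t *dst, const uint32_t *src, size_t len)
{
    for (size_t off = 0; off < len; off += LINE) {
        __asm__ volatile ("dcbt 0,%0" : :
                          "r" ((const char *)src + off + LINE));
        __asm__ volatile ("dcbz 0,%0" : :
                          "r" ((char *)dst + off) : "memory");
        for (int i = 0; i < LINE / 4; i++)
            dst[(off / 4) + i] = src[(off / 4) + i];
    }
}

The attached assembly version additionally strides over 16 registers per
iteration to keep the load/store unit busy, which plain C cannot express.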

> The other thing that I have found is that code that is optimal for
> cache-cold copies is usually significantly slower than optimal for
> cache-hot copies, because the cache management instructions consume
> cycles and don't help in the cache-hot case.
>
> In other words, I don't think we should be tuning the glibc memcpy
> based on tests of how fast it copies multiple megabytes.

I don't just copy multiple megabytes! See the example above. I also do
constant performance testing of different applications using LD_PRELOAD, to
see the impact. Recently I even tried prboom (a free Doom port), to relive
the good old days of PC benchmarking ;-)
I have yet to come across a test that performs worse with this
optimization (on an MPC5121e, that is).
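
In case it is useful to others: the preload trick works because the dynamic
linker resolves memcpy against the first object that exports it, so the
library only needs to export its own memcpy. A minimal, hypothetical shape
of such a library (not the actual contents of libmemcpye300dj.so):

#include <stddef.h>

/* With LD_PRELOAD, this exported symbol wins over glibc's memcpy.
 * A plain byte loop stands in here for the optimized routine. */
void *memcpy(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Build and use (the toolchain prefix is an assumption):
 *   powerpc-linux-gnu-gcc -O2 -fno-builtin -shared -fPIC \
 *       -o libmemcpye300dj.so memcpy.c
 *   LD_PRELOAD=./libmemcpye300dj.so ./memcpyspeed
 * (-fno-builtin keeps GCC from recognizing the loop and turning it
 *  back into a memcpy call, which would recurse.)
 */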

> Still, for 6xx/e300 cores, we probably do want to use dcbt/dcbz for
> larger copies.  We don't want to use dcbt/dcbz on the larger 64-bit

At least for the MPC5121e you really, really need it!!

> processors (POWER4/5/6) because the hardware prefetching and
> write-combining mean that dcbt/dcbz don't help and just slow things
> down.

That's explainable.
What's not explainable are the results I am getting on the MPC5121e.
Please, could someone tell me what I am doing wrong? (I must be doing
something wrong, I'm almost sure.)
One thing that I realize is not quite "right" with memcpyspeed.c is that it
copies consecutive blocks of memory. That should have an impact on the
5-byte and 16-byte copy results, I guess (a cache line for the following
block may already be fetched), but no longer for 100-byte blocks and bigger
(with 32-byte cache lines). In fact, 16 bytes seems to be the only size
where the additional overhead has some impact (and it is negligible).
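
If someone wants to rule that effect out, the blocks could be copied in
shuffled order instead of consecutively. A quick sketch of what I mean
(the names are mine, this is not taken from the attached memcpyspeed.c):

#include <stdlib.h>
#include <string.h>

/* Copy 'count' blocks of 'size' bytes in random order, so the cache
 * line fetched for one block does not conveniently prefetch the next.
 * Error handling omitted for brevity. */
static void copy_scrambled(char *dst, const char *src,
                           size_t count, size_t size)
{
    size_t *order = malloc(count * sizeof(*order));
    for (size_t i = 0; i < count; i++)
        order[i] = i;
    /* Fisher-Yates shuffle */
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = order[i];
        order[i] = order[j];
        order[j] = t;
    }
    for (size_t i = 0; i < count; i++)
        memcpy(dst + order[i] * size, src + order[i] * size, size);
    free(order);
}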

Another thing is that performance probably matters most to the end user
when applications need to copy large amounts of data (e.g. video frames or
bitmap data), which is most probably done with big memcpy() blocks, so
possibly hurting performance for small copies probably has less weight in
the overall experience.
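
Both concerns could of course be reconciled with a simple size cutoff, so
short copies never pay for the cache-management overhead. A rough sketch
(the 128-byte threshold is an assumption to be tuned, and memcpy_e300 is a
name I made up):

#include <stddef.h>
#include <string.h>

#define CACHE_COPY_THRESHOLD 128 /* assumed cutoff; tune per core */

/* Hypothetical dispatcher: below the threshold, take the plain path
 * and skip all cacheability checks; above it, it would align the
 * head and tail and run the dcbt/dcbz line loop from the earlier
 * sketch on the middle. */
void *memcpy_e300(void *dst, const void *src, size_t n)
{
    if (n < CACHE_COPY_THRESHOLD)
        return memcpy(dst, src, n);  /* cheap small-copy path */
    /* ... cacheability check + dcbt/dcbz fast path would go here;
     * plain memcpy stands in for it in this sketch ... */
    return memcpy(dst, src, n);
}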

Best regards,

-- 
David Jander
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memcpy_e300_dj.S
Type: text/x-objcsrc
Size: 5413 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20080904/4a086dc6/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memcpyspeed.c
Type: text/x-csrc
Size: 4073 bytes
Desc: not available
URL: <http://lists.ozlabs.org/pipermail/linuxppc-dev/attachments/20080904/4a086dc6/attachment.c>

