Re: Faster StrNCpy - Mailing list pgsql-hackers

From	Strong, David
Subject	Re: Faster StrNCpy
Date	October 2, 2006 16:06:54
Msg-id	B6419AF36AC8524082E1BC17DA2506E802579E2C@USMV-EXCH2.na.uis.unisys.com
In response to	Re: Faster StrNCpy ("Strong, David" <david.strong@unisys.com>)
Responses	Re: Faster StrNCpy (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers
Mark,
Thanks for attaching the C code for your test. I ran a few tests on a 3 GHz Intel Xeon Paxville (dual core) system. I
hope the formatting of this table survives:
Method  Size  N=1024*1024       N=1
MEMCPY  63     6964927 us     582494 us
MEMCPY  32     7102497 us     582467 us
MEMCPY  16     7116358 us     582538 us
MEMCPY  8      6965239 us     582796 us
MEMCPY  4      6964722 us     583183 us
STRNCPY 63    10131174 us    8843010 us
STRNCPY 32    10648202 us    9563868 us
STRNCPY 16     9187398 us    7969947 us
STRNCPY 8      9275353 us    8042777 us
STRNCPY 4      9067570 us    8058532 us
STRLCPY 63    15045507 us   14379702 us
STRLCPY 32     8960303 us    8120471 us
STRLCPY 16     7393607 us    4915457 us
STRLCPY 8      7222983 us    3211931 us
STRLCPY 4      7181267 us    1725546 us
LENCPY  63     7608932 us    4416602 us
LENCPY  32     7252849 us    3807535 us
LENCPY  16    11680927 us   10331487 us
LENCPY  8     10409685 us    9660616 us
LENCPY  4     10824632 us    9525082 us

The first column is the copy method, the second column is the source string size (size of -DSTRING), and the third and
fourth columns are the timings for the two -DN settings.
The memcpy () call is the clear winner at all source string sizes. The strncpy () call is better than strlcpy ()
until the size of the string decreases. This is probably due to the zero padding effect of strncpy (), which always
fills the rest of the destination buffer. The lencpy () call starts out strong, but degrades as the size of the string
decreases. This was a little surprising and I don't have an explanation for this behavior at this time.
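To make the zero padding point concrete, here is a minimal sketch (not part of Mark's test code; padding_written () is a hypothetical helper) that counts how many zero bytes strncpy () writes for a given source string and copy size. A short source in a 64 byte copy costs a full 64 bytes of stores:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: copy src with strncpy()
 * into a scratch buffer and count the zero bytes found in the first
 * dstsize bytes afterwards. The buffer is pre-filled with 'x', so every
 * zero counted was written by strncpy() itself.
 */
size_t padding_written(const char *src, size_t dstsize)
{
    char   dst[64];
    size_t zeros = 0;

    assert(dstsize > 0 && dstsize <= sizeof dst);

    memset(dst, 'x', sizeof dst);   /* sentinel fill */
    strncpy(dst, src, dstsize);     /* pads with '\0' up to dstsize */

    for (size_t i = 0; i < dstsize; i++)
        if (dst[i] == '\0')
            zeros++;

    return zeros;
}
```

For a 3 character source and a 64 byte copy this reports 61 zero bytes; for a source longer than the copy size it reports 0, because strncpy () then writes no terminator at all.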
The AMD optimization manuals have some interesting examples for optimizations for memcpy, along the lines of cache line
copies and prefetching:

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF#search=%22amd%20optimization%20manual%22
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf#search=%22amd%20optimization%20manual%22
There also used to be an interesting article on the SGI web site called "Optimizing CPU to Memory Accesses on the SGI
Visual Workstations 320 and 540", but this seems to have been pulled. I did find a copy of the article here:
http://eunchul.com/database/board/cat.php?data=Win32_API&board_group=D42a8ff5c3a9b9
Obviously, different copy mechanisms suit different data sizes. So, I added a little debug code to the strlcpy () function
that was added to Postgres the other day. I ran a test against Postgres for ~15 minutes that used 2 client backends and
the BG writer - 8330804 calls to strlcpy () were generated by the test.
Out of the 8330804 calls, 6226616 calls used a maximum copy size of 2213 bytes, e.g. strlcpy (dest, src, 2213), and
2104074 calls used a maximum copy size of 64 bytes.
I know the 2213 size calls come from the set_ps_display () function. I don't know where the 64 size calls come from,
yet.
In the 64 size case, with the exception of 35 calls, the calls are only copying 1 byte - I would assume this is
a NUL terminator.
In the 2213 size case, 1933027 calls copy 20 bytes; 2189415 calls copy 5 bytes; 85550 calls copy 6 bytes and 2018482
calls copy 7 bytes.
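For anyone who wants to collect similar numbers, the kind of debug counting described above could look roughly like the sketch below. This is not the code I actually ran: counted_strlcpy (), MAX_TRACKED and the two counter arrays are made-up names, and the sketch assumes the "bytes copied" figure includes the terminating NUL.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 4096    /* hypothetical cap; larger values share the last bucket */

/* Calls bucketed by the siz argument and by bytes written (including the NUL). */
static unsigned long calls_by_size[MAX_TRACKED + 1];
static unsigned long calls_by_written[MAX_TRACKED + 1];

/*
 * strlcpy()-style copy with counters attached: copies at most siz - 1
 * bytes, always terminates when siz > 0, and returns strlen(src).
 */
size_t counted_strlcpy(char *dst, const char *src, size_t siz)
{
    size_t srclen = 0;
    size_t written = 0;

    assert(dst != NULL && src != NULL);

    while (src[srclen] != '\0') {
        if (srclen + 1 < siz) {         /* leave room for the terminator */
            dst[srclen] = src[srclen];
            written++;
        }
        srclen++;
    }
    if (siz > 0) {
        dst[written] = '\0';
        written++;                      /* the terminator counts as a byte */
    }

    calls_by_size[siz < MAX_TRACKED ? siz : MAX_TRACKED]++;
    calls_by_written[written < MAX_TRACKED ? written : MAX_TRACKED]++;

    return srclen;
}
```

Dumping the two arrays at backend exit would give a histogram of the same shape as the figures above. Under this counting, an empty source string shows up as a 1 byte copy.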
Based on this data, it would seem that either memcpy () or strlcpy () calls would be better due to the source string
size. 
Calls originating from the set_ps_display () function might be able to use the memcpy () call as the size of the source
string should be known. The other calls probably need something like strlcpy () as the source string length might not be
known, although using memcpy () to copy in XX byte blocks might be interesting.
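As a rough illustration of the two cases, here is a sketch under the assumption that a caller like set_ps_display () can supply the source length. Both function names are hypothetical, and nothing here is taken from the Postgres tree. The second function is a strlen () followed by memcpy (), which is my reading of what the LENCPY test does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy a string whose length the caller already
 * knows. Truncates to fit dstsize, always NUL-terminates, and returns
 * the number of bytes copied, not counting the terminator.
 */
size_t copy_known_len(char *dst, size_t dstsize, const char *src, size_t srclen)
{
    size_t n;

    assert(dstsize > 0);

    n = srclen < dstsize - 1 ? srclen : dstsize - 1;
    memcpy(dst, src, n);    /* no per-byte terminator test, no padding */
    dst[n] = '\0';
    return n;
}

/* Length not known up front: measure first, then reuse the same copy. */
size_t copy_unknown_len(char *dst, size_t dstsize, const char *src)
{
    return copy_known_len(dst, dstsize, src, strlen(src));
}
```

The first form only touches the bytes it needs, which matters when a 5 to 20 byte string goes into a 2213 byte buffer. The second form reads the source twice, once in strlen () and once in memcpy ().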
David

________________________________

From: pgsql-hackers-owner@postgresql.org on behalf of mark@mark.mielke.cc
Sent: Fri 9/29/2006 2:59 PM
To: Tom Lane
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Faster StrNCpy



On Fri, Sep 29, 2006 at 05:34:30PM -0400, Tom Lane wrote:
> mark@mark.mielke.cc writes:
> > If anybody is curious, here are my numbers for an AMD X2 3800+:
> You did not show your C code, so no one else can reproduce the test on
> other hardware.  However, it looks like your compiler has unrolled the
> memcpy into straight-line 8-byte moves, which makes it pretty hard for
> anything operating byte-wise to compete, and is a bit dubious for the
> general case anyway (since it requires assuming that the size and
> alignment are known at compile time).

I did show the .s code. I call into x_memcpy(a, b), meaning that the
compiler can't assume anything. It may happen to be aligned.

Here are results over 64 Mbytes of memory, to ensure that every call is
a cache miss:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE:        767243 us
MEMCPY:     6044137 us
STRNCPY:   10741759 us
STRLCPY:   12061630 us
LENCPY:     9459099 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE:        712193 us
MEMCPY:     6072312 us
STRNCPY:    9982983 us
STRLCPY:    6605052 us
LENCPY:     7128258 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE:        708164 us
MEMCPY:     6042817 us
STRNCPY:    8885791 us
STRLCPY:    5592477 us
LENCPY:     6135550 us

At least on my machine, memcpy() still comes out on top. Yes, assuming that
it is aligned correctly for the machine. Here is unaligned (all arrays are
stored +1 offset in memory):

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        790932 us
MEMCPY:     6591559 us
STRNCPY:   10622291 us
STRLCPY:   12070007 us
LENCPY:    10322541 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        764577 us
MEMCPY:     6631731 us
STRNCPY:    9513540 us
STRLCPY:    6615345 us
LENCPY:     7263392 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        825689 us
MEMCPY:     6607777 us
STRNCPY:    8976487 us
STRLCPY:    5878088 us
LENCPY:     6180358 us

Alignment looks like it does impact the results for memcpy(). memcpy()
changes from around 6.0 seconds to 6.6 seconds. Overall, though, it is
still the winner in all cases except for strlcpy(), which beats it on
very short strings ("").

Here is the cache hit case including your strlen+memcpy as 'LENCPY':

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        696157 us
MEMCPY:      825118 us
STRNCPY:    7983159 us
STRLCPY:   10787462 us
LENCPY:     6048339 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        700201 us
MEMCPY:      593701 us
STRNCPY:    7577380 us
STRLCPY:    3727801 us
LENCPY:     3169783 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE:        706283 us
MEMCPY:      792719 us
STRNCPY:    7870425 us
STRLCPY:     681334 us
LENCPY:     2062983 us


First call was every call being a cache hit. With this one, every one is
a cache miss, and the 64-byte blocks are spread equally over 64 Mbytes of
memory. I've attached the code for your consideration. x.c is the routines
I used to perform the tests. y.c is the main program. strlcpy.c is copied
from the online reference as is without change. The compilation steps
are described above. STRING is the string to try out. N is the number
of 64-byte blocks to allocate. ALIGN is the number of bytes to offset
the array by when storing / reading / writing. ALIGN should be >= 0.

At N=1, it's all in cache. At N=1024*1024 it is taking up 64 Mbytes of
RAM.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all
                      and in the darkness bind them...
                          http://mark.mielke.cc/