Thread: Faster StrNCpy
David Strong points out here
http://archives.postgresql.org/pgsql-hackers/2006-09/msg02071.php
that some popular implementations of strncpy(dst,src,n) are quite
inefficient when strlen(src) is much less than n, because they don't
optimize the zero-pad step that is required by the standard.

It looks to me like we have a good number of places that are using
either StrNCpy or strncpy directly to copy into large buffers that we
do not need full zero-padding in, only a single guaranteed null byte.
While not all of these places are in performance-critical paths, some
are.  David identified set_ps_display, and the other thing that's
probably significant is unnecessary use of strncpy for keys of
string-keyed hash tables.  (We used to actually need zero padding for
string-keyed hash keys, but that was a long time ago.)

I propose adding an additional macro in c.h, along the lines of

#define StrNCopy(dst,src,len) \
	do \
	{ \
		char * _dst = (dst); \
		Size _len = (len); \
\
		if (_len > 0) \
		{ \
			const char * _src = (src); \
			Size _src_len = strlen(_src); \
\
			if (_src_len < _len) \
				memcpy(_dst, _src, _src_len + 1); \
			else \
			{ \
				memcpy(_dst, _src, _len - 1); \
				_dst[_len-1] = '\0'; \
			} \
		} \
	} while (0)

Unlike StrNCpy, this requires that the source string be null-terminated,
so it would not be a drop-in replacement everywhere.  Also, it could be
a performance loss if strlen(src) is much larger than len ... but that
is not usually the case for the places we'd want to apply it.

Thoughts, objections?  In particular, is the name OK, or do we need
something a bit further away from StrNCpy?

			regards, tom lane
On Tue, Sep 26, 2006 at 04:24:51PM -0400, Tom Lane wrote: > David Strong points out here > http://archives.postgresql.org/pgsql-hackers/2006-09/msg02071.php > that some popular implementations of strncpy(dst,src,n) are quite > inefficient when strlen(src) is much less than n, because they don't > optimize the zero-pad step that is required by the standard. I think that's why strlcpy was invented, to deal with the issues with strncpy. http://www.gratisoft.us/todd/papers/strlcpy.html There's an implementation here (used in glib), though you could probably find more. http://mail.gnome.org/archives/gtk-devel-list/2000-May/msg00029.html Do you really think it's worth making a macro rather than just a normal function? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote: > On Tue, Sep 26, 2006 at 04:24:51PM -0400, Tom Lane wrote: > > David Strong points out here > > http://archives.postgresql.org/pgsql-hackers/2006-09/msg02071.php > > that some popular implementations of strncpy(dst,src,n) are quite > > inefficient when strlen(src) is much less than n, because they don't > > optimize the zero-pad step that is required by the standard. > > I think that's why strlcpy was invented, to deal with the issues with > strncpy. > > http://www.gratisoft.us/todd/papers/strlcpy.html > > There's an implementation here (used in glib), though you could > probably find more. > > http://mail.gnome.org/archives/gtk-devel-list/2000-May/msg00029.html That one would be LGPL (glib's license). Here is OpenBSD's version, linked from that one: ftp://ftp.openbsd.org/pub/OpenBSD/src/lib/libc/string/strlcpy.c You'll notice that it iterates once per char. Between that and the strlen() call in Tom's version, not sure which is the lesser evil. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
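[For reference, the OpenBSD implementation linked above is essentially the
following byte-at-a-time loop -- paraphrased from memory here rather than
quoted verbatim, so see the ftp URL for the authoritative file.  It copies
at most siz-1 bytes, always NUL-terminates the destination when siz > 0,
and returns strlen(src) so callers can detect truncation.]

#include <string.h>

size_t
strlcpy(char *dst, const char *src, size_t siz)
{
	char	   *d = dst;
	const char *s = src;
	size_t		n = siz;

	/* Copy as many bytes as will fit, one byte per loop iteration */
	if (n != 0)
	{
		while (--n != 0)
		{
			if ((*d++ = *s++) == '\0')
				break;
		}
	}

	/* Not enough room in dst: add NUL and walk the rest of src */
	if (n == 0)
	{
		if (siz != 0)
			*d = '\0';			/* NUL-terminate dst */
		while (*s++)
			;
	}

	return (s - src - 1);		/* count does not include the NUL */
}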
Martijn van Oosterhout <kleptog@svana.org> writes: > I think that's why strlcpy was invented, to deal with the issues with > strncpy. > http://www.gratisoft.us/todd/papers/strlcpy.html strlcpy does more than we need (note that none of the existing uses care about counting the overflowed bytes). Not sure if it's worth adopting those semantics when they're not really standard, but if you think a lot of people would be familiar with strlcpy, maybe we should. > Do you really think it's worth making a macro rather than just a normal > function? Only in that a macro in c.h is less work than a configure test plus a replacement file in src/port. But if we want to consider this a standard function that just doesn't happen to exist everywhere, I suppose we should use configure. regards, tom lane
On Tue, 2006-09-26 at 16:53 -0400, Tom Lane wrote: > strlcpy does more than we need (note that none of the existing uses care > about counting the overflowed bytes). Not sure if it's worth adopting > those semantics when they're not really standard, but if you think a lot > of people would be familiar with strlcpy, maybe we should. I think we should -- while strlcpy() is not standardized, it is widely used (in libc on all the BSDs, Solaris and OS X, as well as private copies in Linux, glib, etc.). A wholesale replacement of strncpy() calls is probably worth doing -- replacing them with strlcpy() if the source string is NUL-terminated, and I suppose memcpy() otherwise. -Neil
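[To make the proposed substitution concrete, a hypothetical call site --
not code from the tree -- might change like this, assuming a strlcpy()
declaration is in scope (either from libc or a private copy like the one
being discussed):]

#include <string.h>

size_t		strlcpy(char *dst, const char *src, size_t siz);

void
copy_examples(const char *src, const char *raw, size_t rawlen)
{
	char		buf[1024];

	/* Before: strncpy zero-pads the entire 1024-byte buffer */
	strncpy(buf, src, sizeof(buf));
	buf[sizeof(buf) - 1] = '\0';

	/* After, when src is known to be NUL-terminated */
	strlcpy(buf, src, sizeof(buf));

	/* After, when the source is counted bytes rather than a C string */
	if (rawlen >= sizeof(buf))
		rawlen = sizeof(buf) - 1;
	memcpy(buf, raw, rawlen);
	buf[rawlen] = '\0';
}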
Alvaro Herrera <alvherre@commandprompt.com> writes: > You'll notice that it iterates once per char. Between that and the > strlen() call in Tom's version, not sure which is the lesser evil. Yeah, I was wondering that too. My code would require two scans of the source string (one inside strlen and one in memcpy), but in much of our usage the source and dest should be reasonably well aligned and one could expect memcpy to be using word rather than byte operations, so you might possibly make it back on the strength of fewer write cycles. And on the third hand, for short source strings none of this matters and the extra function call involved for strlen/memcpy probably dominates. I'm happy to just use the OpenBSD version as a src/port module. Any objections? regards, tom lane
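[For comparison with the per-character loop above, the strlen+memcpy
approach from the opening message looks roughly like this when written as
a function instead of a macro -- the helper name is made up for the
illustration.  It makes two passes over the source (one in strlen, one in
memcpy) but writes no padding and requires src to be NUL-terminated:]

#include <string.h>

static void
str_n_copy(char *dst, const char *src, size_t len)
{
	if (len > 0)
	{
		size_t		src_len = strlen(src);

		if (src_len < len)
			memcpy(dst, src, src_len + 1);	/* includes the terminating NUL */
		else
		{
			memcpy(dst, src, len - 1);
			dst[len - 1] = '\0';
		}
	}
}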
Neil Conway <neilc@samurai.com> writes: > A wholesale replacement of strncpy() calls is probably worth doing -- > replacing them with strlcpy() if the source string is NUL-terminated, > and I suppose memcpy() otherwise. What I'd like to do immediately is put in strlcpy() and hit the two or three places I think are performance-relevant. I agree with trying to get rid of StrNCpy/strncpy calls over the long run, but it's just code beautification and probably not appropriate for beta. regards, tom lane
Tom, > What I'd like to do immediately is put in strlcpy() and hit the two or > three places I think are performance-relevant. I agree with trying to > get rid of StrNCpy/strncpy calls over the long run, but it's just code > beautification and probably not appropriate for beta. Immediately? Presumably you mean for 8.3? -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> What I'd like to do immediately is put in strlcpy() and hit the two or >> three places I think are performance-relevant. > Immediately? Presumably you mean for 8.3? No, I mean now. This is a performance bug and it's still open season on bugs. If we were close to having a release-candidate version, I'd hold off, but the above proposal seems sufficiently low-risk for the current stage of the cycle. regards, tom lane
Tom,

Let us know when you've added strlcpy () and we'll be happy to run some
tests on the new code.

David
Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >>> What I'd like to do immediately is put in strlcpy() and hit the two or >>> three places I think are performance-relevant. >>> > > >> Immediately? Presumably you mean for 8.3? >> > > No, I mean now. This is a performance bug and it's still open season on > bugs. If we were close to having a release-candidate version, I'd hold > off, but the above proposal seems sufficiently low-risk for the current > stage of the cycle. > > What are the other hotspots? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> What I'd like to do immediately is put in strlcpy() and hit the two or >> three places I think are performance-relevant. > What are the other hotspots? The ones I can think of offhand are set_ps_display and use of strncpy as a HashCopyFunc. regards, tom lane
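[As an illustration of the hash-key case: a string-keyed hash table only
needs the key NUL-terminated, yet a strncpy-based copy routine pads the
whole fixed-size key buffer.  The helper below uses a hypothetical name
and signature (not PostgreSQL's actual HashCopyFunc contract), shown only
to contrast the two behaviors:]

#include <string.h>

size_t		strlcpy(char *dst, const char *src, size_t siz);

/* Copy a string key into a fixed-size slot, guaranteeing termination */
static void *
string_key_copy(void *dest, const void *src, size_t keysize)
{
	/* strncpy(dest, src, keysize) would also zero-pad all keysize bytes; */
	/* strlcpy() stops after writing the terminating NUL:                 */
	strlcpy(dest, src, keysize);
	return dest;
}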
We sometimes see TupleDescInitEntry () taking high CPU times via OProfile.
This does include, amongst a lot of other code, a call to namestrcpy ()
which in turn calls StrNCpy ().  Perhaps this is not a good candidate
right now as a name string is only 64 bytes.

David
On Wed, Sep 27, 2006 at 07:08:05AM -0700, Strong, David wrote:
> We sometimes see TupleDescInitEntry () taking high CPU times via
> OProfile. This does include, amongst a lot of other code, a call to
> namestrcpy () which in turn calls StrNCpy (). Perhaps this is not a
> good candidate right now as a name string is only 64 bytes.

Just wondering - are any of these cases where a memcpy() would work just
as well? Or are you not sure that the source string is at least 64 bytes
in length?

    memcpy(&target, &source, sizeof(target));
    target[sizeof(target)-1] = '\0';

I imagine any extra checking causes processor stalls, or at least for the
branch prediction to fill up? Straight copies might allow for maximum
parallelism? If it's only 64 bytes, on processors such as Pentium or
Athlon, that's 2 or 4 cache lines, and writes are always performed as
cache lines.

I haven't seen the code that you and Tom are looking at to tell whether
it is safe to do this or not.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Hi all

I want to know what is going on while a DML command works. For example:
which commands are executed by the core when we send an "UPDATE tab SET
col = val1..." in case there is a foreign key or a unique constraint on
table "tab". How can I see that?

Best regards

Adnan DURSUN
ASRIN Bilişim Ltd.
On Thu, Sep 28, 2006 at 04:27:36AM +0300, Adnan DURSUN wrote:
>
> Hi all
>
> I wanna know what is going on while a DML command works. For example;
> Which commands are executed by the core when we send an "UPDATE tab
> SET col = val1..."

Adnan, this mailing list is not the right one for such questions. More
appropriate would be <pgsql-novice@postgresql.org> or maybe
<pgsql-general@postgresql.org>.

Having said that, you may set the log level of the server in the
configuration file (whose location depends on your OS and PostgreSQL
version). Look there for a line

    log_statements = XXX

and set XXX to 'all'. Don't forget to restart your server afterwards.

HTH
-- tomas
Mark,

In the specific case of the namestrcpy () function, it will copy a
maximum of 64 bytes, but the length of the source string is unknown. I
would have to think that memcpy () would certainly win if you knew the
source and destination sizes etc. Perhaps there are some places like
that in the code that don't use memcpy () currently?

David
"Strong, David" <david.strong@unisys.com> writes: > Just wondering - are any of these cases where a memcpy() would work > just as well? Or are you not sure that the source string is at least > 64 bytes in length? In most cases, we're pretty sure that it's *not* --- it'll just be a palloc'd C string. I'm disinclined to fool with the restriction that namestrcpy zero-pad Name values, because they might end up on disk, and allowing random memory contents to get written out is ungood from a security point of view. However, it's entirely possible that it'd be a bit faster to do a MemSet followed by strlcpy than to use strncpy for zero-padding. regards, tom lane
Hi, Tom, Tom Lane wrote: > "Strong, David" <david.strong@unisys.com> writes: >> Just wondering - are any of these cases where a memcpy() would work >> just as well? Or are you not sure that the source string is at least >> 64 bytes in length? > > In most cases, we're pretty sure that it's *not* --- it'll just be a > palloc'd C string. > > I'm disinclined to fool with the restriction that namestrcpy zero-pad > Name values, because they might end up on disk, and allowing random > memory contents to get written out is ungood from a security point of > view. There's another disadvantage of always copying 64 bytes: It may be that the 64-byte range crosses a page boundary. Now guess what happens when this next page is not mapped -> segfault. Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
Markus Schaber <schabi@logix-tt.com> writes: > There's another disadvantage of always copying 64 bytes: > It may be that the 64-byte range crosses a page boundary. Now guess what > happens when this next page is not mapped -> segfault. Irrelevant, because in all interesting cases the Name field is part of a larger record that would stretch into that other page anyway. regards, tom lane
On Fri, Sep 29, 2006 at 11:21:21AM +0200, Markus Schaber wrote:
> Tom Lane wrote:
> >> Just wondering - are any of these cases where a memcpy() would work
> >> just as well? Or are you not sure that the source string is at least
> >> 64 bytes in length?
> >
> > In most cases, we're pretty sure that it's *not* --- it'll just be a
> > palloc'd C string.
> >
> > I'm disinclined to fool with the restriction that namestrcpy zero-pad
> > Name values, because they might end up on disk, and allowing random
> > memory contents to get written out is ungood from a security point of
> > view.
>
> There's another disadvantage of always copying 64 bytes:
>
> It may be that the 64-byte range crosses a page boundary. Now guess what
> happens when this next page is not mapped -> segfault.

With strncpy(), this possibility already exists. If it is a real problem,
that stand-alone 64-byte allocations are crossing page boundaries, the
fault is with the memory allocator, not with the user of the memory.

For strlcpy(), my suggestion that Tom quotes was that modern processors
do best when instructions can be fully parallelized. It is a lot easier
to parallelize a 64-byte copy, than a tight loop looking for '\0' or
n >= 64. 64 bytes easily fits into cache memory, and modern processors
write cache memory in blocks of 16, 32, or 64 bytes anyways, meaning that
any savings in terms of not writing are minimal.

But it's only safe if you know that the source string allocation is >= 64
bytes. Often you don't, therefore it isn't safe, and the suggestion is
unworkable.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
If anybody is curious, here are my numbers for an AMD X2 3800+:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be slow."' -o x x.c y.c strlcpy.c ; ./x
NONE: 620268 us
MEMCPY: 683135 us
STRNCPY: 7952930 us
STRLCPY: 10042364 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -o x x.c y.c strlcpy.c ; ./x
NONE: 554694 us
MEMCPY: 691390 us
STRNCPY: 7759933 us
STRLCPY: 3710627 us

$ gcc -O3 -std=c99 -DSTRING='""' -o x x.c y.c strlcpy.c ; ./x
NONE: 631266 us
MEMCPY: 775340 us
STRNCPY: 7789267 us
STRLCPY: 550430 us

Each invocation represents 100 million calls to each of the functions.
Each function accepts a 'dst' and 'src' argument, and assumes that it is
copying 64 bytes from 'src' to 'dst'. The none function does nothing.
The memcpy calls memcpy(), the strncpy calls strncpy(), and the strlcpy
calls the strlcpy() that was posted from the BSD sources. (GLIBC doesn't
have strlcpy() on my machine).

This makes it clear what the overhead of the additional logic involves.
memcpy() is approximately equal to nothing at all. strncpy() is always
expensive. strlcpy() is often more expensive than memcpy(), except in
the empty string case.

These tests do not properly model the effects of real memory, however,
they do model the effects of cache memory. I would suggest that the
results are exaggerated, but not invalid.

For anybody doubting the none vs memcpy, I've included the generated
assembly code. I chalk it entirely up to fully utilizing the
parallelization capability of the CPU. Although 16 movq instructions are
executed, they can be executed fully in parallel.

It almost makes it clear to me that all of these instructions are pretty
fast. Are we sure this is a real bottleneck? Even the slowest operation
above, strlcpy() on a very long string, appears to execute 10 per
microsecond? Perhaps my tests are too easy for my CPU and I need to make
it access many different 64-byte blocks? :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Attachment
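[A rough reconstruction of the kind of harness being described -- not
Mark's actual x.c/y.c test files -- might look like the following: each
routine copies from a 64-byte source into a 64-byte destination, is called
through a function pointer to discourage inlining, and is timed with
gettimeofday().  The file name and exact structure are made up; build with
something like gcc -O3 -std=c99 bench.c.]

#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define CALLS 100000000L
#define BUFSZ 64

static void x_none(char *dst, const char *src) { (void) dst; (void) src; }
static void x_memcpy(char *dst, const char *src) { memcpy(dst, src, BUFSZ); }
static void x_strncpy(char *dst, const char *src) { strncpy(dst, src, BUFSZ); }

static void
time_one(const char *name, void (*fn) (char *, const char *), const char *src)
{
	char		dst[BUFSZ];
	struct timeval start, stop;

	gettimeofday(&start, NULL);
	for (long i = 0; i < CALLS; i++)
		fn(dst, src);
	gettimeofday(&stop, NULL);

	/* report elapsed wall-clock time in microseconds */
	printf("%8s: %ld us\n", name,
		   (long) ((stop.tv_sec - start.tv_sec) * 1000000L +
				   (stop.tv_usec - start.tv_usec)));
}

int
main(void)
{
	const char	src[BUFSZ] = "Short sentence.";

	time_one("NONE", x_none, src);
	time_one("MEMCPY", x_memcpy, src);
	time_one("STRNCPY", x_strncpy, src);
	return 0;
}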
mark@mark.mielke.cc writes: > If anybody is curious, here are my numbers for an AMD X2 3800+: You did not show your C code, so no one else can reproduce the test on other hardware. However, it looks like your compiler has unrolled the memcpy into straight-line 8-byte moves, which makes it pretty hard for anything operating byte-wise to compete, and is a bit dubious for the general case anyway (since it requires assuming that the size and alignment are known at compile time). This does make me wonder about whether we shouldn't try the strlen+memcpy implementation I proposed earlier ... regards, tom lane
On Fri, Sep 29, 2006 at 05:34:30PM -0400, Tom Lane wrote:
> mark@mark.mielke.cc writes:
> > If anybody is curious, here are my numbers for an AMD X2 3800+:
> You did not show your C code, so no one else can reproduce the test on
> other hardware. However, it looks like your compiler has unrolled the
> memcpy into straight-line 8-byte moves, which makes it pretty hard for
> anything operating byte-wise to compete, and is a bit dubious for the
> general case anyway (since it requires assuming that the size and
> alignment are known at compile time).

I did show the .s code. I call into x_memcpy(a, b), meaning that the
compiler can't assume anything. It may happen to be aligned.

Here are results over 64 Mbytes of memory, to ensure that every call is
a cache miss:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 767243 us
MEMCPY: 6044137 us
STRNCPY: 10741759 us
STRLCPY: 12061630 us
LENCPY: 9459099 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 712193 us
MEMCPY: 6072312 us
STRNCPY: 9982983 us
STRLCPY: 6605052 us
LENCPY: 7128258 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 708164 us
MEMCPY: 6042817 us
STRNCPY: 8885791 us
STRLCPY: 5592477 us
LENCPY: 6135550 us

At least on my machine, memcpy() still comes out on top. Yes, assuming
that it is aligned correctly for the machine.

Here is unaligned (all arrays are stored +1 offset in memory):

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 790932 us
MEMCPY: 6591559 us
STRNCPY: 10622291 us
STRLCPY: 12070007 us
LENCPY: 10322541 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 764577 us
MEMCPY: 6631731 us
STRNCPY: 9513540 us
STRLCPY: 6615345 us
LENCPY: 7263392 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 825689 us
MEMCPY: 6607777 us
STRNCPY: 8976487 us
STRLCPY: 5878088 us
LENCPY: 6180358 us

Alignment looks like it does impact the results for memcpy(). memcpy()
changes from around 6.0 seconds to 6.6 seconds. Overall, though, it is
still the winner in all cases except for strlcpy(), which beats it on
very short strings ("").

Here is the cache hit case including your strlen+memcpy as 'LENCPY':

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 696157 us
MEMCPY: 825118 us
STRNCPY: 7983159 us
STRLCPY: 10787462 us
LENCPY: 6048339 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 700201 us
MEMCPY: 593701 us
STRNCPY: 7577380 us
STRLCPY: 3727801 us
LENCPY: 3169783 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 706283 us
MEMCPY: 792719 us
STRNCPY: 7870425 us
STRLCPY: 681334 us
LENCPY: 2062983 us

First call was every call being a cache hit. With this one, every one is
a cache miss, and the 64-byte blocks are spread equally over 64 Mbytes
of memory.

I've attached the code for your consideration. x.c is the routines I
used to perform the tests. y.c is the main program. strlcpy.c is copied
from the online reference as is without change. The compilation steps
are described above. STRING is the string to try out. N is the number of
64-byte blocks to allocate. ALIGN is the number of bytes to offset the
array by when storing / reading / writing. ALIGN should be >= 0. At N=1,
it's all in cache. At N=1024*1024 it is taking up 64 Mbytes of RAM.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Attachment
Mark,

Thanks for attaching the C code for your test. I ran a few tests on a
3Ghz Intel Xeon Paxville (dual core) system. I hope the formatting of
this table survives:

Method    Size   N=1024*1024      N=1
MEMCPY     63     6964927 us    582494 us
MEMCPY     32     7102497 us    582467 us
MEMCPY     16     7116358 us    582538 us
MEMCPY      8     6965239 us    582796 us
MEMCPY      4     6964722 us    583183 us
STRNCPY    63    10131174 us   8843010 us
STRNCPY    32    10648202 us   9563868 us
STRNCPY    16     9187398 us   7969947 us
STRNCPY     8     9275353 us   8042777 us
STRNCPY     4     9067570 us   8058532 us
STRLCPY    63    15045507 us  14379702 us
STRLCPY    32     8960303 us   8120471 us
STRLCPY    16     7393607 us   4915457 us
STRLCPY     8     7222983 us   3211931 us
STRLCPY     4     7181267 us   1725546 us
LENCPY     63     7608932 us   4416602 us
LENCPY     32     7252849 us   3807535 us
LENCPY     16    11680927 us  10331487 us
LENCPY      8    10409685 us   9660616 us
LENCPY      4    10824632 us   9525082 us

The first column is the copy method, the second column is the source
string size (size of -DSTRING), the 3rd and 4th columns are the
different -DN settings.

The memcpy () call is the clear winner, at all source string sizes. The
strncpy () call is better than strlcpy (), until the size of the string
decreases. This is probably due to the zero padding effect of strncpy.
The lencpy () call starts out strong, but degrades as the size of the
string decreases. This was a little surprising and I don't have an
explanation for this behavior at this time.

The AMD optimization manuals have some interesting examples for
optimizations for memcpy, along the lines of cache line copies and
prefetching:

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF#search=%22amd%20optimization%20manual%22
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf#search=%22amd%20optimization%20manual%22

There also used to be an interesting article on the SGI web site called
"Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320
and 540", but this seems to have been pulled. I did find a copy of the
article here:

http://eunchul.com/database/board/cat.php?data=Win32_API&board_group=D42a8ff5c3a9b9

Obviously, different copy mechanisms suit different data sizes. So, I
added a little debug to the strlcpy () function that was added to
Postgres the other day. I ran a test against Postgres for ~15 minutes
that used 2 client backends and the BG writer - 8330804 calls to
strlcpy () were generated by the test.

Out of the 8330804 calls, 6226616 calls used a maximum copy size of 2213
bytes e.g. strlcpy (dest, src, 2213) and 2104074 calls used a maximum
copy size of 64 bytes.

I know the 2213 size calls come from the set_ps_display () function. I
don't know where the 64 size calls come from, yet.

In the 64 size case, with the exception of 35 calls, calls for size 64
are only copying 1 byte - I would assume this is a NULL.

In the 2213 size case, 1933027 calls copy 20 bytes; 2189415 calls copy 5
bytes; 85550 calls copy 6 bytes and 2018482 calls copy 7 bytes.

Based on this data, it would seem that either memcpy () or strlcpy ()
calls would be better due to the source string size. Calls originating
from the set_ps_display () function might be able to use the memcpy ()
call as the size of the source string should be known. The other calls
probably need something like strlcpy () as the source string might not
be known, although using memcpy () to copy in XX byte blocks might be
interesting.

David
mark@mark.mielke.cc writes:
> Here is the cache hit case including your strlen+memcpy as 'LENCPY':
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
> NONE: 696157 us
> MEMCPY: 825118 us
> STRNCPY: 7983159 us
> STRLCPY: 10787462 us
> LENCPY: 6048339 us

It appears that these results are a bit platform-dependent; on my x86_64
(Xeon) Fedora 5 box, I get

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 x.c y.c strlcpy.c
$ ./a.out
NONE: 358679 us
MEMCPY: 619255 us
STRNCPY: 8932551 us
STRLCPY: 9212371 us
LENCPY: 13910413 us

I'm not sure why the lencpy method sucks so badly on this machine :-(.

Anyway, I looked at glibc's strncpy and determined that on this machine
the only real optimization that's been done to it is to unroll the data
copying loop four times. I did the same to strlcpy (attached) and got
numbers like these:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 x.c y.c strlcpy.c
$ ./a.out
NONE: 359317 us
MEMCPY: 619636 us
STRNCPY: 8933507 us
STRLCPY: 7644576 us
LENCPY: 13917927 us

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" x.c y.c strlcpy.c
$ ./a.out
NONE: 502960 us
MEMCPY: 5382528 us
STRNCPY: 9733890 us
STRLCPY: 8740892 us
LENCPY: 15358616 us

$ gcc -O3 -std=c99 -DSTRING='"short"' -DN=1 x.c y.c strlcpy.c
$ ./a.out
NONE: 358426 us
MEMCPY: 618533 us
STRNCPY: 6704926 us
STRLCPY: 867336 us
LENCPY: 10115883 us

$ gcc -O3 -std=c99 -DSTRING='"short"' -DN="(1024*1024)" x.c y.c strlcpy.c
$ ./a.out
NONE: 502746 us
MEMCPY: 5365171 us
STRNCPY: 7983610 us
STRLCPY: 5557277 us
LENCPY: 11533066 us

So the unroll seems to get us to the point of not losing compared to the
original strncpy code for any string length, and so I propose doing
that, if it holds up on other architectures.

			regards, tom lane

size_t
strlcpy(char *dst, const char *src, size_t siz)
{
	char	   *d = dst;
	const char *s = src;
	size_t		n = siz;

	/* Copy as many bytes as will fit */
	if (n != 0)
	{
		while (n > 4)
		{
			if ((*d++ = *s++) == '\0')
				goto done;
			if ((*d++ = *s++) == '\0')
				goto done;
			if ((*d++ = *s++) == '\0')
				goto done;
			if ((*d++ = *s++) == '\0')
				goto done;
			n -= 4;
		}
		while (--n != 0)
		{
			if ((*d++ = *s++) == '\0')
				goto done;
		}
	}

	/* Not enough room in dst, add NUL and traverse rest of src */
	if (siz != 0)
		*d = '\0';				/* NUL-terminate dst */
	while (*s++)
		;

done:
	return (s - src - 1);		/* count does not include NUL */
}
Mark,

On 9/29/06 2:59 PM, "mark@mark.mielke.cc" <mark@mark.mielke.cc> wrote:
> Here are results over 64 Mbytes of memory, to ensure that every call is
> a cache miss:

On my Mac OSX intel laptop (Core Duo, 2.16 GHz, 2GB RAM, gcc 4.01):

Luke-Lonergans-Computer:~/strNcpy-perf-test lukelonergan$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
/usr/bin/ld: warning multiple definitions of symbol _strlcpy
/var/tmp//cc1eMVq7.o definition of _strlcpy in section (__TEXT,__text)
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib(strlcpy.So) definition of _strlcpy
NONE: 416677 us
MEMCPY: 5649587 us
STRNCPY: 5806591 us
STRLCPY: 12865010 us
LENCPY: 17801485 us

Luke-Lonergans-Computer:~/strNcpy-perf-test lukelonergan$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
/usr/bin/ld: warning multiple definitions of symbol _strlcpy
/var/tmp//ccOZl9R6.o definition of _strlcpy in section (__TEXT,__text)
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib(strlcpy.So) definition of _strlcpy
NONE: 416652 us
MEMCPY: 5830540 us
STRNCPY: 6207594 us
STRLCPY: 5582607 us
LENCPY: 7887703 us

Luke-Lonergans-Computer:~/strNcpy-perf-test lukelonergan$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
/usr/bin/ld: warning multiple definitions of symbol _strlcpy
/var/tmp//ccBUsIdR.o definition of _strlcpy in section (__TEXT,__text)
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib(strlcpy.So) definition of _strlcpy
NONE: 417210 us
MEMCPY: 5506346 us
STRNCPY: 5769302 us
STRLCPY: 5424234 us
LENCPY: 5609338 us

- Luke
Thanks for the results on your machines Dave, Luke, and Tom. Interesting.

On Mon, Oct 02, 2006 at 02:30:11PM -0400, Tom Lane wrote:
> It appears that these results are a bit platform-dependent; on my x86_64
> (Xeon) Fedora 5 box, I get

*nod*

> Anyway, I looked at glibc's strncpy and determined that on this machine
> the only real optimization that's been done to it is to unroll the data
> copying loop four times. I did the same to strlcpy (attached) and got
> numbers like these:

> So the unroll seems to get us to the point of not losing compared to the
> original strncpy code for any string length, and so I propose doing
> that, if it holds up on other architectures.

Sounds reasonable. strlcpy() is the best match. I'm not sure why GLIBC
doesn't come with strlcpy(). Apparently some philosophical differences
against it by the GLIBC maintainers... :-)

memcpy() makes too many assumptions, that would be difficult to prove in
each case, or may be difficult to ensure that they remain true. I only
mentioned it as a possibility, to keep the options open, and to show the
sort of effects we are dealing with.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Mark, Tom,

Just the test on IA64 (Itanium2, 1.6Ghz, 8Gb memory). The results seem
to be quite different:

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 825671 us
MEMCPY: 4533637 us
STRNCPY: 10959380 us
STRLCPY: 34258306 us
LENCPY: 13939373 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="1" -o x x.c y.c strlcpy.c ; ./x
NONE: 938815 us
MEMCPY: 3441330 us
STRNCPY: 5881628 us
STRLCPY: 20167825 us
LENCPY: 11326259 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 831920 us
MEMCPY: 4550683 us
STRNCPY: 9339254 us
STRLCPY: 16871433 us
LENCPY: 21332076 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="1" -o x x.c y.c strlcpy.c ; ./x
NONE: 938104 us
MEMCPY: 3438871 us
STRNCPY: 3438969 us
STRLCPY: 6042829 us
LENCPY: 6022909 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 825472 us
MEMCPY: 4547456 us
STRNCPY: 8591618 us
STRLCPY: 6818957 us
LENCPY: 8264340 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='""' -DN="1" -o x x.c y.c strlcpy.c ; ./x
NONE: 937878 us
MEMCPY: 3439101 us
STRNCPY: 3188791 us
STRLCPY: 750437 us
LENCPY: 2751238 us

Regards,
	Sergey

*******************************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Sternberg Astronomical Institute
Tel: +49-6221-528-349
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru
"Sergey E. Koposov" <math@sai.msu.ru> writes: > Just the test on IA64 (Itanium2, 1.6Ghz, 8Gb memory). The results seem to > be quite different: What libc are you using exactly? Can you try it with the unrolled strlcpy I posted? In glibc-2.4.90, there seem to be out-of-line assembly code implementations of strncpy for: sparc64 sparc32 s390x s390 alpha ia64 and an inlined assembler version for i386. So the x86_64 case is nearly the only popular architecture that doesn't seem to have a hand-hacked implementation ... which throws some doubt on Mark's and my results as possibly not being very representative. regards, tom lane
On Mon, 2 Oct 2006, Tom Lane wrote: > "Sergey E. Koposov" <math@sai.msu.ru> writes: >> Just the test on IA64 (Itanium2, 1.6Ghz, 8Gb memory). The results seem to >> be quite different: > > What libc are you using exactly? Can you try it with the unrolled > strlcpy I posted? glibc 2.3.5 , gcc 3.4.4 my results were obtained already with your unrolled version (but when I first ran it with the 'default' strlcpy the results were the same). > > In glibc-2.4.90, there seem to be out-of-line assembly code > implementations of strncpy for: sparc64 sparc32 s390x s390 alpha ia64 > and an inlined assembler version for i386. So the x86_64 case is > nearly the only popular architecture that doesn't seem to have > a hand-hacked implementation ... which throws some doubt on Mark's and > my results as possibly not being very representative. Regards, Sergey ******************************************************************* Sergey E. Koposov Max Planck Institute for Astronomy/Sternberg Astronomical Institute Tel: +49-6221-528-349 Web: http://lnfm1.sai.msu.ru/~math E-mail: math@sai.msu.ru
I did a couple more tests using x86 architectures. On a rather old
Pentium-4 machine running Fedora 5 (gcc 4.1.1, glibc-2.4-11):

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c
NONE: 786305 us
MEMCPY: 9887843 us
STRNCPY: 15045780 us
STRLCPY: 17555563 us
U_STRLCPY: 14994639 us
LENCPY: 19700346 us

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c
NONE: 562001 us
MEMCPY: 2026546 us
STRNCPY: 11149134 us
STRLCPY: 13747827 us
U_STRLCPY: 12467527 us
LENCPY: 17672899 us

(STRLCPY is our CVS HEAD code, U_STRLCPY is the unrolled version)

On a Mac Mini (Intel Core Duo, OS X 10.4.8, gcc 4.0.1), the system has a
version of strlcpy, but it seems to suck :-(

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 480298 us
MEMCPY: 7857291 us
STRNCPY: 10485948 us
STRLCPY: 16745154 us
U_STRLCPY: 18337286 us
S_STRLCPY: 20920213 us
LENCPY: 22878114 us

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 480269 us
MEMCPY: 1858974 us
STRNCPY: 5405618 us
STRLCPY: 16364452 us
U_STRLCPY: 16439753 us
S_STRLCPY: 19134538 us
LENCPY: 22873141 us

It's interesting that the unrolled version is actually slower here. I
didn't dig into the assembly code, but maybe Apple's compiler isn't
doing a very good job with it?

Anyway, these results make me less excited about the unrolled version.

In any case, I don't think we should put too much emphasis on the
long-source-string case. In essentially all cases, the true source
string length will be much shorter than the target buffer (were this not
so, we'd probably be needing to make the buffer bigger), and strlcpy
will certainly beat out strncpy in those situations. The memcpy numbers
look attractive, but they ignore the problem that in practice we usually
don't know the source string length in advance --- so I don't think
those represent something achievable.

One thing that seems real clear is that the LENCPY method loses across
the board, which surprised me, but it's hard to argue with numbers.

I'm still interested to experiment with MemSet-then-strlcpy for
namestrcpy, but given the LENCPY results this may be a loser too.

			regards, tom lane
"Strong, David" <david.strong@unisys.com> writes: > Obviously, different copy mechanisms suit different data sizes. So, I > added a little debug to the strlcpy () function that was added to > Postgres the other day. I ran a test against Postgres for ~15 minutes > that used 2 client backends and the BG writer - 8330804 calls to > strlcpy () were generated by the test. > Out of the 8330804 calls, 6226616 calls used a maximum copy size of > 2213 bytes e.g. strlcpy (dest, src, 2213) and 2104074 calls used a > maximum copy size of 64 bytes. > I know the 2213 size calls come from the set_ps_display () function. I > don't know where the 64 size calls come from, yet. Prepared-statement and portal hashtable lookups, likely. Were your clients using V3 extended query protocol? regards, tom lane
Tom,

Yes, the clients are using the V3 protocol and prepared statements.

David
Tom Lane wrote: > Neil Conway <neilc@samurai.com> writes: > > A wholesale replacement of strncpy() calls is probably worth doing -- > > replacing them with strlcpy() if the source string is NUL-terminated, > > and I suppose memcpy() otherwise. > > What I'd like to do immediately is put in strlcpy() and hit the two or > three places I think are performance-relevant. I agree with trying to > get rid of StrNCpy/strncpy calls over the long run, but it's just code > beautification and probably not appropriate for beta. Added to TODO: * Use strlcpy() rather than our StrNCpy() macro -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > You'll notice that it iterates once per char. Between that and the > > strlen() call in Tom's version, not sure which is the lesser evil. > > Yeah, I was wondering that too. My code would require two scans of the > source string (one inside strlen and one in memcpy), but in much of our > usage the source and dest should be reasonably well aligned and one > could expect memcpy to be using word rather than byte operations, so you > might possibly make it back on the strength of fewer write cycles. And > on the third hand, for short source strings none of this matters and > the extra function call involved for strlen/memcpy probably dominates. > > I'm happy to just use the OpenBSD version as a src/port module. > Any objections? I found this URL about the function history of strlcpy(): http://www.gratisoft.us/todd/papers/strlcpy.html I added the URL to port/strlcpy.c. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom, FWIW, Tom Daly did some SpecJAppserver runs on the latest snapshot and didn't show any reduction in text parsing overhead. Unfortunately, he's gone on vacation now so I can't get details. I'm going to try to set up some tests using TPCE to see if it's affected. --Josh
> I'm still interested to experiment with MemSet-then-strlcpy > for namestrcpy, but given the LENCPY results this may be a loser too. Um, why not strlcpy then MemSet the rest ? Andreas
On Tue, Oct 03, 2006 at 10:24:10AM +0200, Zeugswetter Andreas DCP SD wrote:
> > I'm still interested to experiment with MemSet-then-strlcpy
> > for namestrcpy, but given the LENCPY results this may be a loser too.
> Um, why not strlcpy then MemSet the rest ?

That's what strncpy() is supposed to be doing.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
> > > I'm still interested to experiment with MemSet-then-strlcpy for > > > namestrcpy, but given the LENCPY results this may be a loser too. > > Um, why not strlcpy then MemSet the rest ? > > That's what strncpy() is supposed to be doing. Yes, but it obviously does not in some ports, and that was the main problem as I interpreted it. Andreas
On Tue, Oct 03, 2006 at 01:56:57PM +0200, Zeugswetter Andreas DCP SD wrote:
>
> > > > I'm still interested to experiment with MemSet-then-strlcpy for
> > > > namestrcpy, but given the LENCPY results this may be a loser too.
> > > Um, why not strlcpy then MemSet the rest ?
> > That's what strncpy() is supposed to be doing.
> Yes, but it obviously does not in some ports, and that was the main
> problem as I interpreted it.

I think re-writing libc isn't really a PostgreSQL goal.

strlcpy() is theoretically cheaper than strncpy() as it is not required
by any standard to do as much work. memcpy() is cheaper because it can
be more easily parallelized, and can increment multiple bytes with each
loop interval, as it does not need to slow down to look for a '\0' in
each byte. strlcpy() then memset() the remaining is strncpy().

If you are suggesting that a moderately tuned version of strncpy() be
used in the ports directory, this is an option, but it can't be faster
than strlcpy(). It's not possible. It would come down to whether there
was a security requirement that the last bytes were '\0' or not. I
haven't seen anybody mention this as a requirement.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> writes: >> I'm still interested to experiment with MemSet-then-strlcpy >> for namestrcpy, but given the LENCPY results this may be a loser too. > Um, why not strlcpy then MemSet the rest ? Two reasons: * The main point is to do the zeroing using word-wide operations, and if you do the above then memset will probably be facing a zeroing request that is neither word-aligned nor word-length. It may be able to recover (doing it partly byte-wide and partly word-wide), but this will easily eat up the time savings of skipping the first couple words. * On compilers that treat memset as a builtin, there are significant advantages to doing memset with a constant length: the compiler might be able to unroll the loop entirely. (I was annoyed to find that FC5's gcc on x86_64 seems to understand very well how to inline a constant length memcpy, but not memset :-(.) I did actually do some experiments with the above yesterday, and found that it was a significant win on an old x86 (with about a 10-byte source string) but about a wash on newer architectures. regards, tom lane
mark@mark.mielke.cc writes: > ... It would come down to > whether there was a security requirements that the last bytes > were '\0' or not. I haven't seen anybody mention this as a > requirement. I think it is a requirement for namestrcpy (because the result might end up on disk), but not elsewhere. regards, tom lane
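[A minimal sketch of the two orderings being weighed in this sub-thread,
with made-up helper names, NAMEDATALEN written out as 64, and a strlcpy()
declaration assumed to be in scope.  Both variants still zero the entire
buffer, which satisfies the on-disk padding requirement just mentioned;
the real namestrcpy() in the backend is the authoritative version, and
this only illustrates why the memset-first form hands memset a constant,
word-aligned length:]

#include <string.h>

#define NAMEDATALEN 64

size_t		strlcpy(char *dst, const char *src, size_t siz);

/* Zero first: memset sees a constant, word-aligned length (inlinable) */
static void
zero_then_copy(char *name, const char *str)
{
	memset(name, 0, NAMEDATALEN);
	strlcpy(name, str, NAMEDATALEN);
}

/* Copy first: memset gets a variable length starting at an odd offset */
static void
copy_then_zero(char *name, const char *str)
{
	size_t		len = strlcpy(name, str, NAMEDATALEN);

	if (len >= NAMEDATALEN)
		len = NAMEDATALEN - 1;
	memset(name + len + 1, 0, NAMEDATALEN - len - 1);
}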
We were just analyzing some more OProfile and ltrace data against
Postgres 8.2Beta1 and we noticed a number of calls as follows:

strlen("postgres: tpc tpc 192.168.1.200("...) = 58
memset(0xbffff6b2, '\000', 2344) = 0xbffff6b2

We have tracked this down to the following code in the set_ps_display ()
function:

#ifdef PS_USE_CLOBBER_ARGV
	{
		int		buflen;

		/* pad unused memory */
		buflen = strlen(ps_buffer);
		MemSet(ps_buffer + buflen, PS_PADDING, ps_buffer_size - buflen);
	}
#endif   /* PS_USE_CLOBBER_ARGV */

If set_ps_display () moves to use the strlcpy () function call, this
code might be redundant. Even if the StrNCpy () call is kept, this code
may still be redundant as StrNCpy () will zero fill the ps_buffer. A
MemSet () call on the ps_buffer has to be added to the init_ps_display ()
function, if this code is removed, to clear the buffer before use.

David
>>>>> "ZA" == Zeugswetter Andreas DCP SD <ZeugswetterA@spardat.at> writes: ZA> Yes, but it obviously does not in some ports, and that was the ZA> main problem as I interpreted it. strncpy is part of POSIX; I highly doubt anyone gets it wrong. Getting sane semantics from it does require manually writing null to the last location in the buffer. If you need the null byte padding feature of strncpy, it seems a bit silly to replace it with strlcpy. Particularly if the only reason is performance. Down that road lies pglibc. /Benny
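[The idiom being referred to -- strncpy() alone does not guarantee
termination when the source is too long, so the last byte is forced to
NUL by hand -- looks like this; the function name and fixed 64-byte size
are only for illustration:]

#include <string.h>

static void
copy_name_padded(char dst[64], const char *src)
{
	strncpy(dst, src, 64);		/* copies and zero-pads to 64 bytes */
	dst[63] = '\0';				/* guarantee termination even on truncation */
}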
This is from October 2006. Is there a TODO here?

---------------------------------------------------------------------------

Tom Lane wrote:
> I did a couple more tests using x86 architectures. On a rather old
> Pentium-4 machine running Fedora 5 (gcc 4.1.1, glibc-2.4-11):
>
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c
> NONE: 786305 us
> MEMCPY: 9887843 us
> STRNCPY: 15045780 us
> STRLCPY: 17555563 us
> U_STRLCPY: 14994639 us
> LENCPY: 19700346 us
>
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c
> NONE: 562001 us
> MEMCPY: 2026546 us
> STRNCPY: 11149134 us
> STRLCPY: 13747827 us
> U_STRLCPY: 12467527 us
> LENCPY: 17672899 us
>
> (STRLCPY is our CVS HEAD code, U_STRLCPY is the unrolled version)
>
> On a Mac Mini (Intel Core Duo, OS X 10.4.8, gcc 4.0.1), the system has a
> version of strlcpy, but it seems to suck :-(
>
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
> NONE: 480298 us
> MEMCPY: 7857291 us
> STRNCPY: 10485948 us
> STRLCPY: 16745154 us
> U_STRLCPY: 18337286 us
> S_STRLCPY: 20920213 us
> LENCPY: 22878114 us
>
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
> NONE: 480269 us
> MEMCPY: 1858974 us
> STRNCPY: 5405618 us
> STRLCPY: 16364452 us
> U_STRLCPY: 16439753 us
> S_STRLCPY: 19134538 us
> LENCPY: 22873141 us
>
> It's interesting that the unrolled version is actually slower here.
> I didn't dig into the assembly code, but maybe Apple's compiler isn't
> doing a very good job with it?
>
> Anyway, these results make me less excited about the unrolled version.
>
> In any case, I don't think we should put too much emphasis on the
> long-source-string case. In essentially all cases, the true source
> string length will be much shorter than the target buffer (were this
> not so, we'd probably be needing to make the buffer bigger), and strlcpy
> will certainly beat out strncpy in those situations. The memcpy numbers
> look attractive, but they ignore the problem that in practice we usually
> don't know the source string length in advance --- so I don't think
> those represent something achievable.
>
> One thing that seems real clear is that the LENCPY method loses across
> the board, which surprised me, but it's hard to argue with numbers.
>
> I'm still interested to experiment with MemSet-then-strlcpy for
> namestrcpy, but given the LENCPY results this may be a loser too.
>
> 			regards, tom lane

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > This is from October 2006. Is there a TODO here? I think we had decided that the code that's in there is fine. regards, tom lane