Thread: Faster StrNCpy
David Strong points out here
http://archives.postgresql.org/pgsql-hackers/2006-09/msg02071.php
that some popular implementations of strncpy(dst,src,n) are quite
inefficient when strlen(src) is much less than n, because they don't
optimize the zero-pad step that is required by the standard.

It looks to me like we have a good number of places that are using
either StrNCpy or strncpy directly to copy into large buffers that we
do not need full zero-padding in, only a single guaranteed null byte.
While not all of these places are in performance-critical paths, some
are.  David identified set_ps_display, and the other thing that's
probably significant is unnecessary use of strncpy for keys of
string-keyed hash tables.  (We used to actually need zero padding for
string-keyed hash keys, but that was a long time ago.)

I propose adding an additional macro in c.h, along the lines of

#define StrNCopy(dst,src,len) \
	do \
	{ \
		char * _dst = (dst); \
		Size _len = (len); \
\
		if (_len > 0) \
		{ \
			const char * _src = (src); \
			Size _src_len = strlen(_src); \
\
			if (_src_len < _len) \
				memcpy(_dst, _src, _src_len + 1); \
			else \
			{ \
				memcpy(_dst, _src, _len - 1); \
				_dst[_len-1] = '\0'; \
			} \
		} \
	} while (0)

Unlike StrNCpy, this requires that the source string be null-terminated,
so it would not be a drop-in replacement everywhere.  Also, it could be
a performance loss if strlen(src) is much larger than len ... but that
is not usually the case for the places we'd want to apply it.

Thoughts, objections?  In particular, is the name OK, or do we need
something a bit further away from StrNCpy?

			regards, tom lane
On Tue, Sep 26, 2006 at 04:24:51PM -0400, Tom Lane wrote: > David Strong points out here > http://archives.postgresql.org/pgsql-hackers/2006-09/msg02071.php > that some popular implementations of strncpy(dst,src,n) are quite > inefficient when strlen(src) is much less than n, because they don't > optimize the zero-pad step that is required by the standard. I think that's why strlcpy was invented, to deal with the issues with strncpy. http://www.gratisoft.us/todd/papers/strlcpy.html There's an implementation here (used in glib), though you could probably find more. http://mail.gnome.org/archives/gtk-devel-list/2000-May/msg00029.html Do you really think it's worth making a macro rather than just a normal function? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote: > On Tue, Sep 26, 2006 at 04:24:51PM -0400, Tom Lane wrote: > > David Strong points out here > > http://archives.postgresql.org/pgsql-hackers/2006-09/msg02071.php > > that some popular implementations of strncpy(dst,src,n) are quite > > inefficient when strlen(src) is much less than n, because they don't > > optimize the zero-pad step that is required by the standard. > > I think that's why strlcpy was invented, to deal with the issues with > strncpy. > > http://www.gratisoft.us/todd/papers/strlcpy.html > > There's an implementation here (used in glib), though you could > probably find more. > > http://mail.gnome.org/archives/gtk-devel-list/2000-May/msg00029.html That one would be LGPL (glib's license). Here is OpenBSD's version, linked from that one: ftp://ftp.openbsd.org/pub/OpenBSD/src/lib/libc/string/strlcpy.c You'll notice that it iterates once per char. Between that and the strlen() call in Tom's version, not sure which is the lesser evil. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
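[For reference, the OpenBSD implementation linked above is essentially the
following byte-at-a-time loop -- paraphrased from memory here rather than
quoted verbatim, so see the ftp URL for the authoritative file.  It copies
at most siz-1 bytes, always NUL-terminates the destination when siz > 0,
and returns strlen(src) so callers can detect truncation.]

#include <string.h>

size_t
strlcpy(char *dst, const char *src, size_t siz)
{
	char	   *d = dst;
	const char *s = src;
	size_t		n = siz;

	/* Copy as many bytes as will fit, one byte per loop iteration */
	if (n != 0)
	{
		while (--n != 0)
		{
			if ((*d++ = *s++) == '\0')
				break;
		}
	}

	/* Not enough room in dst: add NUL and walk the rest of src */
	if (n == 0)
	{
		if (siz != 0)
			*d = '\0';			/* NUL-terminate dst */
		while (*s++)
			;
	}

	return (s - src - 1);		/* count does not include the NUL */
}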
Martijn van Oosterhout <kleptog@svana.org> writes: > I think that's why strlcpy was invented, to deal with the issues with > strncpy. > http://www.gratisoft.us/todd/papers/strlcpy.html strlcpy does more than we need (note that none of the existing uses care about counting the overflowed bytes). Not sure if it's worth adopting those semantics when they're not really standard, but if you think a lot of people would be familiar with strlcpy, maybe we should. > Do you really think it's worth making a macro rather than just a normal > function? Only in that a macro in c.h is less work than a configure test plus a replacement file in src/port. But if we want to consider this a standard function that just doesn't happen to exist everywhere, I suppose we should use configure. regards, tom lane
On Tue, 2006-09-26 at 16:53 -0400, Tom Lane wrote: > strlcpy does more than we need (note that none of the existing uses care > about counting the overflowed bytes). Not sure if it's worth adopting > those semantics when they're not really standard, but if you think a lot > of people would be familiar with strlcpy, maybe we should. I think we should -- while strlcpy() is not standardized, it is widely used (in libc on all the BSDs, Solaris and OS X, as well as private copies in Linux, glib, etc.). A wholesale replacement of strncpy() calls is probably worth doing -- replacing them with strlcpy() if the source string is NUL-terminated, and I suppose memcpy() otherwise. -Neil
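[To make the proposed substitution concrete, a hypothetical call site --
not code from the tree -- might change like this, assuming a strlcpy()
declaration is in scope (either from libc or a private copy like the one
being discussed):]

#include <string.h>

size_t		strlcpy(char *dst, const char *src, size_t siz);

void
copy_examples(const char *src, const char *raw, size_t rawlen)
{
	char		buf[1024];

	/* Before: strncpy zero-pads the entire 1024-byte buffer */
	strncpy(buf, src, sizeof(buf));
	buf[sizeof(buf) - 1] = '\0';

	/* After, when src is known to be NUL-terminated */
	strlcpy(buf, src, sizeof(buf));

	/* After, when the source is counted bytes rather than a C string */
	if (rawlen >= sizeof(buf))
		rawlen = sizeof(buf) - 1;
	memcpy(buf, raw, rawlen);
	buf[rawlen] = '\0';
}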
Alvaro Herrera <alvherre@commandprompt.com> writes: > You'll notice that it iterates once per char. Between that and the > strlen() call in Tom's version, not sure which is the lesser evil. Yeah, I was wondering that too. My code would require two scans of the source string (one inside strlen and one in memcpy), but in much of our usage the source and dest should be reasonably well aligned and one could expect memcpy to be using word rather than byte operations, so you might possibly make it back on the strength of fewer write cycles. And on the third hand, for short source strings none of this matters and the extra function call involved for strlen/memcpy probably dominates. I'm happy to just use the OpenBSD version as a src/port module. Any objections? regards, tom lane
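[For comparison with the per-character loop above, the strlen+memcpy
approach from the opening message looks roughly like this when written as
a function instead of a macro -- the helper name is made up for the
illustration.  It makes two passes over the source (one in strlen, one in
memcpy) but writes no padding and requires src to be NUL-terminated:]

#include <string.h>

static void
str_n_copy(char *dst, const char *src, size_t len)
{
	if (len > 0)
	{
		size_t		src_len = strlen(src);

		if (src_len < len)
			memcpy(dst, src, src_len + 1);	/* includes the terminating NUL */
		else
		{
			memcpy(dst, src, len - 1);
			dst[len - 1] = '\0';
		}
	}
}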
Neil Conway <neilc@samurai.com> writes: > A wholesale replacement of strncpy() calls is probably worth doing -- > replacing them with strlcpy() if the source string is NUL-terminated, > and I suppose memcpy() otherwise. What I'd like to do immediately is put in strlcpy() and hit the two or three places I think are performance-relevant. I agree with trying to get rid of StrNCpy/strncpy calls over the long run, but it's just code beautification and probably not appropriate for beta. regards, tom lane
Tom, > What I'd like to do immediately is put in strlcpy() and hit the two or > three places I think are performance-relevant. I agree with trying to > get rid of StrNCpy/strncpy calls over the long run, but it's just code > beautification and probably not appropriate for beta. Immediately? Presumably you mean for 8.3? -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> What I'd like to do immediately is put in strlcpy() and hit the two or >> three places I think are performance-relevant. > Immediately? Presumably you mean for 8.3? No, I mean now. This is a performance bug and it's still open season on bugs. If we were close to having a release-candidate version, I'd hold off, but the above proposal seems sufficiently low-risk for the current stage of the cycle. regards, tom lane
Tom,

Let us know when you've added strlcpy () and we'll be happy to run some
tests on the new code.

David
Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >>> What I'd like to do immediately is put in strlcpy() and hit the two or >>> three places I think are performance-relevant. >>> > > >> Immediately? Presumably you mean for 8.3? >> > > No, I mean now. This is a performance bug and it's still open season on > bugs. If we were close to having a release-candidate version, I'd hold > off, but the above proposal seems sufficiently low-risk for the current > stage of the cycle. > > What are the other hotspots? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> What I'd like to do immediately is put in strlcpy() and hit the two or >> three places I think are performance-relevant. > What are the other hotspots? The ones I can think of offhand are set_ps_display and use of strncpy as a HashCopyFunc. regards, tom lane
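[As an illustration of the hash-key case: a string-keyed hash table only
needs the key NUL-terminated, yet a strncpy-based copy routine pads the
whole fixed-size key buffer.  The helper below uses a hypothetical name
and signature (not PostgreSQL's actual HashCopyFunc contract), shown only
to contrast the two behaviors:]

#include <string.h>

size_t		strlcpy(char *dst, const char *src, size_t siz);

/* Copy a string key into a fixed-size slot, guaranteeing termination */
static void *
string_key_copy(void *dest, const void *src, size_t keysize)
{
	/* strncpy(dest, src, keysize) would also zero-pad all keysize bytes; */
	/* strlcpy() stops after writing the terminating NUL:                 */
	strlcpy(dest, src, keysize);
	return dest;
}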
We sometimes see TupleDescInitEntry () taking high CPU times via OProfile.
This does include, amongst a lot of other code, a call to namestrcpy ()
which in turn calls StrNCpy ().  Perhaps this is not a good candidate
right now as a name string is only 64 bytes.

David
On Wed, Sep 27, 2006 at 07:08:05AM -0700, Strong, David wrote:
> We sometimes see TupleDescInitEntry () taking high CPU times via
> OProfile. This does include, amongst a lot of other code, a call to
> namestrcpy () which in turn calls StrNCpy (). Perhaps this is not a
> good candidate right now as a name string is only 64 bytes.

Just wondering - are any of these cases where a memcpy() would work just
as well? Or are you not sure that the source string is at least 64 bytes
in length?

    memcpy(&target, &source, sizeof(target));
    target[sizeof(target)-1] = '\0';

I imagine any extra checking causes processor stalls, or at least for the
branch prediction to fill up? Straight copies might allow for maximum
parallelism? If it's only 64 bytes, on processors such as Pentium or
Athlon, that's 2 or 4 cache lines, and writes are always performed as
cache lines.

I haven't seen the code that you and Tom are looking at to tell whether
it is safe to do this or not.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Hi all

I want to know what is going on while a DML command works. For example:
which commands are executed by the core when we send an "UPDATE tab SET
col = val1..." in case there is a foreign key or a unique constraint on
table "tab". How can I see that?

Best regards

Adnan DURSUN
ASRIN Bilişim Ltd.
On Thu, Sep 28, 2006 at 04:27:36AM +0300, Adnan DURSUN wrote:
>
> Hi all
>
> I wanna know what is going on while a DML command works. For example;
> Which commands are executed by the core when we send an "UPDATE tab
> SET col = val1..."

Adnan, this mailing list is not the right one for such questions. More
appropriate would be <pgsql-novice@postgresql.org> or maybe
<pgsql-general@postgresql.org>.

Having said that, you may set the log level of the server in the
configuration file (whose location depends on your OS and PostgreSQL
version). Look there for a line

    log_statements = XXX

and set XXX to 'all'. Don't forget to restart your server afterwards.

HTH
-- tomas
Mark,

In the specific case of the namestrcpy () function, it will copy a
maximum of 64 bytes, but the length of the source string is unknown. I
would have to think that memcpy () would certainly win if you knew the
source and destination sizes etc. Perhaps there are some places like
that in the code that don't use memcpy () currently?

David
"Strong, David" <david.strong@unisys.com> writes: > Just wondering - are any of these cases where a memcpy() would work > just as well? Or are you not sure that the source string is at least > 64 bytes in length? In most cases, we're pretty sure that it's *not* --- it'll just be a palloc'd C string. I'm disinclined to fool with the restriction that namestrcpy zero-pad Name values, because they might end up on disk, and allowing random memory contents to get written out is ungood from a security point of view. However, it's entirely possible that it'd be a bit faster to do a MemSet followed by strlcpy than to use strncpy for zero-padding. regards, tom lane
Hi, Tom, Tom Lane wrote: > "Strong, David" <david.strong@unisys.com> writes: >> Just wondering - are any of these cases where a memcpy() would work >> just as well? Or are you not sure that the source string is at least >> 64 bytes in length? > > In most cases, we're pretty sure that it's *not* --- it'll just be a > palloc'd C string. > > I'm disinclined to fool with the restriction that namestrcpy zero-pad > Name values, because they might end up on disk, and allowing random > memory contents to get written out is ungood from a security point of > view. There's another disadvantage of always copying 64 bytes: It may be that the 64-byte range crosses a page boundary. Now guess what happens when this next page is not mapped -> segfault. Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org
Markus Schaber <schabi@logix-tt.com> writes: > There's another disadvantage of always copying 64 bytes: > It may be that the 64-byte range crosses a page boundary. Now guess what > happens when this next page is not mapped -> segfault. Irrelevant, because in all interesting cases the Name field is part of a larger record that would stretch into that other page anyway. regards, tom lane
On Fri, Sep 29, 2006 at 11:21:21AM +0200, Markus Schaber wrote:
> Tom Lane wrote:
> >> Just wondering - are any of these cases where a memcpy() would work
> >> just as well? Or are you not sure that the source string is at least
> >> 64 bytes in length?
> >
> > In most cases, we're pretty sure that it's *not* --- it'll just be a
> > palloc'd C string.
> >
> > I'm disinclined to fool with the restriction that namestrcpy zero-pad
> > Name values, because they might end up on disk, and allowing random
> > memory contents to get written out is ungood from a security point of
> > view.
>
> There's another disadvantage of always copying 64 bytes:
>
> It may be that the 64-byte range crosses a page boundary. Now guess what
> happens when this next page is not mapped -> segfault.

With strncpy(), this possibility already exists. If it is a real problem,
that stand-alone 64-byte allocations are crossing page boundaries, the
fault is with the memory allocator, not with the user of the memory.

For strlcpy(), my suggestion that Tom quotes was that modern processors
do best when instructions can be fully parallelized. It is a lot easier
to parallelize a 64-byte copy, than a tight loop looking for '\0' or
n >= 64. 64 bytes easily fits into cache memory, and modern processors
write cache memory in blocks of 16, 32, or 64 bytes anyways, meaning that
any savings in terms of not writing are minimal.

But it's only safe if you know that the source string allocation is >= 64
bytes. Often you don't, therefore it isn't safe, and the suggestion is
unworkable.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
If anybody is curious, here are my numbers for an AMD X2 3800+:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be slow."' -o x x.c y.c strlcpy.c ; ./x
NONE: 620268 us
MEMCPY: 683135 us
STRNCPY: 7952930 us
STRLCPY: 10042364 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -o x x.c y.c strlcpy.c ; ./x
NONE: 554694 us
MEMCPY: 691390 us
STRNCPY: 7759933 us
STRLCPY: 3710627 us

$ gcc -O3 -std=c99 -DSTRING='""' -o x x.c y.c strlcpy.c ; ./x
NONE: 631266 us
MEMCPY: 775340 us
STRNCPY: 7789267 us
STRLCPY: 550430 us

Each invocation represents 100 million calls to each of the functions.
Each function accepts a 'dst' and 'src' argument, and assumes that it is
copying 64 bytes from 'src' to 'dst'. The none function does nothing.
The memcpy calls memcpy(), the strncpy calls strncpy(), and the strlcpy
calls the strlcpy() that was posted from the BSD sources. (GLIBC doesn't
have strlcpy() on my machine).

This makes it clear what the overhead of the additional logic involves.
memcpy() is approximately equal to nothing at all. strncpy() is always
expensive. strlcpy() is often more expensive than memcpy(), except in
the empty string case.

These tests do not properly model the effects of real memory, however,
they do model the effects of cache memory. I would suggest that the
results are exaggerated, but not invalid.

For anybody doubting the none vs memcpy, I've included the generated
assembly code. I chalk it entirely up to fully utilizing the
parallelization capability of the CPU. Although 16 movq instructions are
executed, they can be executed fully in parallel.

It almost makes it clear to me that all of these instructions are pretty
fast. Are we sure this is a real bottleneck? Even the slowest operation
above, strlcpy() on a very long string, appears to execute 10 per
microsecond? Perhaps my tests are too easy for my CPU and I need to make
it access many different 64-byte blocks? :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Attachment
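[A rough reconstruction of the kind of harness being described -- not
Mark's actual x.c/y.c test files -- might look like the following: each
routine copies from a 64-byte source into a 64-byte destination, is called
through a function pointer to discourage inlining, and is timed with
gettimeofday().  The file name and exact structure are made up; build with
something like gcc -O3 -std=c99 bench.c.]

#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define CALLS 100000000L
#define BUFSZ 64

static void x_none(char *dst, const char *src) { (void) dst; (void) src; }
static void x_memcpy(char *dst, const char *src) { memcpy(dst, src, BUFSZ); }
static void x_strncpy(char *dst, const char *src) { strncpy(dst, src, BUFSZ); }

static void
time_one(const char *name, void (*fn) (char *, const char *), const char *src)
{
	char		dst[BUFSZ];
	struct timeval start, stop;

	gettimeofday(&start, NULL);
	for (long i = 0; i < CALLS; i++)
		fn(dst, src);
	gettimeofday(&stop, NULL);

	/* report elapsed wall-clock time in microseconds */
	printf("%8s: %ld us\n", name,
		   (long) ((stop.tv_sec - start.tv_sec) * 1000000L +
				   (stop.tv_usec - start.tv_usec)));
}

int
main(void)
{
	const char	src[BUFSZ] = "Short sentence.";

	time_one("NONE", x_none, src);
	time_one("MEMCPY", x_memcpy, src);
	time_one("STRNCPY", x_strncpy, src);
	return 0;
}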
mark@mark.mielke.cc writes: > If anybody is curious, here are my numbers for an AMD X2 3800+: You did not show your C code, so no one else can reproduce the test on other hardware. However, it looks like your compiler has unrolled the memcpy into straight-line 8-byte moves, which makes it pretty hard for anything operating byte-wise to compete, and is a bit dubious for the general case anyway (since it requires assuming that the size and alignment are known at compile time). This does make me wonder about whether we shouldn't try the strlen+memcpy implementation I proposed earlier ... regards, tom lane
On Fri, Sep 29, 2006 at 05:34:30PM -0400, Tom Lane wrote:
> mark@mark.mielke.cc writes:
> > If anybody is curious, here are my numbers for an AMD X2 3800+:
> You did not show your C code, so no one else can reproduce the test on
> other hardware. However, it looks like your compiler has unrolled the
> memcpy into straight-line 8-byte moves, which makes it pretty hard for
> anything operating byte-wise to compete, and is a bit dubious for the
> general case anyway (since it requires assuming that the size and
> alignment are known at compile time).

I did show the .s code. I call into x_memcpy(a, b), meaning that the
compiler can't assume anything. It may happen to be aligned.

Here are results over 64 Mbytes of memory, to ensure that every call is
a cache miss:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 767243 us
MEMCPY: 6044137 us
STRNCPY: 10741759 us
STRLCPY: 12061630 us
LENCPY: 9459099 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 712193 us
MEMCPY: 6072312 us
STRNCPY: 9982983 us
STRLCPY: 6605052 us
LENCPY: 7128258 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 708164 us
MEMCPY: 6042817 us
STRNCPY: 8885791 us
STRLCPY: 5592477 us
LENCPY: 6135550 us

At least on my machine, memcpy() still comes out on top. Yes, assuming
that it is aligned correctly for the machine.

Here is unaligned (all arrays are stored +1 offset in memory):

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 790932 us
MEMCPY: 6591559 us
STRNCPY: 10622291 us
STRLCPY: 12070007 us
LENCPY: 10322541 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 764577 us
MEMCPY: 6631731 us
STRNCPY: 9513540 us
STRLCPY: 6615345 us
LENCPY: 7263392 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -DALIGN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 825689 us
MEMCPY: 6607777 us
STRNCPY: 8976487 us
STRLCPY: 5878088 us
LENCPY: 6180358 us

Alignment looks like it does impact the results for memcpy(). memcpy()
changes from around 6.0 seconds to 6.6 seconds. Overall, though, it is
still the winner in all cases except for strlcpy(), which beats it on
very short strings ("").

Here is the cache hit case including your strlen+memcpy as 'LENCPY':

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 696157 us
MEMCPY: 825118 us
STRNCPY: 7983159 us
STRLCPY: 10787462 us
LENCPY: 6048339 us

$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 700201 us
MEMCPY: 593701 us
STRNCPY: 7577380 us
STRLCPY: 3727801 us
LENCPY: 3169783 us

$ gcc -O3 -std=c99 -DSTRING='""' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 706283 us
MEMCPY: 792719 us
STRNCPY: 7870425 us
STRLCPY: 681334 us
LENCPY: 2062983 us

First call was every call being a cache hit. With this one, every one is
a cache miss, and the 64-byte blocks are spread equally over 64 Mbytes
of memory.

I've attached the code for your consideration. x.c is the routines I
used to perform the tests. y.c is the main program. strlcpy.c is copied
from the online reference as is without change. The compilation steps
are described above. STRING is the string to try out. N is the number of
64-byte blocks to allocate. ALIGN is the number of bytes to offset the
array by when storing / reading / writing. ALIGN should be >= 0. At N=1,
it's all in cache. At N=1024*1024 it is taking up 64 Mbytes of RAM.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Attachment
Mark,

Thanks for attaching the C code for your test. I ran a few tests on a
3Ghz Intel Xeon Paxville (dual core) system. I hope the formatting of
this table survives:

Method    Size   N=1024*1024      N=1
MEMCPY     63     6964927 us    582494 us
MEMCPY     32     7102497 us    582467 us
MEMCPY     16     7116358 us    582538 us
MEMCPY      8     6965239 us    582796 us
MEMCPY      4     6964722 us    583183 us
STRNCPY    63    10131174 us   8843010 us
STRNCPY    32    10648202 us   9563868 us
STRNCPY    16     9187398 us   7969947 us
STRNCPY     8     9275353 us   8042777 us
STRNCPY     4     9067570 us   8058532 us
STRLCPY    63    15045507 us  14379702 us
STRLCPY    32     8960303 us   8120471 us
STRLCPY    16     7393607 us   4915457 us
STRLCPY     8     7222983 us   3211931 us
STRLCPY     4     7181267 us   1725546 us
LENCPY     63     7608932 us   4416602 us
LENCPY     32     7252849 us   3807535 us
LENCPY     16    11680927 us  10331487 us
LENCPY      8    10409685 us   9660616 us
LENCPY      4    10824632 us   9525082 us

The first column is the copy method, the second column is the source
string size (size of -DSTRING), the 3rd and 4th columns are the
different -DN settings.

The memcpy () call is the clear winner, at all source string sizes. The
strncpy () call is better than strlcpy (), until the size of the string
decreases. This is probably due to the zero padding effect of strncpy.
The lencpy () call starts out strong, but degrades as the size of the
string decreases. This was a little surprising and I don't have an
explanation for this behavior at this time.

The AMD optimization manuals have some interesting examples for
optimizations for memcpy, along the lines of cache line copies and
prefetching:

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF#search=%22amd%20optimization%20manual%22
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf#search=%22amd%20optimization%20manual%22

There also used to be an interesting article on the SGI web site called
"Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320
and 540", but this seems to have been pulled. I did find a copy of the
article here:

http://eunchul.com/database/board/cat.php?data=Win32_API&board_group=D42a8ff5c3a9b9

Obviously, different copy mechanisms suit different data sizes. So, I
added a little debug to the strlcpy () function that was added to
Postgres the other day. I ran a test against Postgres for ~15 minutes
that used 2 client backends and the BG writer - 8330804 calls to
strlcpy () were generated by the test.

Out of the 8330804 calls, 6226616 calls used a maximum copy size of 2213
bytes e.g. strlcpy (dest, src, 2213) and 2104074 calls used a maximum
copy size of 64 bytes.

I know the 2213 size calls come from the set_ps_display () function. I
don't know where the 64 size calls come from, yet.

In the 64 size case, with the exception of 35 calls, calls for size 64
are only copying 1 byte - I would assume this is a NULL.

In the 2213 size case, 1933027 calls copy 20 bytes; 2189415 calls copy 5
bytes; 85550 calls copy 6 bytes and 2018482 calls copy 7 bytes.

Based on this data, it would seem that either memcpy () or strlcpy ()
calls would be better due to the source string size. Calls originating
from the set_ps_display () function might be able to use the memcpy ()
call as the size of the source string should be known. The other calls
probably need something like strlcpy () as the source string might not
be known, although using memcpy () to copy in XX byte blocks might be
interesting.

David
mark@mark.mielke.cc writes:
> Here is the cache hit case including your strlen+memcpy as 'LENCPY':
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
> NONE: 696157 us
> MEMCPY: 825118 us
> STRNCPY: 7983159 us
> STRLCPY: 10787462 us
> LENCPY: 6048339 us

It appears that these results are a bit platform-dependent; on my x86_64
(Xeon) Fedora 5 box, I get

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 x.c y.c strlcpy.c
$ ./a.out
NONE: 358679 us
MEMCPY: 619255 us
STRNCPY: 8932551 us
STRLCPY: 9212371 us
LENCPY: 13910413 us

I'm not sure why the lencpy method sucks so badly on this machine :-(.

Anyway, I looked at glibc's strncpy and determined that on this machine
the only real optimization that's been done to it is to unroll the data
copying loop four times. I did the same to strlcpy (attached) and got
numbers like these:

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 x.c y.c strlcpy.c
$ ./a.out
NONE: 359317 us
MEMCPY: 619636 us
STRNCPY: 8933507 us
STRLCPY: 7644576 us
LENCPY: 13917927 us

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" x.c y.c strlcpy.c
$ ./a.out
NONE: 502960 us
MEMCPY: 5382528 us
STRNCPY: 9733890 us
STRLCPY: 8740892 us
LENCPY: 15358616 us

$ gcc -O3 -std=c99 -DSTRING='"short"' -DN=1 x.c y.c strlcpy.c
$ ./a.out
NONE: 358426 us
MEMCPY: 618533 us
STRNCPY: 6704926 us
STRLCPY: 867336 us
LENCPY: 10115883 us

$ gcc -O3 -std=c99 -DSTRING='"short"' -DN="(1024*1024)" x.c y.c strlcpy.c
$ ./a.out
NONE: 502746 us
MEMCPY: 5365171 us
STRNCPY: 7983610 us
STRLCPY: 5557277 us
LENCPY: 11533066 us

So the unroll seems to get us to the point of not losing compared to the
original strncpy code for any string length, and so I propose doing
that, if it holds up on other architectures.

			regards, tom lane

size_t
strlcpy(char *dst, const char *src, size_t siz)
{
	char	   *d = dst;
	const char *s = src;
	size_t		n = siz;

	/* Copy as many bytes as will fit */
	if (n != 0)
	{
		while (n > 4)
		{
			if ((*d++ = *s++) == '\0')
				goto done;
			if ((*d++ = *s++) == '\0')
				goto done;
			if ((*d++ = *s++) == '\0')
				goto done;
			if ((*d++ = *s++) == '\0')
				goto done;
			n -= 4;
		}
		while (--n != 0)
		{
			if ((*d++ = *s++) == '\0')
				goto done;
		}
	}

	/* Not enough room in dst, add NUL and traverse rest of src */
	if (siz != 0)
		*d = '\0';				/* NUL-terminate dst */
	while (*s++)
		;

done:
	return (s - src - 1);		/* count does not include NUL */
}
Mark,

On 9/29/06 2:59 PM, "mark@mark.mielke.cc" <mark@mark.mielke.cc> wrote:
> Here are results over 64 Mbytes of memory, to ensure that every call is
> a cache miss:

On my Mac OSX intel laptop (Core Duo, 2.16 GHz, 2GB RAM, gcc 4.01):

Luke-Lonergans-Computer:~/strNcpy-perf-test lukelonergan$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
/usr/bin/ld: warning multiple definitions of symbol _strlcpy
/var/tmp//cc1eMVq7.o definition of _strlcpy in section (__TEXT,__text)
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib(strlcpy.So) definition of _strlcpy
NONE: 416677 us
MEMCPY: 5649587 us
STRNCPY: 5806591 us
STRLCPY: 12865010 us
LENCPY: 17801485 us

Luke-Lonergans-Computer:~/strNcpy-perf-test lukelonergan$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
/usr/bin/ld: warning multiple definitions of symbol _strlcpy
/var/tmp//ccOZl9R6.o definition of _strlcpy in section (__TEXT,__text)
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib(strlcpy.So) definition of _strlcpy
NONE: 416652 us
MEMCPY: 5830540 us
STRNCPY: 6207594 us
STRLCPY: 5582607 us
LENCPY: 7887703 us

Luke-Lonergans-Computer:~/strNcpy-perf-test lukelonergan$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
/usr/bin/ld: warning multiple definitions of symbol _strlcpy
/var/tmp//ccBUsIdR.o definition of _strlcpy in section (__TEXT,__text)
/usr/lib/gcc/i686-apple-darwin8/4.0.1/../../../libSystem.dylib(strlcpy.So) definition of _strlcpy
NONE: 417210 us
MEMCPY: 5506346 us
STRNCPY: 5769302 us
STRLCPY: 5424234 us
LENCPY: 5609338 us

- Luke
Thanks for the results on your machines Dave, Luke, and Tom. Interesting.

On Mon, Oct 02, 2006 at 02:30:11PM -0400, Tom Lane wrote:
> It appears that these results are a bit platform-dependent; on my x86_64
> (Xeon) Fedora 5 box, I get

*nod*

> Anyway, I looked at glibc's strncpy and determined that on this machine
> the only real optimization that's been done to it is to unroll the data
> copying loop four times. I did the same to strlcpy (attached) and got
> numbers like these:

> So the unroll seems to get us to the point of not losing compared to the
> original strncpy code for any string length, and so I propose doing
> that, if it holds up on other architectures.

Sounds reasonable. strlcpy() is the best match. I'm not sure why GLIBC
doesn't come with strlcpy(). Apparently some philosophical differences
against it by the GLIBC maintainers... :-)

memcpy() makes too many assumptions, that would be difficult to prove in
each case, or may be difficult to ensure that they remain true. I only
mentioned it as a possibility, to keep the options open, and to show the
sort of effects we are dealing with.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
Mark, Tom,

Just the test on IA64 (Itanium2, 1.6Ghz, 8Gb memory). The results seem
to be quite different:

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 825671 us
MEMCPY: 4533637 us
STRNCPY: 10959380 us
STRLCPY: 34258306 us
LENCPY: 13939373 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="1" -o x x.c y.c strlcpy.c ; ./x
NONE: 938815 us
MEMCPY: 3441330 us
STRNCPY: 5881628 us
STRLCPY: 20167825 us
LENCPY: 11326259 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 831920 us
MEMCPY: 4550683 us
STRNCPY: 9339254 us
STRLCPY: 16871433 us
LENCPY: 21332076 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='"Short sentence."' -DN="1" -o x x.c y.c strlcpy.c ; ./x
NONE: 938104 us
MEMCPY: 3438871 us
STRNCPY: 3438969 us
STRLCPY: 6042829 us
LENCPY: 6022909 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='""' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 825472 us
MEMCPY: 4547456 us
STRNCPY: 8591618 us
STRLCPY: 6818957 us
LENCPY: 8264340 us

math@ptah:~/test$ gcc -O3 -std=c99 -DSTRING='""' -DN="1" -o x x.c y.c strlcpy.c ; ./x
NONE: 937878 us
MEMCPY: 3439101 us
STRNCPY: 3188791 us
STRLCPY: 750437 us
LENCPY: 2751238 us

Regards,
	Sergey

*******************************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Sternberg Astronomical Institute
Tel: +49-6221-528-349
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru
"Sergey E. Koposov" <math@sai.msu.ru> writes: > Just the test on IA64 (Itanium2, 1.6Ghz, 8Gb memory). The results seem to > be quite different: What libc are you using exactly? Can you try it with the unrolled strlcpy I posted? In glibc-2.4.90, there seem to be out-of-line assembly code implementations of strncpy for: sparc64 sparc32 s390x s390 alpha ia64 and an inlined assembler version for i386. So the x86_64 case is nearly the only popular architecture that doesn't seem to have a hand-hacked implementation ... which throws some doubt on Mark's and my results as possibly not being very representative. regards, tom lane
On Mon, 2 Oct 2006, Tom Lane wrote: > "Sergey E. Koposov" <math@sai.msu.ru> writes: >> Just the test on IA64 (Itanium2, 1.6Ghz, 8Gb memory). The results seem to >> be quite different: > > What libc are you using exactly? Can you try it with the unrolled > strlcpy I posted? glibc 2.3.5 , gcc 3.4.4 my results were obtained already with your unrolled version (but when I first ran it with the 'default' strlcpy the results were the same). > > In glibc-2.4.90, there seem to be out-of-line assembly code > implementations of strncpy for: sparc64 sparc32 s390x s390 alpha ia64 > and an inlined assembler version for i386. So the x86_64 case is > nearly the only popular architecture that doesn't seem to have > a hand-hacked implementation ... which throws some doubt on Mark's and > my results as possibly not being very representative. Regards, Sergey ******************************************************************* Sergey E. Koposov Max Planck Institute for Astronomy/Sternberg Astronomical Institute Tel: +49-6221-528-349 Web: http://lnfm1.sai.msu.ru/~math E-mail: math@sai.msu.ru
I did a couple more tests using x86 architectures. On a rather old
Pentium-4 machine running Fedora 5 (gcc 4.1.1, glibc-2.4-11):

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c
NONE: 786305 us
MEMCPY: 9887843 us
STRNCPY: 15045780 us
STRLCPY: 17555563 us
U_STRLCPY: 14994639 us
LENCPY: 19700346 us

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c
NONE: 562001 us
MEMCPY: 2026546 us
STRNCPY: 11149134 us
STRLCPY: 13747827 us
U_STRLCPY: 12467527 us
LENCPY: 17672899 us

(STRLCPY is our CVS HEAD code, U_STRLCPY is the unrolled version)

On a Mac Mini (Intel Core Duo, OS X 10.4.8, gcc 4.0.1), the system has a
version of strlcpy, but it seems to suck :-(

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
NONE: 480298 us
MEMCPY: 7857291 us
STRNCPY: 10485948 us
STRLCPY: 16745154 us
U_STRLCPY: 18337286 us
S_STRLCPY: 20920213 us
LENCPY: 22878114 us

$ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
NONE: 480269 us
MEMCPY: 1858974 us
STRNCPY: 5405618 us
STRLCPY: 16364452 us
U_STRLCPY: 16439753 us
S_STRLCPY: 19134538 us
LENCPY: 22873141 us

It's interesting that the unrolled version is actually slower here. I
didn't dig into the assembly code, but maybe Apple's compiler isn't
doing a very good job with it?

Anyway, these results make me less excited about the unrolled version.

In any case, I don't think we should put too much emphasis on the
long-source-string case. In essentially all cases, the true source
string length will be much shorter than the target buffer (were this not
so, we'd probably be needing to make the buffer bigger), and strlcpy
will certainly beat out strncpy in those situations. The memcpy numbers
look attractive, but they ignore the problem that in practice we usually
don't know the source string length in advance --- so I don't think
those represent something achievable.

One thing that seems real clear is that the LENCPY method loses across
the board, which surprised me, but it's hard to argue with numbers.

I'm still interested to experiment with MemSet-then-strlcpy for
namestrcpy, but given the LENCPY results this may be a loser too.

			regards, tom lane
"Strong, David" <david.strong@unisys.com> writes: > Obviously, different copy mechanisms suit different data sizes. So, I > added a little debug to the strlcpy () function that was added to > Postgres the other day. I ran a test against Postgres for ~15 minutes > that used 2 client backends and the BG writer - 8330804 calls to > strlcpy () were generated by the test. > Out of the 8330804 calls, 6226616 calls used a maximum copy size of > 2213 bytes e.g. strlcpy (dest, src, 2213) and 2104074 calls used a > maximum copy size of 64 bytes. > I know the 2213 size calls come from the set_ps_display () function. I > don't know where the 64 size calls come from, yet. Prepared-statement and portal hashtable lookups, likely. Were your clients using V3 extended query protocol? regards, tom lane
Tom,

Yes, the clients are using the V3 protocol and prepared statements.

David
Tom Lane wrote: > Neil Conway <neilc@samurai.com> writes: > > A wholesale replacement of strncpy() calls is probably worth doing -- > > replacing them with strlcpy() if the source string is NUL-terminated, > > and I suppose memcpy() otherwise. > > What I'd like to do immediately is put in strlcpy() and hit the two or > three places I think are performance-relevant. I agree with trying to > get rid of StrNCpy/strncpy calls over the long run, but it's just code > beautification and probably not appropriate for beta. Added to TODO: * Use strlcpy() rather than our StrNCpy() macro -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > You'll notice that it iterates once per char. Between that and the > > strlen() call in Tom's version, not sure which is the lesser evil. > > Yeah, I was wondering that too. My code would require two scans of the > source string (one inside strlen and one in memcpy), but in much of our > usage the source and dest should be reasonably well aligned and one > could expect memcpy to be using word rather than byte operations, so you > might possibly make it back on the strength of fewer write cycles. And > on the third hand, for short source strings none of this matters and > the extra function call involved for strlen/memcpy probably dominates. > > I'm happy to just use the OpenBSD version as a src/port module. > Any objections? I found this URL about the function history of strlcpy(): http://www.gratisoft.us/todd/papers/strlcpy.html I added the URL to port/strlcpy.c. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom, FWIW, Tom Daly did some SpecJAppserver runs on the latest snapshot and didn't show any reduction in text parsing overhead. Unfortunately, he's gone on vacation now so I can't get details. I'm going to try to set up some tests using TPCE to see if it's affected. --Josh
> I'm still interested to experiment with MemSet-then-strlcpy > for namestrcpy, but given the LENCPY results this may be a loser too. Um, why not strlcpy then MemSet the rest ? Andreas
On Tue, Oct 03, 2006 at 10:24:10AM +0200, Zeugswetter Andreas DCP SD wrote:
> > I'm still interested to experiment with MemSet-then-strlcpy
> > for namestrcpy, but given the LENCPY results this may be a loser too.
> Um, why not strlcpy then MemSet the rest ?

That's what strncpy() is supposed to be doing.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
> > > I'm still interested to experiment with MemSet-then-strlcpy for > > > namestrcpy, but given the LENCPY results this may be a loser too. > > Um, why not strlcpy then MemSet the rest ? > > That's what strncpy() is supposed to be doing. Yes, but it obviously does not in some ports, and that was the main problem as I interpreted it. Andreas
On Tue, Oct 03, 2006 at 01:56:57PM +0200, Zeugswetter Andreas DCP SD wrote:
>
> > > > I'm still interested to experiment with MemSet-then-strlcpy for
> > > > namestrcpy, but given the LENCPY results this may be a loser too.
> > > Um, why not strlcpy then MemSet the rest ?
> > That's what strncpy() is supposed to be doing.
> Yes, but it obviously does not in some ports, and that was the main
> problem as I interpreted it.

I think re-writing libc isn't really a PostgreSQL goal.

strlcpy() is theoretically cheaper than strncpy() as it is not required
by any standard to do as much work. memcpy() is cheaper because it can
be more easily parallelized, and can increment multiple bytes with each
loop interval, as it does not need to slow down to look for a '\0' in
each byte. strlcpy() then memset() the remaining is strncpy().

If you are suggesting that a moderately tuned version of strncpy() be
used in the ports directory, this is an option, but it can't be faster
than strlcpy(). It's not possible. It would come down to whether there
was a security requirement that the last bytes were '\0' or not. I
haven't seen anybody mention this as a requirement.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bind them... http://mark.mielke.cc/
"Zeugswetter Andreas DCP SD" <ZeugswetterA@spardat.at> writes: >> I'm still interested to experiment with MemSet-then-strlcpy >> for namestrcpy, but given the LENCPY results this may be a loser too. > Um, why not strlcpy then MemSet the rest ? Two reasons: * The main point is to do the zeroing using word-wide operations, and if you do the above then memset will probably be facing a zeroing request that is neither word-aligned nor word-length. It may be able to recover (doing it partly byte-wide and partly word-wide), but this will easily eat up the time savings of skipping the first couple words. * On compilers that treat memset as a builtin, there are significant advantages to doing memset with a constant length: the compiler might be able to unroll the loop entirely. (I was annoyed to find that FC5's gcc on x86_64 seems to understand very well how to inline a constant length memcpy, but not memset :-(.) I did actually do some experiments with the above yesterday, and found that it was a significant win on an old x86 (with about a 10-byte source string) but about a wash on newer architectures. regards, tom lane
mark@mark.mielke.cc writes: > ... It would come down to > whether there was a security requirements that the last bytes > were '\0' or not. I haven't seen anybody mention this as a > requirement. I think it is a requirement for namestrcpy (because the result might end up on disk), but not elsewhere. regards, tom lane
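[A minimal sketch of the two orderings being weighed in this sub-thread,
with made-up helper names, NAMEDATALEN written out as 64, and a strlcpy()
declaration assumed to be in scope.  Both variants still zero the entire
buffer, which satisfies the on-disk padding requirement just mentioned;
the real namestrcpy() in the backend is the authoritative version, and
this only illustrates why the memset-first form hands memset a constant,
word-aligned length:]

#include <string.h>

#define NAMEDATALEN 64

size_t		strlcpy(char *dst, const char *src, size_t siz);

/* Zero first: memset sees a constant, word-aligned length (inlinable) */
static void
zero_then_copy(char *name, const char *str)
{
	memset(name, 0, NAMEDATALEN);
	strlcpy(name, str, NAMEDATALEN);
}

/* Copy first: memset gets a variable length starting at an odd offset */
static void
copy_then_zero(char *name, const char *str)
{
	size_t		len = strlcpy(name, str, NAMEDATALEN);

	if (len >= NAMEDATALEN)
		len = NAMEDATALEN - 1;
	memset(name + len + 1, 0, NAMEDATALEN - len - 1);
}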
We were just analyzing some more OProfile and ltrace data against
Postgres 8.2Beta1 and we noticed a number of calls as follows:

strlen("postgres: tpc tpc 192.168.1.200("...) = 58
memset(0xbffff6b2, '\000', 2344) = 0xbffff6b2

We have tracked this down to the following code in the set_ps_display ()
function:

#ifdef PS_USE_CLOBBER_ARGV
	{
		int		buflen;

		/* pad unused memory */
		buflen = strlen(ps_buffer);
		MemSet(ps_buffer + buflen, PS_PADDING, ps_buffer_size - buflen);
	}
#endif   /* PS_USE_CLOBBER_ARGV */

If set_ps_display () moves to use the strlcpy () function call, this
code might be redundant. Even if the StrNCpy () call is kept, this code
may still be redundant as StrNCpy () will zero fill the ps_buffer. A
MemSet () call on the ps_buffer has to be added to the init_ps_display ()
function, if this code is removed, to clear the buffer before use.

David
>>>>> "ZA" == Zeugswetter Andreas DCP SD <ZeugswetterA@spardat.at> writes: ZA> Yes, but it obviously does not in some ports, and that was the ZA> main problem as I interpreted it. strncpy is part of POSIX; I highly doubt anyone gets it wrong. Getting sane semantics from it does require manually writing null to the last location in the buffer. If you need the null byte padding feature of strncpy, it seems a bit silly to replace it with strlcpy. Particularly if the only reason is performance. Down that road lies pglibc. /Benny
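[The idiom being referred to -- strncpy() alone does not guarantee
termination when the source is too long, so the last byte is forced to
NUL by hand -- looks like this; the function name and fixed 64-byte size
are only for illustration:]

#include <string.h>

static void
copy_name_padded(char dst[64], const char *src)
{
	strncpy(dst, src, 64);		/* copies and zero-pads to 64 bytes */
	dst[63] = '\0';				/* guarantee termination even on truncation */
}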
This is from October 2006. Is there a TODO here?

---------------------------------------------------------------------------

Tom Lane wrote:
> I did a couple more tests using x86 architectures. On a rather old
> Pentium-4 machine running Fedora 5 (gcc 4.1.1, glibc-2.4-11):
>
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c
> NONE: 786305 us
> MEMCPY: 9887843 us
> STRNCPY: 15045780 us
> STRLCPY: 17555563 us
> U_STRLCPY: 14994639 us
> LENCPY: 19700346 us
>
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c
> NONE: 562001 us
> MEMCPY: 2026546 us
> STRNCPY: 11149134 us
> STRLCPY: 13747827 us
> U_STRLCPY: 12467527 us
> LENCPY: 17672899 us
>
> (STRLCPY is our CVS HEAD code, U_STRLCPY is the unrolled version)
>
> On a Mac Mini (Intel Core Duo, OS X 10.4.8, gcc 4.0.1), the system has a
> version of strlcpy, but it seems to suck :-(
>
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN="(1024*1024)" -o x x.c y.c strlcpy.c ; ./x
> NONE: 480298 us
> MEMCPY: 7857291 us
> STRNCPY: 10485948 us
> STRLCPY: 16745154 us
> U_STRLCPY: 18337286 us
> S_STRLCPY: 20920213 us
> LENCPY: 22878114 us
>
> $ gcc -O3 -std=c99 -DSTRING='"This is a very long sentence that is expected to be very slow."' -DN=1 -o x x.c y.c strlcpy.c ; ./x
> NONE: 480269 us
> MEMCPY: 1858974 us
> STRNCPY: 5405618 us
> STRLCPY: 16364452 us
> U_STRLCPY: 16439753 us
> S_STRLCPY: 19134538 us
> LENCPY: 22873141 us
>
> It's interesting that the unrolled version is actually slower here.
> I didn't dig into the assembly code, but maybe Apple's compiler isn't
> doing a very good job with it?
>
> Anyway, these results make me less excited about the unrolled version.
>
> In any case, I don't think we should put too much emphasis on the
> long-source-string case. In essentially all cases, the true source
> string length will be much shorter than the target buffer (were this
> not so, we'd probably be needing to make the buffer bigger), and strlcpy
> will certainly beat out strncpy in those situations. The memcpy numbers
> look attractive, but they ignore the problem that in practice we usually
> don't know the source string length in advance --- so I don't think
> those represent something achievable.
>
> One thing that seems real clear is that the LENCPY method loses across
> the board, which surprised me, but it's hard to argue with numbers.
>
> I'm still interested to experiment with MemSet-then-strlcpy for
> namestrcpy, but given the LENCPY results this may be a loser too.
>
> 			regards, tom lane

-- 
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > This is from October 2006. Is there a TODO here? I think we had decided that the code that's in there is fine. regards, tom lane