Home > mailing lists

Re: CopyReadLineText optimization - Mailing list pgsql-patches

From	Heikki Linnakangas
Subject	Re: CopyReadLineText optimization
Date	March 11, 2008 07:26:06
Msg-id	47D65E0E.8060109@enterprisedb.com Whole thread Raw
In response to	Re: CopyReadLineText optimization (Andrew Dunstan <andrew@dunslane.net>)
List	pgsql-patches

Tree view

Andrew Dunstan wrote:
> Heikki Linnakangas wrote:
>> It looks like strpbrk() performs poorly:
>
> Yes, not surprising. I just looked at the implementation in glibc, which
> I assume you are using, and it seemed rather basic. The one in NetBSD's
> libc looks much more efficient.
>
> See
>
>
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc

There's architecture-specific optimized versions of that in glibc,
though. For example, for x86_64:

http://sourceware.org/cgi-bin/cvsweb.cgi/~checkout~/libc/sysdeps/x86_64/strspn.S?rev=1.4&cvsroot=glibc

> Not that what you've done isn't good, if a little obscure (as is the
> NetBSD implementation)

Yeah, it's a lot of code for such a simple thing :-(. On the other hand,
it's very isolated, so it's not that bad.

We chatted about this with Greg Stark yesterday, and we came up with a
few more observations and ideas:

1. CopyReadLineText is all about finding the the next end of line;
splitting to fields is done later. We therefore only care about quotes
and escapes when they affect the end of line detection. In text mode, we
  only need to care about a backslash that precedes a LF/CR. Therefore,
we could search for the next LF/CR character with memchr(), and check if
the preceding character is a backslash (and if it is, check if there's
yet another backslash behind it, and so forth until we hit a
non-backslash character).

2. The manual has this to say about escaping newlines with a backslash:

"It is strongly recommended that applications generating COPY data
convert data newlines and carriage returns to the \n and \r sequences
respectively. At present it is possible to represent a data carriage
return by a backslash and carriage return, and to represent a data
newline by a backslash and newline. However, these representations might
not be accepted in future releases."

If we disallowed the backslash+LF sequence, like that paragraph suggests
we could, we wouldn't need to care about backslashes in CopyReadLineText
in text mode at all.

3. If we did the client->server encoding conversion before
CopyReadLineText, one input block at a time:

- We could skip calling pg_encoding_mblen for each character, when the
client encoding is one not supported as a server encoding. That would be
a big performance gain when that's the case, and would simplify the code
in CopyReadLineText a little bit as well.

- The conversion might be faster, as it could operate on bigger blocks

- We could merge the CopyReadLine and CopyReadAttributes functions, and
split the input block into fields in one loop. I would imagine that
being more efficient than scanning the input twice, no matter how fast
we make those scans.

There's a small difficulty with doing the conversion one input block at
a time: there's no suitable interface in the conversion routines for
that. The input block boundary can be between 1st and 2nd byte of a
multi-byte character, so we'd need to have a function that converts a
block of bytes, ignoring any incomplete bytes at the end.

I think I'll experiment with the 1st idea, to see how much code
restructuring it needs and what the performance is like.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

pgsql-patches by date:

From: "Zeugswetter Andreas OSB SD"
Date: 11 March 2008, 05:51:35
Subject: Re: [HACKERS] Fix for large file support (nonsegment mode support)

From: Zoltan Boszormenyi
Date: 11 March 2008, 07:40:59
Subject: Fix HAVE_LONG[_LONG]_INT_64 to really define to 1

Re: CopyReadLineText optimization - Mailing list pgsql-patches

Previous

Next