Re: CopyReadLineText optimization - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Re: CopyReadLineText optimization
Date
Msg-id 47D65E0E.8060109@enterprisedb.com
Whole thread Raw
In response to Re: CopyReadLineText optimization  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-patches
Andrew Dunstan wrote:
> Heikki Linnakangas wrote:
>> It looks like strpbrk() performs poorly:
>
> Yes, not surprising. I just looked at the implementation in glibc, which
> I assume you are using, and it seemed rather basic. The one in NetBSD's
> libc looks much more efficient.
>
> See
>
>
http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc


There's architecture-specific optimized versions of that in glibc,
though. For example, for x86_64:

http://sourceware.org/cgi-bin/cvsweb.cgi/~checkout~/libc/sysdeps/x86_64/strspn.S?rev=1.4&cvsroot=glibc

> Not that what you've done isn't good, if a little obscure (as is the
> NetBSD implementation)

Yeah, it's a lot of code for such a simple thing :-(. On the other hand,
it's very isolated, so it's not that bad.

We chatted about this with Greg Stark yesterday, and we came up with a
few more observations and ideas:

1. CopyReadLineText is all about finding the the next end of line;
splitting to fields is done later. We therefore only care about quotes
and escapes when they affect the end of line detection. In text mode, we
  only need to care about a backslash that precedes a LF/CR. Therefore,
we could search for the next LF/CR character with memchr(), and check if
the preceding character is a backslash (and if it is, check if there's
yet another backslash behind it, and so forth until we hit a
non-backslash character).

2. The manual has this to say about escaping newlines with a backslash:

"It is strongly recommended that applications generating COPY data
convert data newlines and carriage returns to the \n and \r sequences
respectively. At present it is possible to represent a data carriage
return by a backslash and carriage return, and to represent a data
newline by a backslash and newline. However, these representations might
not be accepted in future releases."

If we disallowed the backslash+LF sequence, like that paragraph suggests
we could, we wouldn't need to care about backslashes in CopyReadLineText
in text mode at all.

3. If we did the client->server encoding conversion before
CopyReadLineText, one input block at a time:

- We could skip calling pg_encoding_mblen for each character, when the
client encoding is one not supported as a server encoding. That would be
a big performance gain when that's the case, and would simplify the code
in CopyReadLineText a little bit as well.

- The conversion might be faster, as it could operate on bigger blocks

- We could merge the CopyReadLine and CopyReadAttributes functions, and
split the input block into fields in one loop. I would imagine that
being more efficient than scanning the input twice, no matter how fast
we make those scans.

There's a small difficulty with doing the conversion one input block at
a time: there's no suitable interface in the conversion routines for
that. The input block boundary can be between 1st and 2nd byte of a
multi-byte character, so we'd need to have a function that converts a
block of bytes, ignoring any incomplete bytes at the end.

I think I'll experiment with the 1st idea, to see how much code
restructuring it needs and what the performance is like.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

pgsql-patches by date:

Previous
From: "Zeugswetter Andreas OSB SD"
Date:
Subject: Re: [HACKERS] Fix for large file support (nonsegment mode support)
Next
From: Zoltan Boszormenyi
Date:
Subject: Fix HAVE_LONG[_LONG]_INT_64 to really define to 1