Home > mailing lists

Re: CopyReadLineText optimization revisited - Mailing list pgsql-hackers

From	Bruce Momjian
Subject	Re: CopyReadLineText optimization revisited
Date	February 18, 2011 22:39:03
Msg-id	201102190235.p1J2ZAN03163@momjian.us Whole thread Raw
In response to	CopyReadLineText optimization revisited (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses	Re: CopyReadLineText optimization revisited
List	pgsql-hackers

Tree view

Was this implemented?  Is it a TODO?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> I'm reviving the effort I started a while back to make COPY faster:
> 
> http://archives.postgresql.org/pgsql-patches/2008-02/msg00100.php
> http://archives.postgresql.org/pgsql-patches/2008-03/msg00015.php
> 
> The patch I now have is based on using memchr() to search end-of-line. 
> In a nutshell:
> 
> * we perform possible encoding conversion early, one input block at a 
> time, rather than after splitting the input into lines. This allows us 
> to assume in the later stages that the data is in server encoding, 
> allowing us to search for the '\n' byte without worrying about 
> multi-byte characters.
> 
> * instead of the byte-at-a-time loop in CopyReadLineText(), use memchr() 
> to find the next NL/CR character. This is where the speedup comes from. 
> Unfortunately we can't do that in the CSV codepath, because newlines can 
> be embedded in quoted, so that's unchanged.
> 
> These changes seem to give an overall speedup of between 0-10%, 
> depending on the shape of the table. I tested various tables from the 
> TPC-H schema, and a narrow table consisting of just one short text column.
> 
> I can't think of a case where these changes would be a net loss in 
> performance, and it didn't perform worse on any of the cases I tested 
> either.
> 
> There's a small fly in the ointment: the patch won't recognize backslash 
> followed by a linefeed as an escaped linefeed. I think we should simply 
> drop support for that. The docs already say:
> 
> > It is strongly recommended that applications generating COPY data convert data newlines and carriage returns to the
\nand \r sequences respectively. At present it is possible to represent a data carriage return by a backslash and
carriagereturn, and to represent a data newline by a backslash and newline. However, these representations might not be
acceptedin future releases. They are also highly vulnerable to corruption if the COPY file is transferred across
differentmachines (for example, from Unix to Windows or vice versa).
 
> 
> I vaguely recall that we discussed this some time ago already and agreed 
> that we can drop it if it makes life easier.
> 
> This patch is in pretty good shape, however it needs to be tested with 
> different exotic input formats. Also, the loop in CopyReadLineText could 
> probaby be cleaned up a bit, some of the uglifications that were done 
> for performance reasons in the old code are no longer necessary, as 
> memchr() is doing the heavy-lifting and the loop only iterates 1-2 times 
> per line in typical cases.
> 
> 
> It's not strictly necessary, but how about dropping support for the old 
> COPY protocol, and the EOF marker \. while we're at it? It would allow 
> us to drop some code, making the remaining code simpler, and reduce the 
> testing effort. Thoughts on that?
> 
> -- 
>    Heikki Linnakangas
>    EnterpriseDB   http://www.enterprisedb.com

[ Attachment, skipping... ]

> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +

pgsql-hackers by date:

From: Robert Haas
Date: 18 February 2011, 21:45:57
Subject: Re: Sync Rep v17

From: Bruce Momjian
Date: 18 February 2011, 22:39:35
Subject: Re: Sync Rep v17

Re: CopyReadLineText optimization revisited - Mailing list pgsql-hackers

Previous

Next