Re: Optimizing COPY - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Optimizing COPY
Msg-id 603c8f070811081745v410d2ff5o4bd107d7fbb440f3@mail.gmail.com
In response to Optimizing COPY  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: Optimizing COPY
List pgsql-hackers
Heikki,

I was assigned as a round-robin reviewer for this patch, but it looks
to me like it is still WIP, so I'm not sure how much effort it's worth
putting in at this point.  Do you plan to finish this for 8.4, and if
so, should I wait for the next version before reviewing further?

Thanks,

...Robert

On Thu, Oct 30, 2008 at 8:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Back in March, I played around with various hacks to improve COPY FROM
> performance:
> http://archives.postgresql.org/pgsql-patches/2008-03/msg00145.php
>
> I got busy with other stuff, but I have now gotten around to trying what I
> planned back then. I don't know if I have the time to finish this for 8.4,
> but I might as well post what I've got.
>
> The basic idea is to replace the custom loop in CopyReadLineText with
> memchr(), because memchr() is very fast. To make that possible, perform the
> client-server encoding conversion on each raw block that we read in, before
> splitting it into lines. That way CopyReadLineText only needs to deal with
> server encodings, which is required for the memchr() to be safe.
>
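
A minimal sketch of the memchr() idea described above, not code from the
patch. It assumes the block has already been converted to the server
encoding, where (as far as I know) no byte of a multibyte character can
equal an ASCII newline, so a plain byte search is safe. The helper names
next_line_len and count_lines are hypothetical, and escapes are ignored.

```c
#include <string.h>

/*
 * Hypothetical helper: return the length of the next line in buf[0..len),
 * including its terminating '\n', or 0 if the buffer does not yet hold a
 * complete line.
 */
static size_t
next_line_len(const char *buf, size_t len)
{
    const char *nl = memchr(buf, '\n', len);

    return nl ? (size_t) (nl - buf) + 1 : 0;
}

/*
 * Count the complete lines in a block.  Assumption: the block is already
 * in the server encoding, so searching for the raw '\n' byte cannot land
 * in the middle of a multibyte character.
 */
static int
count_lines(const char *buf, size_t len)
{
    int     nlines = 0;
    size_t  n;

    while ((n = next_line_len(buf, len)) > 0)
    {
        buf += n;
        len -= n;
        nlines++;
    }
    return nlines;
}
```

A trailing partial line is left uncounted; a real reader would presumably
carry it over and prepend it to the next raw block.
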
> Attached is a quick patch for that. Think of it as a prototype; I haven't
> tested it much, and I feel that it needs some further cleanup. Quick testing
> suggests that it gives 0-20% speedup, depending on the data. Very narrow
> tables don't benefit much, but the wider the table, the bigger the gain. I
> haven't found a case where it performs worse.
>
> I'm only using memchr() with non-csv format at the moment. It could be used
> for CSV as well, but it's more complicated because in CSV mode we need to
> keep track of the escapes.
>
> Some collateral damage:
> \. no longer works. If we only accept \. on a new line, like we already do
> in CSV mode, it shouldn't be hard or expensive to make it work again. The
> manual already suggests that we only accept it on a single line: "End of
> data can be represented by a single line containing just backslash-period
> (\.)."
>
> Escaping a linefeed or carriage return by prepending it with a backslash no
> longer works. You have to use \n and \r. The manual already warns against
> doing that, so I think we could easily drop support for it. If we wanted, it
> wouldn't be very hard to make it work, though: after hitting a newline, scan
> back and count how many backslashes there are before the newline. An odd
> number means that it's an escaped newline, even number (including 0) means
> it's a real newline.
>
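
The scan-back rule in the paragraph above could look roughly like this. It
is only a sketch under stated assumptions: newline_is_escaped is a
hypothetical name, and the scan stops at the start of the buffer, so a run
of backslashes that straddles two raw blocks would need extra handling.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: buf[nl_pos] is a newline (for example, one found by
 * memchr()).  Walk backwards from it and count the consecutive backslashes
 * immediately before it.  An odd count means the newline is escaped; an
 * even count (including zero) means it is a real line terminator.
 */
static bool
newline_is_escaped(const char *buf, size_t nl_pos)
{
    size_t  nbackslashes = 0;

    while (nbackslashes < nl_pos &&
           buf[nl_pos - 1 - nbackslashes] == '\\')
        nbackslashes++;

    return (nbackslashes % 2) == 1;
}
```

The backward scan only runs once a newline has been found, so the common
case of unescaped data would still be a single memchr() pass per line.
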
> For the sake of simplifying the code, would anyone care if we removed
> support for COPY with protocol version 2?
>
> --
>  Heikki Linnakangas
>  EnterpriseDB   http://www.enterprisedb.com
>

