Thread: COPY fast parse patch

COPY fast parse patch

From: "Alon Goldshuv"
Here is the patch I was talking about in my message to the "NOLOGGING
option, or ?" thread. I would like to highlight some important points
about it:

1) The patch includes 2 parallel parsing code paths. One is the regular COPY
path that we all know, and the other is the improved one that I wrote. This
is only temporary, as there is a lot of code duplication, but I left it
this way for several reasons:

- For now, the improved path supports only ASCII delimited text format,
and only when the client and server encodings are identical. If these
conditions are not met when running COPY, the old path is taken. In other
words, under no circumstances will a "not supported" error be raised.

- Having both code paths allows easy performance comparison between the
two. To run the regular COPY parsing, call CopyFrom() from DoCopy(); to
run the improved parsing, call FastCopyFrom() from DoCopy(). Right now
FastCopyFrom() is called automatically whenever the conditions explained
in the previous point are met, but this is easy to change (a recompile is
needed); see the sketch after the note below.

* NOTE: the Fast*() function names, ugly as they are, are there to
emphasize the differences between the old and the improved paths (e.g.
CopyReadLine() vs. FastReadLine()). They are not intended to stay this
way. This is not elegant (yet)!
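
Here is the sketch of the dispatch I mentioned above. It is illustrative
only -- the flag and encoding parameters and the CopyState declarations
are stand-ins, not the patch's actual code; only DoCopy(), CopyFrom() and
FastCopyFrom() are real names from the patch:

    typedef struct CopyState CopyState;   /* stand-in declaration */

    extern void CopyFrom(CopyState *cstate);      /* regular path */
    extern void FastCopyFrom(CopyState *cstate);  /* improved path */

    /*
     * Sketch of the choice made inside DoCopy(): take the improved
     * path only for delimited text with identical client and server
     * encodings; otherwise fall back silently, so a "not supported"
     * error is never raised.
     */
    static void
    choose_copy_path(CopyState *cstate, int binary, int csv_mode,
                     int client_enc, int server_enc)
    {
        if (!binary && !csv_mode && client_enc == server_enc)
            FastCopyFrom(cstate);
        else
            CopyFrom(cstate);
    }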

2) There are some utilities, such as bytebuffer and strchrlen, at the
bottom of the file. This is probably not the right home for them, but for
now, to keep things simple, they are included in copy.c.
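
To give an idea of what I mean by strchrlen, here is an illustrative
sketch (not the actual code from the patch): a strchr() variant that also
reports how many bytes it scanned, so the caller gets the segment length
without a second strlen() pass.

    #include <stddef.h>

    /* Illustration only -- not the patch's strchrlen(). */
    static char *
    strchrlen_sketch(const char *s, char c, size_t *len)
    {
        const char *p = s;

        /* scan until the target character or the terminator */
        while (*p != '\0' && *p != c)
            p++;

        *len = (size_t) (p - s);                /* bytes scanned */
        return (*p == c) ? (char *) p : NULL;   /* NULL if not found */
    }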

3) The EOL is assumed to be NL. I raised a point about EOLs in COPY in my
previous thread, which explains this assumption.

4) Performance numbers can be viewed at
http://lists.pgfoundry.org/pipermail/bizgres-general/2005-May/000135.html
Some highlights:
8.7 MB/sec -> 11.8 MB/sec on a 15-column (mixed) table.
12.1 MB/sec -> 21 MB/sec on a 1-column (TEXT) table.

5) Data integrity and escaping improvements. The new path treats every
character as data (unless it is an escaped delimiter or EOL), and
therefore data integrity is preserved. However, some people who are
already used to the postgres COPY escaping rules may want to keep them;
they can do so by simply using the old COPY.
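
To illustrate the rule with an example of my own (not taken from the
patch): with '|' as the delimiter, the line

    a|b\|c|d

parses into the three fields a, b|c and d. The backslash is special only
in front of a delimiter or EOL; every other character is taken as literal
data.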

As part of submitting this patch I also presented an argument for a LOAD
DATA command (in the NOLOGGING option thread). The points I made there are
closely related to this message. There may be a valid argument that most
of the points I raised could be implemented in the COPY code instead of in
a LOAD DATA command, but that requires a great deal of flexibility to add
features to the COPY syntax. That may not be a good idea for some, and it
will also be problematic for backwards compatibility.

Thx,
Alon.



Re: COPY fast parse patch

From: Bruce Momjian
Alon Goldshuv wrote:
> 5) Data integrity and escaping improvements. Treats all characters as data
> (unless it's an escaped delim or EOL) and therefore data
> integrity is preserved. However, some people that already got
> used to the postgres COPY escaping way may want to keep it. They could do so
> by still using the old COPY.

OK, if the COPY file contains:

    \n\||\r\\a|\N

and the delimiter is '|', what does that represent?  What if there were
a literal '\' followed by a '|' in the data table being dumped?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: COPY fast parse patch

From: Neil Conway
On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
> 1) The patch includes 2 parallel parsing code paths. One is the regular COPY
> path that we all know, and the other is the improved one that I wrote. This
> is only temporary, as there is a lot of code duplication

Right; I really dislike the idea of having two separate code paths for
COPY. When you say this approach is "temporary", are you suggesting that
you intend to reimplement your changes as improvements/replacements of
the existing COPY code path rather than as a parallel code path?

> As part of submitting this patch I also presented an argument for a LOAD
> DATA command (in the NOLOGGING option thread). The points I made there
> are closely related to this message. There may be a valid argument that
> most of the points I raised could be implemented in the COPY code instead
> of in a LOAD DATA command

I'm definitely not keen to see a new "LOAD DATA" command.

> But that may not be a good idea for some, and it will also be problematic
> for backwards compatibility.

In what way would the performance improvements to COPY be backward
incompatible with the existing COPY behaviour?

-Neil



Re: COPY fast parse patch

From: "Andrew Dunstan"
Neil Conway said:
> On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
>> 1) The patch includes 2 parallel parsing code paths. One is the
>> regular COPY path that we all know, and the other is the improved one
>> that I wrote. This is only temporary, as there is a lot of code
>> duplication
>
> Right; I really dislike the idea of having two separate code paths for
> COPY. When you say this approach is "temporary", are you suggesting
> that you intend to reimplement your changes as
> improvements/replacements of the existing COPY code path rather than as
> a parallel code path?
>

It's not an all-or-nothing deal. When we put in CSV handling, we introduced
two new routines for attribute input/output and otherwise used the rest of
the COPY code. When I did a fix for the multiline problem, it was originally
done with a separate read-line function for CSV mode - Bruce didn't like
that, so I merged it back into the existing code. In retrospect, given this
discussion, that might not have been an optimal choice. But the point is
that you can break out at several levels.

Incidentally, there might be a good case for allowing the user to set the
line end explicitly, but you can't just hardwire it - we will get massive
Windows breakage. What is more, in CSV mode line end sequences can occur
within logical lines. You need to take that into account. It's tricky and
easy to get badly wrong.
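
To show what I mean, here is a minimal sketch (mine, not PostgreSQL's
actual code) of a quote-aware scan for the end of a logical CSV line; a
scanner that just looks for '\n' will split a quoted multiline field in
two:

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Illustrative sketch only.  Return the index just past the end of
     * one logical CSV line, treating newlines inside quoted fields as
     * data; returns len if no line end is found.
     */
    static size_t
    csv_logical_line_end(const char *buf, size_t len, char quote)
    {
        bool   in_quotes = false;
        size_t i;

        for (i = 0; i < len; i++)
        {
            if (buf[i] == quote)
                in_quotes = !in_quotes;  /* a doubled quote flips twice */
            else if (buf[i] == '\n' && !in_quotes)
                return i + 1;
        }
        return len;
    }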

I will be the first to admit that there are probably some very good
possibilities for optimisation of this code. My impression though has been
that in almost all cases it's fast enough anyway. I know that on some very
modest hardware I have managed to load a 6m row TPC line-items table in just
a few minutes. Before we start getting too hung up, I'd be interested to
know just how much data people want to load and how fast they want it to be.
If people have massive data loads that take hours, days or weeks then it's
obviously worth improving if we can. I'm curious to know what size datasets
people are really handling this way.

cheers

andrew



Re: COPY fast parse patch

From: "Luke Lonergan"
Andrew,

> I will be the first to admit that there are probably some very good
> possibilities for optimisation of this code. My impression though has been
> that in almost all cases it's fast enough anyway. I know that on some very
> modest hardware I have managed to load a 6m row TPC line-items table in just
> a few minutes. Before we start getting too hung up, I'd be interested to
> know just how much data people want to load and how fast they want it to be.
> If people have massive data loads that take hours, days or weeks then it's
> obviously worth improving if we can. I'm curious to know what size datasets
> people are really handling this way.

x0+ GB files are common in data warehousing.  The issue is often "can we
load our data within the time allotted for the batch window", usually a
matter of an hour or two.

Assuming that TPC lineitem is 140 bytes/row, 6M rows in 3 minutes is 4.7
MB/s.  To load a 10GB file at that rate takes about 2/3 of an hour.  If one
were to restore a 300GB database, it would take 18 hours.  Maintenance
operations are impractical after a few hours; 18 is a non-starter.
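
Spelling the arithmetic out:

    140 bytes/row * 6,000,000 rows  ~= 840 MB
    840 MB / 180 sec                ~= 4.7 MB/s
    10 GB  / 4.7 MB/s               ~= 36 minutes
    300 GB / 4.7 MB/s               ~= 18 hours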

In practice, we're usually replacing an Oracle system with PostgreSQL, and
the load speed difference between the two is currently embarrassing and
makes the work impractical.

- Luke



Re: COPY fast parse patch

From: "Alon Goldshuv"
Neil,

Right; I really dislike the idea of having two separate code paths for
COPY. When you say this approach is "temporary", are you suggesting that
you intend to reimplement your changes as improvements/replacements of
the existing COPY code path rather than as a parallel code path?

====
My thoughts were -- see what the responses are, and whether people think
that this is a better way to go than replacing the COPY parsing logic
with the new one. The whole escaping discussion that is going on is
something else; escapes could be implemented either way. The important
thing I am trying to show is that there is a much faster way to parse the
data than doing a char-by-char fetch-examine-load.
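
A crude illustration of the difference (my sketch, not the patch's code):
instead of fetching and examining one byte per loop iteration, let
memchr() scan a whole buffer for the next delimiter in one call:

    #include <stddef.h>
    #include <string.h>

    /* char-by-char: fetch, examine, load -- one byte per iteration */
    static size_t
    field_len_slow(const char *buf, size_t len, char delim)
    {
        size_t i;

        for (i = 0; i < len; i++)
            if (buf[i] == delim)
                break;
        return i;
    }

    /* block-oriented: memchr() burns through the buffer in one call */
    static size_t
    field_len_fast(const char *buf, size_t len, char delim)
    {
        const char *p = memchr(buf, delim, len);

        return p ? (size_t) (p - buf) : len;
    }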


> As part of submitting this patch I also presented an argument for a LOAD
> DATA command (in the NOLOGGING option thread). The points I made there
> are closely related to this message. There may be a valid argument that
> most of the points I raised could be implemented in the COPY code instead
> of in a LOAD DATA command

I'm definitely not keen to see a new "LOAD DATA" command.

====
It seems that most people aren't :-) I can see valid arguments both for having it and for not having it.



> But that may not be a good idea for some, and it will also be problematic
> for backwards compatibility.

In what way would the performance improvements to COPY be backward
incompatible with the existing COPY behaviour?

====
That comment was with respect to the escape logic. You can regard it as
irrelevant for now, while the escape discussion goes on in parallel.

Alon.



Re: COPY fast parse patch

From: Neil Conway
On Thu, 2005-06-02 at 00:53 -0400, Alon Goldshuv wrote:
> My thoughts were -- see what the responses are, and whether people think
> that this is a better way to go than replacing the COPY parsing logic
> with the new one.

If the new approach to parsing can handle all the cases that the old
approach can handle (WRT CSV, binary, encoding conversion and the like)
and is faster to boot, I don't see a reason not to replace the current
approach with something derived from your patch.

-Neil



Re: COPY fast parse patch

From: "Andrew Dunstan"
Luke Lonergan said:
> Andrew,
>
>> I will be the first to admit that there are probably some very good
>> possibilities for optimisation of this code. My impression though has
>> been that in almost all cases it's fast enough anyway. I know that on
>> some very modest hardware I have managed to load a 6m row TPC
>> line-items table in just a few minutes. Before we start getting too
>> hung up, I'd be interested to know just how much data people want to
>> load and how fast they want it to be. If people have massive data
>> loads that take hours, days or weeks then it's obviously worth
>> improving if we can. I'm curious to know what size datasets people are
>> really handling this way.
>
> x0+ GB files are common in data warehousing.  The issue is often "can
> we load our data within the time allotted for the batch window",
> usually a matter of an hour or two.
>
> Assuming that TPC lineitem is 140 bytes/row, 6M rows in 3 minutes is 4.7
> MB/s.  To load a 10GB file at that rate takes about 2/3 of an hour.  If
> one were to restore a 300GB database, it would take 18 hours.  Maintenance
> operations are impractical after a few hours; 18 is a non-starter.
>
> In practice, we're usually replacing an Oracle system with PostgreSQL,
> and the load speed difference between the two is currently embarrassing
> and makes the work impractical.
>


OK ... that seems fair enough. The next question is: where does the data
being loaded come from? pg_dump? How does load speed compare with using
COPY's binary mode?

cheers

andrew



Re: COPY fast parse patch

From: "Luke Lonergan"
Andrew,

> OK ... that seems fair enough. The next question is: where does the data
> being loaded come from? pg_dump? How does load speed compare with using
> COPY's binary mode?

Oddly, our tests in the past have shown that binary is actually slower.

Luke



Re: COPY fast parse patch

From: "Luke Lonergan"
FYI - we're prepping a proposal for just that.  Stay tuned for Alon's
proposal.

Luke


On 6/1/05 11:43 PM, "Neil Conway" <neilc@samurai.com> wrote:

> On Thu, 2005-06-02 at 00:53 -0400, Alon Goldshuv wrote:
>> My thoughts were -- see what the responses are, and whether people think
>> that this is a better way to go than replacing the COPY parsing logic
>> with the new one.
>
> If the new approach to parsing can handle all the cases that the old
> approach can handle (WRT CSV, binary, encoding conversion and the like)
> and is faster to boot, I don't see a reason not to replace the current
> approach with something derived from your patch.
>
> -Neil