Re: COPY fast parse patch - Mailing list pgsql-patches

From Andrew Dunstan
Subject Re: COPY fast parse patch
Msg-id 18637.203.26.206.129.1117678347.squirrel@www.dunslane.net
In response to Re: COPY fast parse patch  (Neil Conway <neilc@samurai.com>)
Responses Re: COPY fast parse patch  ("Luke Lonergan" <llonergan@greenplum.com>)
List pgsql-patches
Neil Conway said:
> On Wed, 2005-06-01 at 16:34 -0700, Alon Goldshuv wrote:
>> 1) The patch includes 2 parallel parsing code paths. One is the
>> regular COPY path that we all know, and the other is the improved one
>> that I wrote. This is only temporary, as there is a lot of code
>> duplication
>
> Right; I really dislike the idea of having two separate code paths for
> COPY. When you say this approach is "temporary", are you suggesting
> that you intend to reimplement your changes as
> improvements/replacements of the existing COPY code path rather than as
> a parallel code path?
>

It's not an all-or-nothing deal. When we put in CSV handling, we introduced
two new routines for attribute input/output and otherwise used the rest of
the COPY code. When I did a fix for the multiline problem, it was originally
done with a separate read-line function for CSV mode; Bruce didn't like
that, so I merged it back into the existing code. In retrospect, given this
discussion, that might not have been an optimal choice. But the point is
that you can break out at several levels.

Incidentally, there might be a good case for allowing the user to set the
line end explicitly, but you can't just hardwire it; we would get massive
breakage on Windows. What's more, in CSV mode line-end sequences can occur
within logical lines, so you need to take that into account. It's tricky and
easy to get badly wrong.

I will be the first to admit that there are probably some very good
possibilities for optimisation of this code. My impression though has been
that in almost all cases it's fast enough anyway. I know that on some very
modest hardware I have managed to load a six-million-row TPC line-items
table in just a few minutes. Before we start getting too hung up on this,
I'd be interested to know just how much data people want to load and how
fast they want it to be.
If people have massive data loads that take hours, days or weeks then it's
obviously worth improving if we can. I'm curious to know what size datasets
people are really handling this way.

cheers

andrew


