Re: Ragged CSV import - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Ragged CSV import
Date:
Msg-id: 603c8f070909100656p4d6c8828wbfbacff8ffa5f89d@mail.gmail.com
In response to: Re: Ragged CSV import (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Ragged CSV import
List: pgsql-hackers
On Wed, Sep 9, 2009 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The thought that comes to mind for me is something "in front of" copy,
>>> that is, give it the text of each line and let it do a text-to-text
>>> transformation before COPY chews on it.
>
>> That seems to me to be a whole lot less useful.  As I see it, the
>> whole point of any enhancement in this area is to reuse the parsing
>> code.   If I have a CSV file (or some other format COPY understands),
>> I don't want to have to write my own parser for that format in order
>> to do some simple data transformation (like dropping columns >n).
>
> I'm unconvinced.  A significant fraction of the COPY customization
> requests that I can remember had to do with cleaning up input that
> didn't entirely match COPY's idea of what valid CSV is.  If it's close
> enough that COPY will successfully parse it as N columns, you can stick
> it into an N-column text table and do your transformations afterwards.
>
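
Right -- for anyone following along at home, that trick looks like
this, with staging, target, and the file path as throwaway names; the
all-text table absorbs whatever COPY parses, and the cleanup happens
in plain SQL afterwards:

    CREATE TEMP TABLE staging (c1 text, c2 text, c3 text);
    COPY staging FROM '/tmp/input.csv' CSV;
    -- keep only the first two columns, casting and trimming as we go
    INSERT INTO target (id, name)
    SELECT c1::integer, trim(c2) FROM staging;
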
> Thinking a bit more generally, it seems to me that as COPY IN is
> currently constituted, there are really four distinguishable bits
> of functionality in a pipeline:
>
> 1. Reading from FE/BE protocol (or file) and conversion to database
> character set.
>
> 2. Parsing the text into a set of de-escaped field values.
>
> 3. Converting the field values to internal Datum form (requires
> knowledge of a set of target data types).
>
> 4. Bulk insertion into the target table.
>
> It is not hard to come up with problems that are most easily solved
> by injecting some sort of filtering between any pair of these steps.
> You're suggesting a filter between 2 and 3, I suggested one between
> 1 and 2, and the well-known trick of inserting into a temp table
> and then filtering to a final table amounts to a filter between
> 3 and 4.
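
Worth noting that today the nearest thing to a filter between steps 1
and 2 is to do the textual transformation outside the backend
entirely, e.g. piping the file through a client-side filter into
\copy.  File name and column choice here are just illustrative, and
this only works when no delimiters are quoted:

    cut -d, -f1-3 /tmp/ragged.csv | psql -d mydb -c '\copy staging from stdin with csv'
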
>
> We've had proposals before to come up with a form of COPY that can be
> used as a data source in an INSERT/SELECT context.  That would
> automatically provide a way to filter between steps 3 and 4 --- or maybe
> better, insist that the COPY-thingy produces just text fields, and leave
> both steps 3 and 4 to be done by the INSERT/SELECT.  With that, the only
> missing functionality would be a textual filter between steps 1 and 2.

That would be swell.  Being able to use COPY as an expression
returning text[] would, I think, cater to an extremely wide variety of
use cases, although it sounds like Andrew is not interested in
implementing it at this point.  If we then want to add a filter
between steps 1 and 2, that sounds potentially useful as well.
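
To make that concrete, the sort of thing I have in mind -- purely
hypothetical syntax, nothing like this exists today -- is COPY usable
as a row source that yields one text[] per input line:

    -- hypothetical: src has a single text[] column named t
    INSERT INTO target (id, name)
    SELECT t[1]::integer, trim(t[2])
    FROM (COPY FROM '/tmp/input.csv' CSV) AS src(t);

That would leave steps 3 and 4 to ordinary SQL, as you suggest.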

I am fuzzy on the implementation details of making COPY act as a data
source for INSERT/SELECT, though.  I had thought to make EXPLAIN a data
source, but it turned out not to be possible (as far as I could tell)
without making EXPLAIN a fully-reserved word, which you vetoed.  It
seems likely that COPY will present similar issues, though I haven't
tried.

I am also wondering what happens when someone embeds multiple COPY
statements in a single query, or sticks one inside a CTE or on the
inner side of a left join.  I am inclined to think that we should
cause COPY to materialize its input (so that rescans return the data
already read) and throw an error if more than one COPY appears in
the same query (unless you have a better idea?).
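
To make the rescan hazard concrete, using the same hypothetical
syntax as above:

    SELECT *
    FROM big_table b
    LEFT JOIN (COPY FROM '/tmp/input.csv' CSV) AS src(t)
           ON b.key = t[1];
    -- a nestloop would rescan src once per outer row, but the
    -- underlying protocol stream can only be read once -- hence
    -- the need to materialize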

...Robert

