Re: Ragged CSV import - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Ragged CSV import
Date:
Msg-id: 603c8f070909100656p4d6c8828wbfbacff8ffa5f89d@mail.gmail.com
In response to: Re: Ragged CSV import (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Ragged CSV import
List: pgsql-hackers
On Wed, Sep 9, 2009 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The thought that comes to mind for me is something "in front of" copy,
>>> that is, give it the text of each line and let it do a text-to-text
>>> transformation before COPY chews on it.
>
>> That seems to me to be a whole lot less useful.  As I see it, the
>> whole point of any enhancement in this area is to reuse the parsing
>> code.   If I have a CSV file (or some other format COPY understands),
>> I don't want to have to write my own parser for that format in order
>> to do some simple data transformation (like dropping columns >n).
>
> I'm unconvinced.  A significant fraction of the COPY customization
> requests that I can remember had to do with cleaning up input that
> didn't entirely match COPY's idea of what valid CSV is.  If it's close
> enough that COPY will successfully parse it as N columns, you can stick
> it into an N-column text table and do your transformations afterwards.
>
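
Right -- for anyone following along at home, that trick looks like
this, with staging, target, and the file path as throwaway names; the
all-text table absorbs whatever COPY parses, and the cleanup happens
in plain SQL afterwards:

    CREATE TEMP TABLE staging (c1 text, c2 text, c3 text);
    COPY staging FROM '/tmp/input.csv' CSV;
    -- keep only the first two columns, casting and trimming as we go
    INSERT INTO target (id, name)
    SELECT c1::integer, trim(c2) FROM staging;
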
> Thinking a bit more generally, it seems to me that as COPY IN is
> currently constituted, there are really four distinguishable bits
> of functionality in a pipeline:
>
> 1. Reading from FE/BE protocol (or file) and conversion to database
> character set.
>
> 2. Parsing the text into a set of de-escaped field values.
>
> 3. Converting the field values to internal Datum form (requires
> knowledge of a set of target data types).
>
> 4. Bulk insertion into the target table.
>
> It is not hard to come up with problems that are most easily solved
> by injecting some sort of filtering between any pair of these steps.
> You're suggesting a filter between 2 and 3, I suggested one between
> 1 and 2, and the well-known trick of inserting into a temp table
> and then filtering to a final table amounts to a filter between
> 3 and 4.
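
Worth noting that today the nearest thing to a filter between steps 1
and 2 is to do the textual transformation outside the backend
entirely, e.g. piping the file through a client-side filter into
\copy.  File name and column choice here are just illustrative, and
this only works when no delimiters are quoted:

    cut -d, -f1-3 /tmp/ragged.csv | psql -d mydb -c '\copy staging from stdin with csv'
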
>
> We've had proposals before to come up with a form of COPY that can be
> used as a data source in an INSERT/SELECT context.  That would
> automatically provide a way to filter between steps 3 and 4 --- or maybe
> better, insist that the COPY-thingy produces just text fields, and leave
> both steps 3 and 4 to be done by the INSERT/SELECT.  With that, the only
> missing functionality would be a textual filter between steps 1 and 2.

That would be swell.  Being able to use COPY as an expression
returning text[] would, I think, cater to an extremely wide variety of
use cases, although it sounds like Andrew is not interested in
implementing it at this point.  If we then want to add a filter
between steps 1 and 2, that sounds potentially useful as well.
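
To make that concrete, the sort of thing I have in mind -- purely
hypothetical syntax, nothing like this exists today -- is COPY usable
as a row source that yields one text[] per input line:

    -- hypothetical: src has a single text[] column named t
    INSERT INTO target (id, name)
    SELECT t[1]::integer, trim(t[2])
    FROM (COPY FROM '/tmp/input.csv' CSV) AS src(t);

That would leave steps 3 and 4 to ordinary SQL, as you suggest.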

I am fuzzy on the implementation details of making COPY act as a data
source for INSERT/SELECT, though.  I had thought to make EXPLAIN a data
source, but it turned out not to be possible (as far as I could tell)
without making EXPLAIN a fully-reserved word, which you vetoed.  It
seems likely that COPY will present similar issues, though I haven't
tried.

I am also wondering what happens when someone embeds multiple COPY
statements in a single query, or sticks one inside a CTE or on the
inner side of a left join.  I am inclined to think that we should
cause COPY to materialize its input (so that rescans return the data
already read) and throw an error if more than one COPY appears in
the same query (unless you have a better idea?).
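
To make the rescan hazard concrete, using the same hypothetical
syntax as above:

    SELECT *
    FROM big_table b
    LEFT JOIN (COPY FROM '/tmp/input.csv' CSV) AS src(t)
           ON b.key = t[1];
    -- a nestloop would rescan src once per outer row, but the
    -- underlying protocol stream can only be read once -- hence
    -- the need to materialize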

...Robert

