Re: COPY Transform support - Mailing list pgsql-hackers

From Gregory Stark
Subject Re: COPY Transform support
Date
Msg-id 87sky2g8ea.fsf@oxford.xeocode.com
Whole thread Raw
In response to Re: COPY Transform support  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: COPY Transform support  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> Here's a proposal for COPY to support the T part of an ETL, that is adding the 
>> capability for COPY FROM to Transform the data it gets.
>
>> The idea is quite simple: adding to COPY FROM the option to run a function on 
>> the data before to call datatype_in functions.
>
> The major concern I have about this is to ensure that no detectable
> overhead is added to COPY when the feature isn't being used.
>
> I am not actually convinced that the column-by-column design you seem to
> have in mind is worth anything.  The examples that I remember seeing
> often involve removing columns, generating one column from multiple ones
> or vice versa, dealing with nonstandard column delimiters, etc.  What
> would makes sense in my mind is a single function taking and returning
> text, which is invoked once on each complete input line before it is
> broken into fields.

I think not having to deal with separating fields is actually one of the few
reasons to do this within COPY. If you can separate out yourself or need to do
something more clever than COPY is capable of to split the columns then you're
better off preprocessing it with perl or something anyways.

To that end all the other use cases you describe could be handled with his
plan. There's nothing stopping you from doing

CREATE READER foo (a integer, b integer)
INSERT INTO b (SELECT a+b FROM foo);
or
INSERT INTO b (SELECT 1, a, b, greatest(a,b) FROM foo)

However I'm not sure we even need new syntax for CREATE READER. I would think
something like this would make more sense:

CREATE FUNCTION transform(integer, integer) RETURNS SETOF b;

COPY b FROM 'foo' USING transform(integer,integer);

> So the whole thing seems just marginally attractive to me.

Everything about ETL is only marginally attractive, but it's something people
spend a lot of time doing. Nobody's come up with any particularly clean
solutions I think.

AFAIK the state of the art is actually to load the data into a table which
closely matches the source material, sometimes just columns of text. Then copy
it all to another table doing transformations. Not impressed.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!


pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: [GENERAL] SHA1 on postgres 8.3
Next
From: "D'Arcy J.M. Cain"
Date:
Subject: Re: modules