Thread: COPY Transform support

COPY Transform support

From
Dimitri Fontaine
Date:
Hi,

Here's a proposal for COPY to support the T part of an ETL, that is adding the
capability for COPY FROM to Transform the data it gets.

The idea is quite simple: adding to COPY FROM the option to run a function on
the data before to call datatype_in functions. This needs some syntax
addition to be worked out at the COPY side, then the COPY code will have to
run the given function on the read data and consider giving the output of it
to current COPY code (datatype input function).

The function could either get the data as text or bytea, and would have to
return either text or bytea. bytea seems the more sensible choice, as long as
we don't lose encoding information there, which I'm not sure about.

The syntax could be something like:

  COPY mytable FROM '/my/file.txt' WITH COLUMN x CONVERT USING myfunc;

I tried to only add keywords already present in [1], while getting something
meaningful... and x is intended to be the column number, counting from 1.

[1] http://www.postgresql.org/docs/8.3/static/sql-keywords-appendix.html
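
As a minimal sketch of what the transform function itself could look like (the
function is plain SQL; only the CONVERT USING clause is the proposed new
syntax, and mytable/column 2 are just placeholders):

  -- hypothetical per-column transform: turn empty strings into NULLs
  CREATE FUNCTION empty_to_null(data text) RETURNS text AS $$
    SELECT NULLIF($1, '');
  $$ LANGUAGE sql IMMUTABLE;

  -- with the proposed syntax, COPY would run it on column 2 before the
  -- datatype input function sees the value
  COPY mytable FROM '/my/file.txt' WITH COLUMN 2 CONVERT USING empty_to_null;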

Comments?
--
dim

Re: COPY Transform support

From
NikhilS
Date:
Hi,

On Thu, Apr 3, 2008 at 6:47 PM, Dimitri Fontaine <dfontaine@hi-media.com> wrote:
> Here's a proposal for COPY to support the T part of an ETL, that is adding the
> capability for COPY FROM to Transform the data it gets.
> [...]
> Comments?

+1

Data transformation while doing a data load is a requirement now and then.
Considering that users will have to do mass updates *after* the load completes
to mend the data to their liking should be reason enough to do this while the
loading is happening. I think to go about it the right way we should support
the following:

* The ability to provide per-column transformation expressions

  COPY mytable (col1 transform to "col1 + 10", col2 transform to "'Post' || 'greSQL'", col3 ...) FROM ..

* The ability to use any kind of expression while doing the transformation

  The transformation expression should be any expression (basically
  ExecEvalExpr) that can be evaluated to give a resulting value and, obviously,
  a corresponding is_null value too. It could be a system built-in function
  (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or a user defined function.

* The transformation expression can refer to other columns involved in the
  load, so that when the current row is extracted from the input file, the
  current values are used to generate the new resultant values before doing a
  heap_form_tuple. E.g.

  (col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform "UPPER(col1 || col3)", ...)
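
For concreteness, here is how those same expressions read today when applied
after a plain COPY into a staging table (the workaround this proposal wants to
avoid; table and file names are just placeholders):

  -- staging table matching the raw file layout
  CREATE TEMP TABLE loadtable (col1 int, col2 int, col3 text);
  COPY loadtable FROM '/tmp/input.txt';

  -- the same three expressions, applied while moving the data over
  INSERT INTO mytable (col1, col2, col3)
  SELECT col1 + 10, col1 * col2, upper(col1 || col3)
  FROM loadtable;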

I have spent some thought on how to do this and will be happy to share it if
the list is interested. Personally, I think data transformation using such
expressions is a pretty powerful and important activity while doing the data
load itself.

Regards,
Nikhils
--
EnterpriseDB    http://www.enterprisedb.com

Re: COPY Transform support

From
PFC
Date:
> Data transformation while doing a data load is a requirement now and then.
> Considering that users will have to do mass updates *after* the load
> completes to mend the data to their liking should be reason enough to do
> this while the loading is happening. I think to go about it the right way
> we should support the following:
>
> * The ability to provide per-column transformation expressions
> * The ability to use any kind of expression while doing the transformation
>   The transformation expression should be any expression (basically
>   ExecEvalExpr) that can be evaluated to give a resulting value and,
>   obviously, a corresponding is_null value too. It could be a system
>   built-in function (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or a user
>   defined function.
> * The transformation expression can refer to other columns involved in the
>   load, so that when the current row is extracted from the input file, the
>   current values are used to generate the new resultant values before doing
>   a heap_form_tuple. E.g.
>   (col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
>   "UPPER(col1 || col3)", ...)
>
> I have spent some thought on how to do this and will be happy to share it
> if the list is interested. Personally, I think data transformation using
> such expressions is a pretty powerful and important activity while doing
> the data load itself.

Well, since COPY is about as fast as INSERT INTO ... SELECT plus the  
parsing overhead, I suggest adding a special SELECT form that can read from
a file instead of a table, which returns tuples, and which therefore can  
be used and abused to the user's liking. This is a much more powerful  
feature because :
- there is almost no new syntax
- it is much simpler for the user
- lots of existing stuff can be leveraged
EXAMPLE :
Suppose I want to import a MySQL dump file (gasp !) which obviously  
contains lots of crap like 0000-00-00 dates, '' instead of NULL, borken  
foreign keys, etc.
Let's have a new command :

  CREATE FLATFILE READER mydump (
      id      INTEGER,
      date    TEXT,
      ...
  ) FROM file 'dump.txt'
  (followed by delimiter specification syntax identical to COPY, etc)
  ;
This command would create a set-returning function which is basically a
wrapper around the existing parser in COPY. The column definition gives a name
and type to the fields in the text file, and tells the parser what to expect
and what to return. It looks like a table definition, and this is actually
pretty normal: it is, after all, very close to a table.
  INSERT INTO mytable (id, date, ...)
  SELECT id, NULLIF(date, '0000-00-00'), ...
  FROM mydump
  WHERE (FKs check and drop the borken records);
Now I can import data and transform it at will using a simple SELECT. The  
advantage is that everybody will know what to do without learning a new  
command, no awkward syntax (transform...), you can combine columns in  
expressions, JOIN to check FKs, use ORDER BY to get a clustered table,
anything you want, without any extension to the Postgres engine besides  
the creation of this file-parsing set-returning function, which should be  
pretty simple.
Or, if I have a few gigabytes of logs, but I am absolutely not interested  
in inserting them into a table, instead I want to make some statistics, or  
perhaps I want to insert into my table some aggregate computation from  
this data, I would just :

  CREATE FLATFILE READER accesses_dump (
      date    TEXT,
      ip      INET,
      ...
  ) FROM file 'web_server_log.txt';
And I can do some stats without even loading the data :
  SELECT ip, count(*) FROM accesses_dump
  GROUP BY ip
  HAVING count(*) > 1000
  ORDER BY count(*);
Much better than having to load those gigabytes just to make a query on  
them...
Re: COPY Transform support

From
Tom Lane
Date:
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Here's a proposal for COPY to support the T part of an ETL, that is adding the 
> capability for COPY FROM to Transform the data it gets.

> The idea is quite simple: adding to COPY FROM the option to run a function on 
> the data before to call datatype_in functions.

The major concern I have about this is to ensure that no detectable
overhead is added to COPY when the feature isn't being used.

I am not actually convinced that the column-by-column design you seem to
have in mind is worth anything.  The examples that I remember seeing
often involve removing columns, generating one column from multiple ones
or vice versa, dealing with nonstandard column delimiters, etc.  What
would make sense in my mind is a single function taking and returning
text, which is invoked once on each complete input line before it is
broken into fields.
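
A sketch of such a line-level function, assuming a hypothetical COPY option to
invoke it per line (the function itself is plain SQL):

  CREATE FUNCTION fix_line(line text) RETURNS text AS $$
    -- turn a semicolon-separated line into the tab-separated form
    -- that a default-mode COPY expects
    SELECT replace($1, ';', E'\t');
  $$ LANGUAGE sql IMMUTABLE;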

This is, of course, just a substitute for running a sed or perl or
similar script over the data before feeding it to COPY --- and probably
not an amazingly good substitute at that.  For instance, assuming you
like perl for text-wrangling, I'd fully expect the function approach
to be slower than an external script because of the large overhead of
getting into and out of libperl for each line.

In situations where it's actually useful to apply SQL functions rather
than text-mangling operations to the data, you always have the option to
COPY into a temp table and then do INSERT/SELECT from there.

So the whole thing seems just marginally attractive to me.
        regards, tom lane


Re: COPY Transform support

From
Dimitri Fontaine
Date:
On Thursday 03 April 2008, PFC wrote:
> CREATE FLATFILE READER mydump (
>         id      INTEGER,
>         date    TEXT,
>         ...
> ) FROM file 'dump.txt'
> (followed by delimiter specification syntax identical to COPY, etc)
> ;
[...]
>         INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date,
> '0000-00-00'   ), ... FROM mydump WHERE (FKs check and drop the borken
> records);

What do we gain against the current way of doing it, which is:

  COPY loadtable FROM 'dump.txt' WITH ...
  INSERT INTO destination_table(...) SELECT ... FROM loadtable;

--
dim

Re: COPY Transform support

From
PFC
Date:
On Thu, 03 Apr 2008 16:57:53 +0200, Csaba Nagy <nagy@ecircle-ag.com> wrote:

> On Thu, 2008-04-03 at 16:44 +0200, PFC wrote:
>> CREATE FLATFILE READER mydump (
>>     id    INTEGER,
>>     date    TEXT,
>>     ...
>> ) FROM file 'dump.txt'
>> (followed by delimiter specification syntax identical to COPY, etc)
>> ;
>
> Very cool idea, but why would you need to create a reader object
> first ? You should be able to use COPY directly with the target table
> being omitted, meaning the copy will not pump its result in the target
> but be equivalent to a select... and use it in any place where a select
> can be used. This would have absolutely no new  syntax, just the rules
> changed...
>
> Now that I had a second look you actually need the field definitions to
> meaningfully interpret the file,
Yeah, you need to tell Postgres the field names, types, and NULLness  
before it can parse them... or else it's just a plain flat text file which  
makes no sense...
> but then why not use a record
> specification instead of the table in the normal COPY command ? I'm not
> sure if there's any existing syntax for that but I would guess yes...
Hm, yeah, that's even simpler, just create a type for the row (or just  
use table%ROWTYPE if you have a table that fits the description), and tell  
COPY to parse according to the row type definition... smart...
Like :

CREATE TYPE import_rowtype AS (id INTEGER, date TEXT);
  INSERT INTO mytable (id, date, ...)
  SELECT id, NULLIF(date, '0000-00-00')::DATE
  FROM (COPY AS import_rowtype FROM 'mysql_trash.txt') AS foo
  WHERE (FKs check and drop the borken records);
 
Looks clean...
Obviously, in this case (and also in my proposal's case) you must use
COPY and not \copy since it is the database server which will be reading
the file. This could probably be hacked so the client sends the file via the
\copy interface, too...

> In any case, such a feature would help a lot in processing input files
> based also on other existing data in the DB.
Yeah, it would be cool. Also, since COPY TO can use a SELECT as a data source,
you could use postgres to read from a file/pipe, process data, and write to a
file/pipe (kinda better than sed, lol).
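
The write half already works today, something along these lines (staging is
just a placeholder table):

  -- COPY with a query as its source exists since 8.2
  COPY (SELECT upper(name), total * 1.1 FROM staging)
    TO '/tmp/processed.txt' WITH DELIMITER E'\t';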


Re: COPY Transform support

From
Csaba Nagy
Date:
On Thu, 2008-04-03 at 16:44 +0200, PFC wrote:
> CREATE FLATFILE READER mydump (
>     id    INTEGER,
>     date    TEXT,
>     ...
> ) FROM file 'dump.txt'
> (followed by delimiter specification syntax identical to COPY, etc)
> ;

Very cool idea, but why would you need to create a reader object
first ? You should be able to use COPY directly with the target table
being omitted, meaning the copy will not pump its result in the target
but be equivalent to a select... and use it in any place where a select
can be used. This would have absolutely no new  syntax, just the rules
changed... 

Now that I had a second look you actually need the field definitions to
meaningfully interpret the file, but then why not use a record
specification instead of the table in the normal COPY command ? I'm not
sure if there's any existing syntax for that but I would guess yes...

In any case, such a feature would help a lot in processing input files
based also on other existing data in the DB.

Cheers,
Csaba.




Re: COPY Transform support

From
PFC
Date:
>>         INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date,
>> '0000-00-00'   ), ... FROM mydump WHERE (FKs check and drop the borken
>> records);
>
> What do we gain against current way of doing it, which is:
>   COPY loadtable FROM 'dump.txt' WITH ...
>   INSERT INTO destination_table(...) SELECT ... FROM loadtable;
You read and write the data only once instead of twice (faster) if you
want to import all of it. If you just want to compute some aggregates and
store the results in a table, you just read the data once and don't write it
at all.
The advantages are the same as with your proposed transformations to COPY,
except I feel this way of doing it opens more options (like, you can  
combine columns, check FKs at load, do queries on data without loading it,  
don't necessarily have to insert the data in a table, don't have to invent  
a new syntax to express the transformations, etc).






Re: COPY Transform support

From
Dimitri Fontaine
Date:
On Thursday 03 April 2008, Tom Lane wrote:
> The major concern I have about this is to ensure that no detectable
> overhead is added to COPY when the feature isn't being used.

Well, when COLUMN x CONVERT USING or whatever syntax we choose is not used, we
default to the current code path, that is we do not mess with the data content
at all before considering it valid input syntax for the target table datatypes.

And the syntax check is done only once, before beginning to read the data
lines from the file.

> I am not actually convinced that the column-by-column design you seem to
> have in mind is worth anything.  The examples that I remember seeing
> often involve removing columns, generating one column from multiple ones
> or vice versa, dealing with nonstandard column delimiters, etc.

Yes, this is another need, but actually better solved, in my opinion, by
loading data into a (temp) loadtable and then processing it with SQL:

  INSERT INTO destination_table SELECT whatever FROM loadtable;

The problem I'm trying to solve is not this one, I'm trying to have COPY able
to load data into a table when the representation of it we have into the file
does not match what datatype input function expects.

An example might help us talking about the same thing. mysqldump CSV outputs
timestamp sometimes (depending on server version) as '20041002152952' when
PostgreSQL expects '2004-10-02 15:29:52'. I'd like COPY to be able to cope
with this situation.
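
The per-field conversion itself is trivial in SQL; the point is letting COPY
apply it in-flight instead of requiring a staging step. E.g.:

  -- turn mysqldump's compact form into a proper timestamp
  SELECT to_timestamp('20041002152952', 'YYYYMMDDHH24MISS');
  -- gives 2004-10-02 15:29:52 in the session time zone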

Now, another syntax proposal could have both needs solved. We basically
need to be able to transform input fields and process them into input
columns, in a way that N input fields (found in the data file) will get us M
input columns:

  COPY destination_table(col1, col2, col3, col4)
 USING (field1, field2 || field3, myfunc(field4, field5))
  FROM 'file.txt'
  WITH ...

This could get better than preprocessing then COPY then INSERT INTO ... SELECT
because we don't need a temp table (don't need to care about its name being
unique, nor to mess up with temp_buffers), etc.
You're the one able to tell why it'll be better to have one COPY command
instead of a two table steps load, I'm just guessing ;)

And if it's better for the user to preprocess in perl then COPY, he still has
the option.
--
dim

Re: COPY Transform support

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> Here's a proposal for COPY to support the T part of an ETL, that is adding the 
>> capability for COPY FROM to Transform the data it gets.
>
>> The idea is quite simple: adding to COPY FROM the option to run a function on 
>> the data before to call datatype_in functions.
>
> The major concern I have about this is to ensure that no detectable
> overhead is added to COPY when the feature isn't being used.
>
> I am not actually convinced that the column-by-column design you seem to
> have in mind is worth anything.  The examples that I remember seeing
> often involve removing columns, generating one column from multiple ones
> or vice versa, dealing with nonstandard column delimiters, etc.  What
> would makes sense in my mind is a single function taking and returning
> text, which is invoked once on each complete input line before it is
> broken into fields.

I think not having to deal with separating fields is actually one of the few
reasons to do this within COPY. If you can separate the fields out yourself or need to do
something more clever than COPY is capable of to split the columns then you're
better off preprocessing it with perl or something anyways.

To that end all the other use cases you describe could be handled with his
plan. There's nothing stopping you from doing

CREATE READER foo (a integer, b integer)
INSERT INTO b (SELECT a+b FROM foo);
or
INSERT INTO b (SELECT 1, a, b, greatest(a,b) FROM foo)

However I'm not sure we even need new syntax for CREATE READER. I would think
something like this would make more sense:

CREATE FUNCTION transform(integer, integer) RETURNS SETOF b;

COPY b FROM 'foo' USING transform(integer,integer);

> So the whole thing seems just marginally attractive to me.

Everything about ETL is only marginally attractive, but it's something people
spend a lot of time doing. Nobody's come up with any particularly clean
solutions I think.

AFAIK the state of the art is actually to load the data into a table which
closely matches the source material, sometimes just columns of text. Then copy
it all to another table doing transformations. Not impressed.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!


Re: COPY Transform support

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> AFAIK the state of the art is actually to load the data into a table which
> closely matches the source material, sometimes just columns of text. Then copy
> it all to another table doing transformations. Not impressed.

I liked the idea of allowing COPY FROM to act as a table source in a
larger SELECT or INSERT...SELECT.  Not at all sure what would be
involved to implement that, but it seems a lot more flexible than
any other approach.
        regards, tom lane


Re: COPY Transform support

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>   
>> AFAIK the state of the art is actually to load the data into a table which
>> closely matches the source material, sometimes just columns of text. Then copy
>> it all to another table doing transformations. Not impressed.
>>     
>
> I liked the idea of allowing COPY FROM to act as a table source in a
> larger SELECT or INSERT...SELECT.  Not at all sure what would be
> involved to implement that, but it seems a lot more flexible than
> any other approach.
>
>             
>   

Several years ago Bruce and I discussed the then theoretical use of a 
SELECT query as the source for COPY TO, and we agreed that the sane 
analog would be to have an INSERT query as the target of COPY FROM.

This idea seems to take that rather further. If doable I think it would 
be cool, as long as people don't try using it as an alternative storage 
engine. I can just imagine people creating views over such SELECT 
statements ...

cheers

andrew


Re: COPY Transform support

From
Sam Mason
Date:
On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
> > AFAIK the state of the art is actually to load the data into a table which
> > closely matches the source material, sometimes just columns of text. Then copy
> > it all to another table doing transformations. Not impressed.
> 
> I liked the idea of allowing COPY FROM to act as a table source in a
> larger SELECT or INSERT...SELECT.  Not at all sure what would be
> involved to implement that, but it seems a lot more flexible than
> any other approach.

I'm not sure why new syntax is needed; what's wrong with having a simple
set of procedures like:
readtsv(filename TEXT) AS SETOF RECORD

You'd then be free to do whatever "transformations" you wanted:
  INSERT INTO table (i,j)
  SELECT i, MIN(j::INTEGER)
  FROM readtsv('file.dat') x(i INTEGER, j TEXT)
  WHERE j ~ '^[0-9]+$'
  GROUP BY i;
 

You could even have a readlines(filename) procedure that just gives you
back a SETOF TEXT and you can do the parsing yourself.  An associated
regexp split to RECORD would be nice then.
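
Only readlines() would be new there; the regexp machinery already exists, so a
sketch might read (readlines() hypothetical, file name a placeholder):

  -- regexp_split_to_array() is real (8.3); readlines() is the hypothetical
  -- SETOF TEXT reader described above
  SELECT parts[1]::integer AS i, parts[2] AS j
  FROM (SELECT regexp_split_to_array(line, E'\t') AS parts
          FROM readlines('file.dat') AS l(line)) AS x
  WHERE parts[2] ~ '^[0-9]+$';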

 Sam


Re: COPY Transform support

From
Tom Lane
Date:
Sam Mason <sam@samason.me.uk> writes:
> On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
>> I liked the idea of allowing COPY FROM to act as a table source in a
>> larger SELECT or INSERT...SELECT.  Not at all sure what would be
>> involved to implement that, but it seems a lot more flexible than
>> any other approach.

> I'm not sure why new syntax is needed, what's wrong with having a simple
> set of procedures like:
>  readtsv(filename TEXT) AS SETOF RECORD

Yeah, I was thinking about that too.  The main stumbling block is that
you need to somehow expose all of COPY's options for parsing an input
line (CSV vs default mode, quote and delimiter characters, etc).
It's surely doable but it might be pretty ugly compared to bespoke
syntax.

Another thing is that nodeFunctionScan.c is not really designed for
enormous function result sets --- it dumps the results into a tuplestore
whether that's needed or not.  This is a performance bug that we ought
to address anyway, but we'd really have to fix it if we want to approach
the COPY problem this way.  Just sayin'.
        regards, tom lane


Re: COPY Transform support

From
Sam Mason
Date:
On Thu, Apr 03, 2008 at 09:38:42PM -0400, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
> >> I liked the idea of allowing COPY FROM to act as a table source in a
> >> larger SELECT or INSERT...SELECT.  Not at all sure what would be
> >> involved to implement that, but it seems a lot more flexible than
> >> any other approach.
> 
> > I'm not sure why new syntax is needed, what's wrong with having a simple
> > set of procedures like:
> >  readtsv(filename TEXT) AS SETOF RECORD
> 
> Yeah, I was thinking about that too.  The main stumbling block is that
> you need to somehow expose all of COPY's options for parsing an input
> line (CSV vs default mode, quote and delimiter characters, etc).

Guess why I chose a nice simple example!

> It's surely doable but it might be pretty ugly compared to bespoke
> syntax.

Yes, that's an easy way to get it looking pretty.

As an alternative solution, how about having some datatype that stores
these parameters? E.g.:
  CREATE TYPE copyoptions (
    delimiter TEXT CHECK (delimiter <> ''),
    nullstr   TEXT,
    hasheader BOOLEAN,
    quote     TEXT,
    escape    TEXT
  );
 

And have the input_function understand the current PG syntax for COPY
options.  You'd then be able to do:
 copyfrom('dummy.csv',$$ DELIMITER ';' CSV HEADER $$)

And the procedure would be able to pull out what it wanted from the
options.
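
Putting the pieces together, the reader's signature and a call might look
roughly like this (entirely hypothetical, just sketching the idea; the column
definition list is needed because it returns SETOF RECORD):

  -- hypothetical reader taking a filename and a COPY-style options string
  CREATE FUNCTION copyfrom(filename text, options text) RETURNS SETOF record
    AS ... ;  -- would wrap COPY's existing line parser

  -- callers supply a column definition list, as with any RECORD-returning SRF
  SELECT t.a, t.b
    FROM copyfrom('dummy.csv', $$ DELIMITER ';' CSV HEADER $$)
         AS t(a integer, b text);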

> Another thing is that nodeFunctionScan.c is not really designed for
> enormous function result sets --- it dumps the results into a tuplestore
> whether that's needed or not.  This is a performance bug that we ought
> to address anyway, but we'd really have to fix it if we want to approach
> the COPY problem this way.  Just sayin'.

So you'd end up with something resembling a coroutine?  When would it
be good to actually dump everything into a tuplestore as it does at the
moment?

It'll be fun to see how much code breaks because it relies on the
current behaviour of a SRF running to completion without other activity
happening between!

 Sam


Re: COPY Transform support

From
Decibel!
Date:
On Apr 3, 2008, at 4:51 PM, Andrew Dunstan wrote:
> Several years ago Bruce and I discussed the then theoretical use of  
> a SELECT query as the source for COPY TO, and we agreed that the  
> sane analog would be to have an INSERT query as the target of COPY  
> FROM.
>
> This idea seems to take that rather further. If doable I think it  
> would be cool, as long as people don't try using it as an  
> alternative storage engine. I can just imagine people creating  
> views over such SELECT statements ...


Why not? There are certainly cases where doing just that could be very
valuable. Storing older information that you're less likely to query  
comes to mind... in those cases you're going to be seqscanning  
anyway, so being able to read off a compact on-disk form is likely to  
be a win performance-wise. It could certainly be a win storage-wise.

If someone wants to look at syntax options, I'm pretty certain that  
Oracle supports this. IIRC you actually create what appears to the  
database to be a real table, except for restrictions on what you can  
actually do with it (for example, IIRC it's read-only).
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: COPY Transform support

From
Andrew Dunstan
Date:

Decibel! wrote:
> On Apr 3, 2008, at 4:51 PM, Andrew Dunstan wrote:
>> Several years ago Bruce and I discussed the then theoretical use of a 
>> SELECT query as the source for COPY TO, and we agreed that the sane 
>> analog would be to have an INSERT query as the target of COPY FROM.
>>
>> This idea seems to take that rather further. If doable I think it 
>> would be cool, as long as people don't try using it as an alternative 
>> storage engine. I can just imagine people creating views over such 
>> SELECT statements ...
>
>
> Why not? There's certainly cases where doing just that could be very 
> valuable. Storing older information that you're less likely to query 
> comes to mind... in those cases you're going to be seqscanning anyway, 
> so being able to read off a compact on-disk form is likely to be a win 
> performance-wise. It could certainly be a win storage-wise.
>
> If someone wants to look at syntax options, I'm pretty certain that 
> Oracle supports this. IIRC you actually create what appears to the 
> database to be a real table, except for restrictions on what you can 
> actually do with it (for example, IIRC it's read-only).
>

You're serious, aren't you?

Quite apart from any other reason why not, this would be a horrid hack 
and is just the sort of "feature" we rightly eschew, IMNSHO. COPY is 
designed as a bulk load/unload facility. It's fragile enough in that role.

If we really want to support an alternative storage engine then we 
should tackle that front on and not via a back door like this.

cheers

andrew


Re: COPY Transform support

From
Dimitri Fontaine
Date:
On Monday 07 April 2008 21:04:26, Andrew Dunstan wrote:
> Quite apart from any other reason why not, this would be a horrid hack
> and is just the sort of "feature" we rightly eschew, IMNSHO. COPY is
> designed as a bulk load/unload facility. It's fragile enough in that role.

And my main concern would still be left as-is: COPY wouldn't have any facility
to cope with data representations not matching what datatype input functions
want to read. More often than not, you get this kind of data from another
system, so in a deterministic way, and you want a chance to transform its
representation just before PostgreSQL parses it.

May I try to insist on my second syntax proposal form:

  COPY destination_table(col1, col2, col3)
 USING (field1, field2 || field3, myfunc(field4, field5))
  FROM 'file.txt'
  WITH ...

This allows for the case I'm concerned with to be taken care of, AND the other
case pointed out by several posters on this thread too. N input fields, M
stored columns, any line-to-row transformation (after the same column splitting
as of now), any column pre-parsing through SQL-callable functions --- myfunc,
e.g., is called before its result is fed to col3's datatype input function.
And no support (that I see) for an optional second storage system back door.

Comments?
--
dim

Re: COPY Transform support

From
Tom Lane
Date:
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> And my main concern would still be left as-is, COPY wouldn't have any facility 
> to cope with data representation not matching what datatype input functions 
> want to read.

That's sufficiently covered by the proposal to allow a COPY FROM as a
table source within SELECT, no?
        regards, tom lane


Re: COPY Transform support

From
Dimitri Fontaine
Date:
On Tuesday 08 April 2008, Tom Lane wrote:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
> > And my main concern would still be left as-is, COPY wouldn't have any
> > facility to cope with data representation not matching what datatype
> > input functions want to read.
>
> That's sufficiently covered by the proposal to allow a COPY FROM as a
> table source within SELECT, no?

Well, yes, the table source has text as datatypes and the select expressions on
the columns will call whatever function/cast etc. to do the work. But that
opens the door to a second-class-citizen storage solution for PostgreSQL, by
letting the user CREATE VIEW atop of it:

  CREATE VIEW csv_storage AS
  SELECT a, myfunc(b)::timestamp, c::int + 3
  FROM (COPY ... FROM '/tmp/file.csv' ...) AS x(a, b, c)
  WHERE c ~ '^[0-9]+$';

What happens to the view when /tmp/file.csv is changed (new lines appended, or
complete rewrite by another application, etc)?

Andrew's comment is clear about it: he does not want PostgreSQL to offer this
kind of support. I suppose it would be possible to stop CREATE VIEW from
accepting any form of SELECT, but I was hoping for my idea to get back some
attractiveness at this point :)

At least I tried ;)
--
dim

Re: COPY Transform support

From
Tom Lane
Date:
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> On Tuesday 08 April 2008, Tom Lane wrote:
>> That's sufficiently covered by the proposal to allow a COPY FROM as a
>> table source within SELECT, no?

> Well, yes, the table source has text as datatypes and the select expression on 
> the column will call whatever function/cast etc to do the work. But that 
> opens the door to second class citizen storage solution for PostgreSQL, by 
> letting the user CREATE VIEW atop of it:

[ shrug... ]  I don't actually have a problem with that.  If we did want
to prohibit that, we'd have to somehow prohibit SRFs from reading files,
because you can do it today with any untrusted PL.

I note also that presumably COPY FROM 'file' would still be restricted
to superusers, and only COPY FROM STDIN would be available to those
without a license to shoot themselves in the foot.  So the opportunity
to do anything view-like would be restricted to adults(?) anyhow.

(One of the issues that'd have to be addressed to allow a table source
syntax is whether it's sane to allow multiple COPY FROM STDIN in a
single query.  If so, how does it work; if not, how do we prevent it?)
        regards, tom lane


Re: COPY Transform support

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>   
>> On Tuesday 08 April 2008, Tom Lane wrote:
>>     
>>> That's sufficiently covered by the proposal to allow a COPY FROM as a
>>> table source within SELECT, no?
>>>       
>
>   
>> Well, yes, the table source has text as datatypes and the select expression on 
>> the column will call whatever function/cast etc to do the work. But that 
>> opens the door to second class citizen storage solution for PostgreSQL, by 
>> letting the user CREATE VIEW atop of it:
>>     
>
> [ shrug... ]  I don't actually have a problem with that.  If we did want
> to prohibit that, we'd have to somehow prohibit SRFs from reading files,
> because you can do it today with any untrusted PL.
>
> I note also that presumably COPY FROM 'file' would still be restricted
> to superusers, and only COPY FROM STDIN would be available to those
> without a license to shoot themselves in the foot.  So the opportunity
> to do anything view-like would be restricted to adults(?) anyhow.
>   


Yeah, maybe. I will suspend my doubts for now.

> (One of the issues that'd have to be addressed to allow a table source
> syntax is whether it's sane to allow multiple COPY FROM STDIN in a
> single query.  If so, how does it work; if not, how do we prevent it?)
>
>             
>   

I don't see why it shouldn't work. I see that copy.c now looks like it's 
reentrant, unlike the bad days of old. Could we make each COPY target 
behave like an SRF, stashing its data in a tuplestore?

cheers

andrew


Re: COPY Transform support

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> (One of the issues that'd have to be addressed to allow a table source
>> syntax is whether it's sane to allow multiple COPY FROM STDIN in a
>> single query.  If so, how does it work; if not, how do we prevent it?)

> I don't see why it shouldn't work. I see that copy.c now looks like it's 
> reentrant, unlike the bad days of old. Could we make each COPY target 
> behave like an SRF, stashing its data in a tuplestore?

The first question is what is the wire-protocol definition.  In
particular, how would the client know what order to send the COPY
datasets in, if a single query might include multiple COPY FROM STDIN
segments?

Another point is that we surely don't want the implementation to force
use of a tuplestore all the time, so I'm not sure I buy that we can
prevent interleaving of multiple datasets on the wire that way.
        regards, tom lane


Re: COPY Transform support

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Could we make each COPY target 
>> behave like an SRF, stashing its data in a tuplestore?
>>     
>
> The first question is what is the wire-protocol definition.  In
> particular, how would the client know what order to send the COPY
> datasets in, if a single query might include multiple COPY FROM STDIN
> segments?
>
> Another point is that we surely don't want the implementation to force
> use of a tuplestore all the time, so I'm not sure I buy that we can
> prevent interleaving of multiple datasets on the wire that way.
>
>             
>   

Is there a big demand for multiple datasets on the wire in a situation 
like this? How about if we allow multiple COPY targets but at most one 
from STDIN, at least for one go round?

cheers

andrew


Re: COPY Transform support

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Is there a big demand for multiple datasets on the wire in a situation 
> like this? How about if we allow multiple COPY targets but at most one 
> from STDIN, at least for one go round?

That's exactly what I was saying (or at least trying to imply) as the
fallback position.  But you still need a way to enforce that.
        regards, tom lane