Re: COPY Transform support - Mailing list pgsql-hackers

From PFC
Subject Re: COPY Transform support
Msg-id op.t81hw6jmcigqcu@apollo13.peufeu.com
In response to Re: COPY Transform support  (NikhilS <nikkhils@gmail.com>)
Responses Re: COPY Transform support  (Dimitri Fontaine <dfontaine@hi-media.com>)
Re: COPY Transform support  (Csaba Nagy <nagy@ecircle-ag.com>)
List pgsql-hackers
> Data transformation while doing a data load is a requirement now and then.
> Considering that users will have to do mass updates *after* the load
> completes to mend the data to their liking should be reason enough to do
> this while the loading is happening. I think to go about it the right way
> we should support the following:

> * The ability to provide per-column transformation expressions
> * The ability to use any kind of expression while doing the transformation.
> The transformation expression should be any expression (basically
> ExecEvalExpr) that can be evaluated to give a resulting value and,
> obviously, a corresponding is_null value too. It could be a system
> built-in function (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or a
> user-defined function.
> * The transformation expression can refer to other columns involved in the
> load, so that when the current row is extracted from the input file, the
> current values are used to generate the new resultant values before
> doing a heap_form_tuple. E.g.:
> (col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
> "UPPER(col1 || col3)", ...)
> I have spent some thought on how to do this and will be happy to share
> the same if the list is interested. Personally, I think data
> transformation using such expressions is a pretty powerful and important
> activity while doing the data load itself.

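To make the quoted proposal concrete, here is a minimal Python mock-up (outside the server, all names invented) of the per-column transform semantics: each expression is evaluated against the *original* values of the current input row, so "col2 transform 'col1 * col2'" sees the untransformed col1.

```python
# Hypothetical sketch: apply per-column transform expressions to each
# parsed input row before the row would be formed into a tuple.
# Every expression is evaluated against the untransformed input row,
# so transforms may reference other columns of the same row.

def load_with_transforms(rows, transforms):
    """rows: iterable of dicts; transforms: {column: expression string}."""
    for row in rows:
        yield {col: eval(expr, {"upper": str.upper}, dict(row))
               for col, expr in transforms.items()}

rows = [{"col1": 1, "col2": 2, "col3": "x"}]
transforms = {
    "col1": "col1 + 10",                # col1 transform "col1 + 10"
    "col2": "col1 * col2",              # col2 transform "col1 * col2"
    "col3": "upper(str(col1) + col3)",  # col3 transform "UPPER(col1 || col3)"
}
print(list(load_with_transforms(rows, transforms)))
# -> [{'col1': 11, 'col2': 2, 'col3': '1X'}]
```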
Well, since COPY is about as fast as INSERT INTO ... SELECT plus the
parsing overhead, I suggest adding a special SELECT form that can read from
a file instead of a table, which returns tuples, and which can therefore
be used and abused to the user's liking. This is a much more powerful
feature because:
- there is almost no new syntax
- it is much simpler for the user
- lots of existing stuff can be leveraged
EXAMPLE:
Suppose I want to import a MySQL dump file (gasp!) which obviously
contains lots of crap like 0000-00-00 dates, '' instead of NULL, broken
foreign keys, etc.
Let's have a new command:

CREATE FLATFILE READER mydump (
    id      INTEGER,
    date    TEXT,
    ...
) FROM file 'dump.txt'
(followed by delimiter specification syntax identical to COPY, etc.)
;
This command would create a set-returning function which is basically a
wrapper around the existing parser in COPY. The column definitions give a
name and type to the fields in the text file, and tell the parser what to
expect and what to return. It looks like a table definition, and this is
actually pretty normal: it is, after all, very close to a table.
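A user-space approximation of what such a reader would do internally, assuming a simple tab-delimited format (the function and its arguments are illustrative, not proposed syntax):

```python
import io

# Sketch of the FLATFILE READER internals: wrap a COPY-style line parser
# so that each line of the file comes back as a typed tuple, which the
# caller can then filter and transform like rows of a table.
def flatfile_reader(fileobj, coldefs, delimiter="\t", null=r"\N"):
    """coldefs: list of (name, python_type) mirroring the column list."""
    for line in fileobj:
        fields = line.rstrip("\n").split(delimiter)
        yield tuple(None if f == null else typ(f)
                    for f, (name, typ) in zip(fields, coldefs))

dump = io.StringIO("1\t2008-04-03\n2\t0000-00-00\n")
for row in flatfile_reader(dump, [("id", int), ("date", str)]):
    print(row)
# -> (1, '2008-04-03') then (2, '0000-00-00')
```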
INSERT INTO mytable (id, date, ...)
SELECT id, NULLIF(date, '0000-00-00'), ...
FROM mydump
WHERE (FK checks drop the broken records);
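The NULLIF cleanup used in that INSERT is easy to model; a trivial Python analogue (function name is mine, not SQL):

```python
def nullif(value, sentinel):
    """Python analogue of SQL NULLIF(value, sentinel):
    return None (i.e. SQL NULL) when the value equals the sentinel."""
    return None if value == sentinel else value

row = {"id": 2, "date": "0000-00-00"}
cleaned = (row["id"], nullif(row["date"], "0000-00-00"))
print(cleaned)
# -> (2, None)
```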
Now I can import data and transform it at will using a simple SELECT. The
advantage is that everybody will know what to do without learning a new
command: no awkward syntax (transform ...), you can combine columns in
expressions, JOIN to check FKs, use ORDER BY to get a clustered table,
anything you want, without any extension to the Postgres engine besides
the creation of this file-parsing set-returning function, which should be
pretty simple.
Or, suppose I have a few gigabytes of logs that I am absolutely not
interested in inserting into a table; instead I want to compute some
statistics, or perhaps insert into my table some aggregate computed from
this data. I would just:

CREATE FLATFILE READER accesses_dump (
    date    TEXT,
    ip      INET,
    ...
) FROM file 'web_server_log.txt';
And I can do some stats without even loading the data:

SELECT ip, count(*) FROM accesses_dump
GROUP BY ip HAVING count(*) > 1000
ORDER BY count(*);

Much better than having to load those gigabytes just to run a query on
them...
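As a sanity check of that streaming-aggregate claim, here is a small Python equivalent of the GROUP BY / HAVING / ORDER BY query that scans log lines without ever materializing a table (the sample log lines and the threshold are invented):

```python
from collections import Counter

def top_ips(lines, threshold):
    """Emulate: SELECT ip, count(*) FROM accesses_dump
       GROUP BY ip HAVING count(*) > threshold ORDER BY count(*)."""
    hits = Counter(line.split()[0] for line in lines)  # first field = ip
    return sorted(((ip, n) for ip, n in hits.items() if n > threshold),
                  key=lambda t: t[1])

log = ["10.0.0.1 GET /", "10.0.0.2 GET /", "10.0.0.1 GET /img"]
print(top_ips(log, 1))
# -> [('10.0.0.1', 2)]
```

The point the post makes holds here too: the aggregation is a single pass over the stream, so memory is bounded by the number of distinct IPs, not by the size of the file.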
