Re: COPY Transform support - Mailing list pgsql-hackers
From: PFC
Subject: Re: COPY Transform support
Date:
Msg-id: op.t81hw6jmcigqcu@apollo13.peufeu.com
In response to: Re: COPY Transform support (NikhilS <nikkhils@gmail.com>)
Responses: Re: COPY Transform support; Re: COPY Transform support
List: pgsql-hackers
> Data transformation while doing a data load is a requirement now and then.
> That users will have to do mass updates *after* the load completes to mend
> the data to their liking should be reason enough to do this while the
> loading is happening. I think to go about it the right way we should
> support the following:
>
> * The ability to provide per-column transformation expressions
> * The ability to use any kind of expression while doing the transformation
>
> The transformation expression should be any expression (basically
> ExecEvalExpr) that can be evaluated to give a resulting value and,
> obviously, a corresponding is_null value too. It could be a system built-in
> function (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER, etc.) or a user-defined
> function.
>
> * The transformation expression can refer to other columns involved in the
> load, so that when the current row is extracted from the input file, the
> current values are used to generate the new resultant values before doing a
> heap_form_tuple. E.g.:
>
> (col1 transform "col1 + 10", col2 transform "col1 * col2",
>  col3 transform "UPPER(col1 || col3)", ...)
>
> I have spent some thought on how to do this and will be happy to share it
> if the list is interested. Personally, I think data transformation using
> such expressions is a pretty powerful and important activity while doing
> the data load itself.

Well, since COPY is about as fast as INSERT INTO ... SELECT plus the parsing
overhead, I suggest adding a special SELECT form that can read from a file
instead of a table, which returns tuples, and which can therefore be used and
abused to the user's liking. This is a much more powerful feature because:

- there is almost no new syntax
- it is much simpler for the user
- lots of existing stuff can be leveraged

EXAMPLE: Suppose I want to import a MySQL dump file (gasp!) which obviously
contains lots of crap like 0000-00-00 dates, '' instead of NULL, broken
foreign keys, etc. Let's have a new command:

CREATE FLATFILE READER mydump (
    id INTEGER,
    date TEXT,
    ...
) FROM file 'dump.txt'
(followed by delimiter specification syntax identical to COPY, etc.);

This command would create a set-returning function which is basically a
wrapper around the existing parser in COPY. The column definitions give a
name and type to the fields in the text file, and tell the parser what to
expect and what to return. It looks like a table definition, and that is
actually pretty normal: it is, after all, very close to a table.

INSERT INTO mytable (id, date, ...)
SELECT id, NULLIF(date, '0000-00-00'), ...
FROM mydump
WHERE (FK checks that drop the broken records);

Now I can import data and transform it at will using a simple SELECT. The
advantage is that everybody will know what to do without learning a new
command: no awkward syntax (transform ...), you can combine columns in
expressions, JOIN to check FKs, use ORDER BY to get a clustered table,
anything you want, without any extension to the Postgres engine besides the
creation of this file-parsing set-returning function, which should be pretty
simple.
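For comparison, the closest thing possible today is to COPY the file into a
temporary staging table and apply the transformations on the way out with a
second statement. A rough sketch (mydump_raw and parent are made-up names,
only to illustrate the extra step the reader would eliminate):

CREATE TEMP TABLE mydump_raw (id INTEGER, date TEXT /* , ... */);
COPY mydump_raw FROM 'dump.txt';             -- stage the raw file first
INSERT INTO mytable (id, date /* , ... */)
SELECT id, NULLIF(date, '0000-00-00') /* , ... */
FROM mydump_raw
WHERE id IN (SELECT id FROM parent);         -- parent is hypothetical: drop rows with broken FKs

The FLATFILE READER would simply remove the intermediate table and the extra
write.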
Or, if I have a few gigabytes of logs but am absolutely not interested in
inserting them into a table, because instead I want to compute some
statistics, or perhaps insert into my table some aggregate computed from this
data, I would just:

CREATE FLATFILE READER accesses_dump (
    date TEXT,
    ip INET,
    ...
) FROM file 'web_server_log.txt';

And I can do some stats without even loading the data:

SELECT ip, count(*)
FROM accesses_dump
GROUP BY ip
HAVING count(*) > 1000
ORDER BY count(*);

Much better than having to load those gigabytes just to make a query on them...
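The aggregate-load case mentioned above falls out the same way; a sketch,
assuming a made-up target table daily_ip_hits(ip, hits):

INSERT INTO daily_ip_hits (ip, hits)   -- daily_ip_hits is hypothetical
SELECT ip, count(*)
FROM accesses_dump
GROUP BY ip;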