Thread: COPY Transform support
Hi,

Here's a proposal for COPY to support the T part of an ETL, that is adding the capability for COPY FROM to Transform the data it gets.

The idea is quite simple: adding to COPY FROM the option to run a function on the data before calling the datatype_in functions. This needs some syntax addition to be worked out on the COPY side; then the COPY code will have to run the given function on the read data and hand its output to the current COPY code (datatype input function).

The function could either get the data as text or bytea, and would have to return either text or bytea. bytea seems the more sensible choice, as long as we don't lose encoding information there, which I'm not sure about.

The syntax could be something like:

  COPY mytable FROM '/my/file.txt' WITH COLUMN x CONVERT USING myfunc;

I tried to only add keywords already present in [1], while getting something meaningful... and x is intended to be the column number, counting from 1.

[1] http://www.postgresql.org/docs/8.3/static/sql-keywords-appendix.html

Comments?
--
dim
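To make the proposal concrete, a sketch of what a transform function and its invocation might look like (myfunc and mytable are made-up names, and the WITH COLUMN ... CONVERT USING clause is only the proposed syntax, nothing that runs today):

  -- the transform gets the raw field as text and returns text,
  -- which COPY would then hand to the column's datatype input function
  CREATE FUNCTION myfunc(raw text) RETURNS text AS $$
      SELECT replace($1, ',', '.');  -- e.g. turn '3,14' into '3.14' before numeric_in sees it
  $$ LANGUAGE sql IMMUTABLE;

  -- proposed: run myfunc on the second field of every input line
  COPY mytable FROM '/my/file.txt' WITH COLUMN 2 CONVERT USING myfunc;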
Hi,

On Thu, Apr 3, 2008 at 6:47 PM, Dimitri Fontaine <dfontaine@hi-media.com> wrote:
> Here's a proposal for COPY to support the T part of an ETL, that is adding the
> capability for COPY FROM to Transform the data it gets.
>
> The idea is quite simple: adding to COPY FROM the option to run a function on
> the data before calling datatype_in functions. This needs some syntax
> addition to be worked out at the COPY side, then the COPY code will have to
> run the given function on the read data and consider giving the output of it
> to current COPY code (datatype input function).
>
> The function could either get the data as text or bytea, and would have to
> return either text or bytea. bytea seems the more sensible choice, as long as
> we don't lose encoding information there, which I'm not sure about.
>
> The syntax could be something like:
>   COPY mytable FROM '/my/file.txt' WITH COLUMN x CONVERT USING myfunc;
>
> I tried to only add keywords already present in [1], while getting something
> meaningful... and x is intended to be the column number, counting from 1.
> [1] http://www.postgresql.org/docs/8.3/static/sql-keywords-appendix.html
>
> Comments?
> --
> dim

+1

Data transformation while doing a data load is a requirement now and then. Considering that users will otherwise have to do mass updates *after* the load completes to mend the data to their liking should be reason enough to do this while the loading is happening. I think to go about it the right way we should support the following:

* The ability to provide per-column transformation expressions

  COPY mytable (col1 transform to "col1 + 10", col2 transform to "'Post' || 'greSQL'", col3 ...) FROM ..

* The ability to use any kind of expression while doing the transformation

  The transformation expression should be any expression (basically ExecEvalExpr) that can be evaluated to give a resulting value and obviously a corresponding is_null value too. It could be a system built-in function (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or a user defined function.

* The ability for the transformation expression to refer to other columns involved in the load, so that when the current row is extracted from the input file, the current values are used to generate the new resultant values before doing a heap_form_tuple. E.g.

  (col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform "UPPER(col1 || col3)", ...)

I have spent some thought on how to do this and will be happy to share it if the list is interested. Personally, I think data transformation using such expressions is a pretty powerful and important activity while doing the data load itself.

Regards,
Nikhils
--
EnterpriseDB http://www.enterprisedb.com
> Data transformation while doing a data load is a requirement now and then.
> Considering that users will have to do mass updates *after* the load
> completes to mend the data to their liking should be reason enough to do
> this while the loading is happening. I think to go about it the right way we
> should support the following:
>
> * The ability to provide per-column transformation expressions
>
> * The ability to use any kind of expressions while doing the transformation
> The transformation expression should be any expression (basically
> ExecEvalExpr) that can be evaluated to give a resulting value and obviously
> a corresponding is_null value too. It could be a system built-in function
> (e.g. UPPER, TRIM, TO_CHAR, TO_NUMBER etc.) or a user defined function.
>
> * The transformation expression can refer to other columns involved in the
> load, so that when the current row is extracted from the input file, the
> current values are used to generate the new resultant values before
> doing a heap_form_tuple. E.g.
>
> (col1 transform "col1 + 10", col2 transform "col1 * col2", col3 transform
> "UPPER(col1 || col3)", ...)
>
> I have spent some thought on how to do this and will be happy to share it
> if the list is interested. Personally, I think data transformation
> using such expressions is a pretty powerful and important activity while
> doing the data load itself.

Well, since COPY is about as fast as INSERT INTO ... SELECT plus the parsing overhead, I suggest adding a special SELECT form that can read from a file instead of a table, which returns tuples, and which therefore can be used and abused to the user's liking. This is a much more powerful feature because:

- there is almost no new syntax
- it is much simpler for the user
- lots of existing stuff can be leveraged

EXAMPLE: Suppose I want to import a MySQL dump file (gasp!) which obviously contains lots of crap like 0000-00-00 dates, '' instead of NULL, borken foreign keys, etc. Let's have a new command:

  CREATE FLATFILE READER mydump (
      id INTEGER,
      date TEXT,
      ...
  ) FROM file 'dump.txt'
  (followed by delimiter specification syntax identical to COPY, etc);

This command would create a set-returning function which is basically a wrapper around the existing parser in COPY. The column definition gives a name and type to the fields in the text file, and tells the parser what to expect and what to return. It looks like a table definition, and this is actually pretty normal: it is, after all, very close to a table.

  INSERT INTO mytable (id, date, ...)
  SELECT id, NULLIF( date, '0000-00-00' ), ...
  FROM mydump
  WHERE (FKs check and drop the borken records);

Now I can import data and transform it at will using a simple SELECT. The advantage is that everybody will know what to do without learning a new command, no awkward syntax (transform...), you can combine columns in expressions, JOIN to check FKs, use ORDER to get a clustered table, anything you want, without any extension to the Postgres engine besides the creation of this file-parsing set-returning function, which should be pretty simple.

Or, if I have a few gigabytes of logs, but I am absolutely not interested in inserting them into a table, and instead I want to make some statistics, or perhaps I want to insert into my table some aggregate computation from this data, I would just:

  CREATE FLATFILE READER accesses_dump (
      date TEXT,
      ip INET,
      ...
  ) FROM file 'web_server_log.txt';

And I can do some stats without even loading the data:

  SELECT ip, count(*)
  FROM accesses_dump
  GROUP BY ip
  HAVING count(*) > 1000
  ORDER BY count(*);

Much better than having to load those gigabytes just to make a query on them...
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Here's a proposal for COPY to support the T part of an ETL, that is adding the
> capability for COPY FROM to Transform the data it gets.
>
> The idea is quite simple: adding to COPY FROM the option to run a function on
> the data before calling datatype_in functions.

The major concern I have about this is to ensure that no detectable overhead is added to COPY when the feature isn't being used.

I am not actually convinced that the column-by-column design you seem to have in mind is worth anything. The examples that I remember seeing often involve removing columns, generating one column from multiple ones or vice versa, dealing with nonstandard column delimiters, etc. What would make sense in my mind is a single function taking and returning text, which is invoked once on each complete input line before it is broken into fields.

This is, of course, just a substitute for running a sed or perl or similar script over the data before feeding it to COPY --- and probably not an amazingly good substitute at that. For instance, assuming you like perl for text-wrangling, I'd fully expect the function approach to be slower than an external script because of the large overhead of getting into and out of libperl for each line.

In situations where it's actually useful to apply SQL functions rather than text-mangling operations to the data, you always have the option to COPY into a temp table and then do INSERT/SELECT from there.

So the whole thing seems just marginally attractive to me.

regards, tom lane
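The temp-table route mentioned above looks something like this today (table and column names are invented for the sake of the example):

  CREATE TEMP TABLE load_raw (id text, ts text, val text);

  COPY load_raw FROM '/tmp/dump.txt' WITH DELIMITER E'\t';

  INSERT INTO mytable (id, ts, val)
  SELECT id::integer,
         to_timestamp(ts, 'YYYYMMDDHH24MISS'),
         nullif(val, '')::numeric
    FROM load_raw;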
On Thursday 03 April 2008, PFC wrote:
> CREATE FLATFILE READER mydump (
>     id INTEGER,
>     date TEXT,
>     ...
> ) FROM file 'dump.txt'
> (followed by delimiter specification syntax identical to COPY, etc);
[...]
> INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date,
> '0000-00-00' ), ... FROM mydump WHERE (FKs check and drop the borken
> records);

What do we gain against the current way of doing it, which is:

  COPY loadtable FROM 'dump.txt' WITH ...
  INSERT INTO destination_table(...) SELECT ... FROM loadtable;

--
dim
On Thu, 03 Apr 2008 16:57:53 +0200, Csaba Nagy <nagy@ecircle-ag.com> wrote:
> On Thu, 2008-04-03 at 16:44 +0200, PFC wrote:
>> CREATE FLATFILE READER mydump (
>>     id INTEGER,
>>     date TEXT,
>>     ...
>> ) FROM file 'dump.txt'
>> (followed by delimiter specification syntax identical to COPY, etc);
>
> Very cool idea, but why would you need to create a reader object
> first? You should be able to use COPY directly with the target table
> being omitted, meaning the copy will not pump its result in the target
> but be equivalent to a select... and use it in any place where a select
> can be used. This would have absolutely no new syntax, just the rules
> changed...
>
> Now that I had a second look you actually need the field definitions to
> meaningfully interpret the file,

Yeah, you need to tell Postgres the field names, types, and NULLness before it can parse them... or else it's just a plain flat text file which makes no sense...

> but then why not use a record
> specification instead of the table in the normal COPY command? I'm not
> sure if there's any existing syntax for that but I would guess yes...

Hm, yeah, that's even simpler: just create a type for the row (or just use table%ROWTYPE if you have a table that fits the description), and tell COPY to parse according to the row type definition... smart... Like:

  CREATE TYPE import_rowtype AS (id INTEGER, date TEXT);

  INSERT INTO mytable (id, date, ...)
  SELECT id, NULLIF( date, '0000-00-00' )::DATE
  FROM (COPY AS import_rowtype FROM 'mysql_trash.txt') AS foo
  WHERE (FKs check and drop the borken records);

Looks clean... Obviously, in this case (and also in my proposal's case) you must use COPY and not \copy since it is the database server which will be reading the file. This could probably be hacked so the client sends the file via the \copy interface, too...

> In any case, such a feature would help a lot in processing input files
> based also on other existing data in the DB.

Yeah, it would be cool. Also, since COPY TO can use a SELECT as a data source, you could use postgres to read from a file/pipe, process data, and write to a file/pipe (kinda better than sed, lol)
On Thu, 2008-04-03 at 16:44 +0200, PFC wrote:
> CREATE FLATFILE READER mydump (
>     id INTEGER,
>     date TEXT,
>     ...
> ) FROM file 'dump.txt'
> (followed by delimiter specification syntax identical to COPY, etc);

Very cool idea, but why would you need to create a reader object first? You should be able to use COPY directly with the target table being omitted, meaning the copy will not pump its result in the target but be equivalent to a select... and use it in any place where a select can be used. This would have absolutely no new syntax, just the rules changed...

Now that I had a second look you actually need the field definitions to meaningfully interpret the file, but then why not use a record specification instead of the table in the normal COPY command? I'm not sure if there's any existing syntax for that but I would guess yes...

In any case, such a feature would help a lot in processing input files based also on other existing data in the DB.

Cheers,
Csaba.
>> INSERT INTO mytable (id, date, ...) SELECT id, NULLIF( date,
>> '0000-00-00' ), ... FROM mydump WHERE (FKs check and drop the borken
>> records);
>
> What do we gain against the current way of doing it, which is:
> COPY loadtable FROM 'dump.txt' WITH ...
> INSERT INTO destination_table(...) SELECT ... FROM loadtable;

You read and write the data only once instead of twice (faster) if you want to import all of it. If you just want to compute some aggregates and store the results in a table, you just read the data once and don't write it at all.

The advantages are the same as with your proposed transformations to COPY, except I feel this way of doing it opens more options (like, you can combine columns, check FKs at load, do queries on data without loading it, don't necessarily have to insert the data in a table, don't have to invent a new syntax to express the transformations, etc).
On Thursday 03 April 2008, Tom Lane wrote:
> The major concern I have about this is to ensure that no detectable
> overhead is added to COPY when the feature isn't being used.

Well, when COLUMN x CONVERT USING or whatever syntax we choose is not used, we default to the current code path, that is we do not mess with the data content at all before considering it valid input syntax for the target table datatypes. And the syntax check is done only once, before beginning to read the data lines from the file.

> I am not actually convinced that the column-by-column design you seem to
> have in mind is worth anything. The examples that I remember seeing
> often involve removing columns, generating one column from multiple ones
> or vice versa, dealing with nonstandard column delimiters, etc.

Yes, this is another need, but actually better solved, in my opinion, by loading data into a (temp) loadtable and then processing it with SQL:

  INSERT INTO destination_table SELECT whatever FROM loadtable;

The problem I'm trying to solve is not this one. I'm trying to have COPY able to load data into a table when the representation of it we have in the file does not match what the datatype input function expects. An example might help us talk about the same thing: mysqldump CSV output sometimes represents timestamps (depending on server version) as '20041002152952' when PostgreSQL expects '2004-10-02 15:29:52'. I'd like COPY to be able to cope with this situation.

Now, another syntax proposal could have both needs solved. We basically need to be able to transform input fields and process them into input columns, in a way that N input fields (found in the data file) will get us M input columns:

  COPY destination_table(col1, col2, col3)
  USING (field1, field2 || field3, myfunc(field4, field5))
  FROM 'file.txt' WITH ...

This could get better than preprocessing then COPY then INSERT INTO ... SELECT because we don't need a temp table (don't need to care about its name being unique, nor to mess up with temp_buffers), etc. You're the one able to tell why it'll be better to have one COPY command instead of a two-step load through an intermediate table, I'm just guessing ;)

And if it's better for the user to preprocess in perl then COPY, he still has the option.

--
dim
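For the mysqldump timestamp case, the kind of function such a USING clause would call could be as simple as the following sketch (mysql_timestamp_in is a made-up name, and the COPY ... USING form is still only proposed syntax):

  CREATE FUNCTION mysql_timestamp_in(raw text) RETURNS text AS $$
      -- '20041002152952' -> '2004-10-02 15:29:52', ready for timestamp_in
      SELECT to_char(to_timestamp($1, 'YYYYMMDDHH24MISS'), 'YYYY-MM-DD HH24:MI:SS');
  $$ LANGUAGE sql STABLE;

  -- proposed invocation (not valid today):
  -- COPY destination_table(col1, col2, ts)
  --   USING (field1, field2 || field3, mysql_timestamp_in(field4))
  --   FROM 'file.txt' WITH CSV;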
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Dimitri Fontaine <dfontaine@hi-media.com> writes: >> Here's a proposal for COPY to support the T part of an ETL, that is adding the >> capability for COPY FROM to Transform the data it gets. > >> The idea is quite simple: adding to COPY FROM the option to run a function on >> the data before to call datatype_in functions. > > The major concern I have about this is to ensure that no detectable > overhead is added to COPY when the feature isn't being used. > > I am not actually convinced that the column-by-column design you seem to > have in mind is worth anything. The examples that I remember seeing > often involve removing columns, generating one column from multiple ones > or vice versa, dealing with nonstandard column delimiters, etc. What > would makes sense in my mind is a single function taking and returning > text, which is invoked once on each complete input line before it is > broken into fields. I think not having to deal with separating fields is actually one of the few reasons to do this within COPY. If you can separate out yourself or need to do something more clever than COPY is capable of to split the columns then you're better off preprocessing it with perl or something anyways. To that end all the other use cases you describe could be handled with his plan. There's nothing stopping you from doing CREATE READER foo (a integer, b integer) INSERT INTO b (SELECT a+b FROM foo); or INSERT INTO b (SELECT 1, a, b, greatest(a,b) FROM foo) However I'm not sure we even need new syntax for CREATE READER. I would think something like this would make more sense: CREATE FUNCTION transform(integer, integer) RETURNS SETOF b; COPY b FROM 'foo' USING transform(integer,integer); > So the whole thing seems just marginally attractive to me. Everything about ETL is only marginally attractive, but it's something people spend a lot of time doing. Nobody's come up with any particularly clean solutions I think. AFAIK the state of the art is actually to load the data into a table which closely matches the source material, sometimes just columns of text. Then copy it all to another table doing transformations. Not impressed. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark <stark@enterprisedb.com> writes:
> AFAIK the state of the art is actually to load the data into a table which
> closely matches the source material, sometimes just columns of text. Then copy
> it all to another table doing transformations. Not impressed.

I liked the idea of allowing COPY FROM to act as a table source in a larger SELECT or INSERT...SELECT. Not at all sure what would be involved to implement that, but it seems a lot more flexible than any other approach.

regards, tom lane
Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> AFAIK the state of the art is actually to load the data into a table which
>> closely matches the source material, sometimes just columns of text. Then copy
>> it all to another table doing transformations. Not impressed.
>
> I liked the idea of allowing COPY FROM to act as a table source in a
> larger SELECT or INSERT...SELECT. Not at all sure what would be
> involved to implement that, but it seems a lot more flexible than
> any other approach.

Several years ago Bruce and I discussed the then theoretical use of a SELECT query as the source for COPY TO, and we agreed that the sane analog would be to have an INSERT query as the target of COPY FROM.

This idea seems to take that rather further. If doable I think it would be cool, as long as people don't try using it as an alternative storage engine. I can just imagine people creating views over such SELECT statements ...

cheers

andrew
On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> AFAIK the state of the art is actually to load the data into a table which
>> closely matches the source material, sometimes just columns of text. Then copy
>> it all to another table doing transformations. Not impressed.
>
> I liked the idea of allowing COPY FROM to act as a table source in a
> larger SELECT or INSERT...SELECT. Not at all sure what would be
> involved to implement that, but it seems a lot more flexible than
> any other approach.

I'm not sure why new syntax is needed; what's wrong with having a simple set of procedures like:

  readtsv(filename TEXT) RETURNS SETOF RECORD

You'd then be free to do whatever "transformations" you wanted:

  INSERT INTO table (i,j)
  SELECT i, MIN(j::INTEGER)
  FROM readtsv('file.dat') x(i INTEGER, j TEXT)
  WHERE j ~ '^[0-9]+$'
  GROUP BY i;

You could even have a readlines(filename) procedure that just gives you back a SETOF TEXT and you can do the parsing yourself. An associated regexp split to RECORD would be nice then.

Sam
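The readlines() half is already doable with an untrusted PL; a minimal sketch in plperlu (function name and error handling deliberately simple, not a finished implementation):

  CREATE OR REPLACE FUNCTION readlines(filename text) RETURNS SETOF text AS $$
      # read the file line by line and return each line as a row
      open(my $fh, '<', $_[0]) or elog(ERROR, "could not open $_[0]: $!");
      while (my $line = <$fh>) {
          chomp $line;
          return_next($line);
      }
      close($fh);
      return undef;
  $$ LANGUAGE plperlu;

  -- then, for instance:
  -- SELECT count(*) FROM readlines('/tmp/web_server_log.txt') AS l(line)
  --  WHERE line LIKE '%404%';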
Sam Mason <sam@samason.me.uk> writes:
> On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
>> I liked the idea of allowing COPY FROM to act as a table source in a
>> larger SELECT or INSERT...SELECT. Not at all sure what would be
>> involved to implement that, but it seems a lot more flexible than
>> any other approach.
>
> I'm not sure why new syntax is needed, what's wrong with having a simple
> set of procedures like:
> readtsv(filename TEXT) RETURNS SETOF RECORD

Yeah, I was thinking about that too. The main stumbling block is that you need to somehow expose all of COPY's options for parsing an input line (CSV vs default mode, quote and delimiter characters, etc). It's surely doable but it might be pretty ugly compared to bespoke syntax.

Another thing is that nodeFunctionScan.c is not really designed for enormous function result sets --- it dumps the results into a tuplestore whether that's needed or not. This is a performance bug that we ought to address anyway, but we'd really have to fix it if we want to approach the COPY problem this way. Just sayin'.

regards, tom lane
On Thu, Apr 03, 2008 at 09:38:42PM -0400, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
>> On Thu, Apr 03, 2008 at 03:57:38PM -0400, Tom Lane wrote:
>>> I liked the idea of allowing COPY FROM to act as a table source in a
>>> larger SELECT or INSERT...SELECT. Not at all sure what would be
>>> involved to implement that, but it seems a lot more flexible than
>>> any other approach.
>
>> I'm not sure why new syntax is needed, what's wrong with having a simple
>> set of procedures like:
>> readtsv(filename TEXT) RETURNS SETOF RECORD
>
> Yeah, I was thinking about that too. The main stumbling block is that
> you need to somehow expose all of COPY's options for parsing an input
> line (CSV vs default mode, quote and delimiter characters, etc).

Guess why I chose a nice simple example!

> It's surely doable but it might be pretty ugly compared to bespoke
> syntax.

Yes, that's an easy way to get it looking pretty. As an alternative solution, how about having some datatype that stores these parameters. E.g:

  CREATE TYPE copyoptions (
      delimiter TEXT CHECK (delimiter <> ''),
      nullstr TEXT,
      hasheader BOOLEAN,
      quote TEXT,
      escape TEXT
  );

And have the input_function understand the current PG syntax for COPY options. You'd then be able to do:

  copyfrom('dummy.csv', $$ DELIMITER ';' CSV HEADER $$)

And the procedure would be able to pull out what it wanted from the options.

> Another thing is that nodeFunctionScan.c is not really designed for
> enormous function result sets --- it dumps the results into a tuplestore
> whether that's needed or not. This is a performance bug that we ought
> to address anyway, but we'd really have to fix it if we want to approach
> the COPY problem this way. Just sayin'.

So you'd end up with something resembling a coroutine? When would it be good to actually dump everything into a tuplestore as it does at the moment? It'll be fun to see how much code breaks because it relies on the current behaviour of a SRF running to completion without other activity happening in between!

Sam
On Apr 3, 2008, at 4:51 PM, Andrew Dunstan wrote:
> Several years ago Bruce and I discussed the then theoretical use of
> a SELECT query as the source for COPY TO, and we agreed that the
> sane analog would be to have an INSERT query as the target of COPY
> FROM.
>
> This idea seems to take that rather further. If doable I think it
> would be cool, as long as people don't try using it as an
> alternative storage engine. I can just imagine people creating
> views over such SELECT statements ...

Why not? There's certainly cases where doing just that could be very valuable. Storing older information that you're less likely to query comes to mind... in those cases you're going to be seqscanning anyway, so being able to read off a compact on-disk form is likely to be a win performance-wise. It could certainly be a win storage-wise.

If someone wants to look at syntax options, I'm pretty certain that Oracle supports this. IIRC you actually create what appears to the database to be a real table, except for restrictions on what you can actually do with it (for example, IIRC it's read-only).

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
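The Oracle feature referred to here is external tables; from what I recall the definition looks roughly like the following (directory object, table and column names are all invented, and details such as field format masks may be off):

  CREATE TABLE old_logs_ext (
      log_date  DATE,
      ip        VARCHAR2(15),
      url       VARCHAR2(4000)
  )
  ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY log_dir
      ACCESS PARAMETERS (
          RECORDS DELIMITED BY NEWLINE
          FIELDS TERMINATED BY ','
      )
      LOCATION ('old_logs.csv')
  )
  REJECT LIMIT UNLIMITED;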
Decibel! wrote:
> On Apr 3, 2008, at 4:51 PM, Andrew Dunstan wrote:
>> Several years ago Bruce and I discussed the then theoretical use of a
>> SELECT query as the source for COPY TO, and we agreed that the sane
>> analog would be to have an INSERT query as the target of COPY FROM.
>>
>> This idea seems to take that rather further. If doable I think it
>> would be cool, as long as people don't try using it as an alternative
>> storage engine. I can just imagine people creating views over such
>> SELECT statements ...
>
> Why not? There's certainly cases where doing just that could be very
> valuable. Storing older information that you're less likely to query
> comes to mind... in those cases you're going to be seqscanning anyway,
> so being able to read off a compact on-disk form is likely to be a win
> performance-wise. It could certainly be a win storage-wise.
>
> If someone wants to look at syntax options, I'm pretty certain that
> Oracle supports this. IIRC you actually create what appears to the
> database to be a real table, except for restrictions on what you can
> actually do with it (for example, IIRC it's read-only).

You're serious aren't you?

Quite apart from any other reason why not, this would be a horrid hack and is just the sort of "feature" we rightly eschew, IMNSHO. COPY is designed as a bulk load/unload facility. It's fragile enough in that role.

If we really want to support an alternative storage engine then we should tackle that front on and not via a back door like this.

cheers

andrew
On Monday 07 April 2008 21:04:26, Andrew Dunstan wrote:
> Quite apart from any other reason why not, this would be a horrid hack
> and is just the sort of "feature" we rightly eschew, IMNSHO. COPY is
> designed as a bulk load/unload facility. It's fragile enough in that role.

And my main concern would still be left as-is: COPY wouldn't have any facility to cope with data representations not matching what datatype input functions want to read. More often than not, you get this kind of data from another system, so in a deterministic way, and you want a chance to transform its representation just before PostgreSQL parses it.

May I try to insist on my second syntax proposal form:

  COPY destination_table(col1, col2, col3)
  USING (field1, field2 || field3, myfunc(field4, field5))
  FROM 'file.txt' WITH ...

This allows for the case I'm concerned with to be taken care of, AND the other case pointed out by several posters on this thread too. N input fields, M stored columns, any line-to-row transformation (after the same column splitting as of now), any column pre-parsing through SQL-callable functions --- myfunc is called before feeding col3's datatype input function, e.g. And no support (that I see) for an optional second-storage-system back door.

Comments?
--
dim
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> And my main concern would still be left as-is, COPY wouldn't have any facility
> to cope with data representation not matching what datatype input functions
> want to read.

That's sufficiently covered by the proposal to allow a COPY FROM as a table source within SELECT, no?

regards, tom lane
On Tuesday 08 April 2008, Tom Lane wrote:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> And my main concern would still be left as-is, COPY wouldn't have any
>> facility to cope with data representation not matching what datatype
>> input functions want to read.
>
> That's sufficiently covered by the proposal to allow a COPY FROM as a
> table source within SELECT, no?

Well, yes, the table source has text as its datatypes and the select expression on the column will call whatever function/cast etc. to do the work. But that opens the door to a second-class-citizen storage solution for PostgreSQL, by letting the user CREATE VIEW atop it:

  CREATE VIEW csv_storage AS
  SELECT a, myfunc(b)::timestamp, c::int + 3
  FROM (COPY ... FROM '/tmp/file.csv' ...) AS x(a, b, c)
  WHERE c ~ '^[0-9]+$';

What happens to the view when /tmp/file.csv is changed (new lines appended, or complete rewrite by another application, etc)? Andrew's comment is clear about it: he does not want PostgreSQL to offer this kind of support.

I suppose it would be possible to stop CREATE VIEW from accepting any such form of SELECT, but I was hoping for my idea to get back some attractiveness at this point :)

At least I tried ;)
--
dim
Dimitri Fontaine <dfontaine@hi-media.com> writes:
> On Tuesday 08 April 2008, Tom Lane wrote:
>> That's sufficiently covered by the proposal to allow a COPY FROM as a
>> table source within SELECT, no?
>
> Well, yes, the table source has text as its datatypes and the select expression on
> the column will call whatever function/cast etc. to do the work. But that
> opens the door to a second-class-citizen storage solution for PostgreSQL, by
> letting the user CREATE VIEW atop it:

[ shrug... ] I don't actually have a problem with that. If we did want to prohibit that, we'd have to somehow prohibit SRFs from reading files, because you can do it today with any untrusted PL.

I note also that presumably COPY FROM 'file' would still be restricted to superusers, and only COPY FROM STDIN would be available to those without a license to shoot themselves in the foot. So the opportunity to do anything view-like would be restricted to adults(?) anyhow.

(One of the issues that'd have to be addressed to allow a table source syntax is whether it's sane to allow multiple COPY FROM STDIN in a single query. If so, how does it work; if not, how do we prevent it?)

regards, tom lane
Tom Lane wrote:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> On Tuesday 08 April 2008, Tom Lane wrote:
>>> That's sufficiently covered by the proposal to allow a COPY FROM as a
>>> table source within SELECT, no?
>
>> Well, yes, the table source has text as its datatypes and the select expression on
>> the column will call whatever function/cast etc. to do the work. But that
>> opens the door to a second-class-citizen storage solution for PostgreSQL, by
>> letting the user CREATE VIEW atop it:
>
> [ shrug... ] I don't actually have a problem with that. If we did want
> to prohibit that, we'd have to somehow prohibit SRFs from reading files,
> because you can do it today with any untrusted PL.
>
> I note also that presumably COPY FROM 'file' would still be restricted
> to superusers, and only COPY FROM STDIN would be available to those
> without a license to shoot themselves in the foot. So the opportunity
> to do anything view-like would be restricted to adults(?) anyhow.

Yeah, maybe. I will suspend my doubts for now.

> (One of the issues that'd have to be addressed to allow a table source
> syntax is whether it's sane to allow multiple COPY FROM STDIN in a
> single query. If so, how does it work; if not, how do we prevent it?)

I don't see why it shouldn't work. I see that copy.c now looks like it's reentrant, unlike the bad days of old. Could we make each COPY target behave like an SRF, stashing its data in a tuplestore?

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> (One of the issues that'd have to be addressed to allow a table source
>> syntax is whether it's sane to allow multiple COPY FROM STDIN in a
>> single query. If so, how does it work; if not, how do we prevent it?)
>
> I don't see why it shouldn't work. I see that copy.c now looks like it's
> reentrant, unlike the bad days of old. Could we make each COPY target
> behave like an SRF, stashing its data in a tuplestore?

The first question is what is the wire-protocol definition. In particular, how would the client know what order to send the COPY datasets in, if a single query might include multiple COPY FROM STDIN segments?

Another point is that we surely don't want the implementation to force use of a tuplestore all the time, so I'm not sure I buy that we can prevent interleaving of multiple datasets on the wire that way.

regards, tom lane
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Could we make each COPY target
>> behave like an SRF, stashing its data in a tuplestore?
>
> The first question is what is the wire-protocol definition. In
> particular, how would the client know what order to send the COPY
> datasets in, if a single query might include multiple COPY FROM STDIN
> segments?
>
> Another point is that we surely don't want the implementation to force
> use of a tuplestore all the time, so I'm not sure I buy that we can
> prevent interleaving of multiple datasets on the wire that way.

Is there a big demand for multiple datasets on the wire in a situation like this? How about if we allow multiple COPY targets but at most one from STDIN, at least for one go round?

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> Is there a big demand for multiple datasets on the wire in a situation
> like this? How about if we allow multiple COPY targets but at most one
> from STDIN, at least for one go round?

That's exactly what I was saying (or at least trying to imply) as the fallback position. But you still need a way to enforce that.

regards, tom lane