Thread: Ragged CSV import
I have received a requirement for the ability to import ragged CSV files, i.e. files that contain variable numbers of columns per row. The requirement is that extra columns would be ignored and missing columns filled with NULL. The client wanting this has wrestled with some preprocessors to try to get what they want, but they would feel happier with this built in. This isn't the first time I have received this request since we implemented CSV import. People have complained on numerous occasions about the strictness of the import routines w.r.t. the number of columns. I don't think that providing a CSV import mode for reading these files will be very difficult or invasive. If it's not acceptable, I will provide it to the client as a patch, and I will be retrofitting it to 8.4 anyway. But I think it's not an unreasonable request. Thoughts? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I have received a requirement for the ability to import ragged CSV > files, i.e. files that contain variable numbers of columns per row. The > requirement is that extra columns would be ignored and missing columns > filled with NULL. The client wanting this has wrestled with some > preprocessors to try to get what they want, but they would feel happier > with this built in. This isn't the first time I have received this > request since we implemented CSV import. People have complained on > numerous occasions about the strictness of the import routines w.r.t. > the number of columns. Hmm. Accepting too few columns and filling with nulls isn't any different than what INSERT has always done. But ignoring extra columns seems like a different ballgame. Can you talk your client out of that one? It just seems like a bad idea. As for the "numerous occasions", maybe I've not been paying attention, but I don't recall any ... regards, tom lane
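For reference, the existing INSERT behaviour being compared to: columns omitted from the target list simply get their defaults, which is NULL when no default is declared. A minimal example:

CREATE TABLE t (a int, b int, c int);
INSERT INTO t (a, b) VALUES (1, 2);   -- c is not listed, so it becomes NULL
SELECT * FROM t;                      -- (1, 2, NULL)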
Tom Lane wrote: > As for the "numerous occasions", maybe I've not been paying attention, > but I don't recall any ... I don't know about numerous, but I've seen it in the spanish list; for example: http://archives.postgresql.org/pgsql-es-ayuda/2007-03/msg00901.php -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Andrew Dunstan <andrew@dunslane.net> writes: >> I have received a requirement for the ability to import ragged CSV >> files, i.e. files that contain variable numbers of columns per row. BTW, one other thought about this: I think the historical reason for COPY being strict about the number of incoming columns was that it provided a useful cross-check that the parsing hadn't gone off into the weeds. We have certainly seen enough examples where the reported manifestation of, say, an escaping mistake was that COPY saw the row as having too many or too few columns. So being permissive about it would lose some error detection capability. I am not clear about whether CSV format is sufficiently more robust than the traditional COPY format to render this an acceptable loss. Comments? (One possible answer to this is to make the behavior optional, though surely COPY has got too many options already :-() regards, tom lane
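To make the cross-check concrete: in the invented fragment below, a stray quote in the first data row makes the CSV parser absorb the newline, so the two intended rows come back as a single record with four fields. With a strict column count that is reported as an error; if extra columns were silently ignored, a bogus row would be loaded and the following row lost.

a,b,c
1,"x,2
3,4",5,6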
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> I have received a requirement for the ability to import ragged CSV > >> files, i.e. files that contain variable numbers of columns per row. > > BTW, one other thought about this: I think the historical reason for > COPY being strict about the number of incoming columns was that it > provided a useful cross-check that the parsing hadn't gone off into > the weeds. We have certainly seen enough examples where the reported > manifestation of, say, an escaping mistake was that COPY saw the row > as having too many or too few columns. So being permissive about it > would lose some error detection capability. I am not clear about > whether CSV format is sufficiently more robust than the traditional > COPY format to render this an acceptable loss. Comments? I think accepting fewer columns and filling with nulls should be safe enough for this not to be a problem; if the parser goes nuts, it will die eventually. Silently dropping excess trailing columns does not seem acceptable though; you could lose entire rows and not notice. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On 9/9/09 12:59 PM, Andrew Dunstan wrote: > > I have received a requirement for the ability to import ragged CSV > files, i.e. files that contain variable numbers of columns per row. The > requirement is that extra columns would be ignored and missing columns > filled with NULL. The client wanting this has wrestled with some > preprocessors to try to get what they want, but they would feel happier > with this built in. This isn't the first time I have received this > request since we implemented CSV import. People have complained on > numerous occasions about the strictness of the import routines w.r.t. > the number of columns. Would this just work on columns on the end, or would it work on the basis of parsing the CSV header and matching columns? While the former functionality would be relatively simple, I think the latter is what people really want. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> I have received a requirement for the ability to import ragged CSV >> files, i.e. files that contain variable numbers of columns per row. The >> requirement is that extra columns would be ignored and missing columns >> filled with NULL. The client wanting this has wrestled with some >> preprocessors to try to get what they want, but they would feel happier >> with this built in. This isn't the first time I have received this >> request since we implemented CSV import. People have complained on >> numerous occasions about the strictness of the import routines w.r.t. >> the number of columns. >> > > Hmm. Accepting too few columns and filling with nulls isn't any > different than what INSERT has always done. But ignoring extra columns > seems like a different ballgame. Can you talk your client out of that > one? It just seems like a bad idea. > No, that's critical. The application this is wanted for uploads data that users put in spreadsheets. The users apparently expect that they will be able to put comments on some rows off to the right of the data they want loaded, and have it ignored. To answer your other point made later, my intention was to make this optional behaviour, not default behaviour. I agree that it would be too slack for default behaviour. Yes, we have quite a few options, but that's not surprising in dealing with a format that is at best ill-defined and which we do not control. > As for the "numerous occasions", maybe I've not been paying attention, > but I don't recall any ... > > The requests have been made on IRC, at conferences, in private emails. cheers andrew
Josh Berkus <josh@agliodbs.com> wrote: > Would this just work on columns on the end, or would it work on the > basis of parsing the CSV header and matching columns? > > While the former functionality would be relatively simple, I think the > latter is what people really want. It's been a while since I've had a need for something like this, but of the copy features not currently available in PostgreSQL, the two most useful are to read in only some of the defined columns, and to output to a separate disk file any rows which failed to match the expected format. The latter would not cause the copy to fail unless the count of such rows exceeded a user-specified threshold. -Kevin
Josh Berkus wrote: > On 9/9/09 12:59 PM, Andrew Dunstan wrote: > >> I have received a requirement for the ability to import ragged CSV >> files, i.e. files that contain variable numbers of columns per row. The >> requirement is that extra columns would be ignored and missing columns >> filled with NULL. The client wanting this has wrestled with some >> preprocessors to try to get what they want, but they would feel happier >> with this built in. This isn't the first time I have received this >> request since we implemented CSV import. People have complained on >> numerous occasions about the strictness of the import routines w.r.t. >> the number of columns. >> > > Would this just work on columns on the end, or would it work on the > basis of parsing the CSV header and matching columns? > > While the former functionality would be relatively simple, I think the > latter is what people really want. > > No, it's the former, and that's exactly what the client requested. I'm not talking about parsing the header line - that's a much larger can of worms. cheers andrew
On Wed, 2009-09-09 at 16:34 -0400, Alvaro Herrera wrote: > Tom Lane wrote: > > Andrew Dunstan <andrew@dunslane.net> writes: > > >> I have received a requirement for the ability to import ragged CSV > > >> files, i.e. files that contain variable numbers of columns per row. > > > > BTW, one other thought about this: I think the historical reason for > > COPY being strict about the number of incoming columns was that it > > provided a useful cross-check that the parsing hadn't gone off into > > the weeds. We have certainly seen enough examples where the reported > > manifestation of, say, an escaping mistake was that COPY saw the row > > as having too many or too few columns. So being permissive about it > > would lose some error detection capability. I am not clear about > > whether CSV format is sufficiently more robust than the traditional > > COPY format to render this an acceptable loss. Comments? > > I think accepting fewer columns and filling with nulls should be > safe enough for this not to be a problem; if the parser goes nuts, > it will die eventually. Silently dropping excess trailing columns > does not seem acceptable though; you could lose entire rows and not > notice. Maybe we could put a catch-all "text" or even "text[]" column as the last column of the table and gather all the extra columns there? > -- > Alvaro Herrera http://www.CommandPrompt.com/ > The PostgreSQL Company - Command Prompt, Inc. -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
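A sketch of the table shape that suggestion implies — purely hypothetical, since no COPY mode today would populate such a column:

CREATE TABLE import_target (
    a      int,
    b      int,
    c      int,
    extras text[]   -- under the proposal, an input row 1,2,3,7,8 would give extras = {7,8}
);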
"Dann Corbit" <DCorbit@connx.com> wrote: > Kevin Grittner >> It's been a while since I've had a need for something like this, >> but of the copy features not currently available in PostgreSQL, >> the two most useful are to read in only some of the defined >> columns, and to output to a separate disk file any rows which >> failed to match the expected format. >> The latter would not cause the copy to fail unless the count of >> such rows exceeded a user-specified threshold. > Perhaps something like SQL Server's BCP format files could be > used. > http://support.microsoft.com/kb/67409 > http://technet.microsoft.com/en-us/library/ms178129.aspx > http://www.nigelrivett.net/SQLTsql/BCP_quoted_CSV_Format_file.html > http://doc.ddart.net/mssql/sql70/impt_bcp_16.htm That is what I was thinking of when I wrote the above, although in the Sybase incarnation, before Microsoft split off on their own. (I see they haven't changed sybchar from the Sybase name yet....) My reservation about referencing it is that it includes so many bells and whistles that it's not as easy to use as it might be, even with the "wizard" to generate the format description files. The other problem with it is that it was far and away the *least* stable part of the DBMS. You could count on it being broken in any version until six months to a year into that version's life. We eventually moved away from it because of the instability -- we could write code from scratch each time with more confidence of correct behavior. I think keeping it a little simpler might help with keeping it stable. -Kevin
Dann Corbit wrote: > Perhaps something like SQL Server's BCP format files could be used. > http://support.microsoft.com/kb/67409 > http://technet.microsoft.com/en-us/library/ms178129.aspx > http://www.nigelrivett.net/SQLTsql/BCP_quoted_CSV_Format_file.html > http://doc.ddart.net/mssql/sql70/impt_bcp_16.htm > This looks vastly more complex than anything I am interested in doing. cheers andrew
On Wed, Sep 9, 2009 at 4:13 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> I have received a requirement for the ability to import ragged CSV >> files, i.e. files that contain variable numbers of columns per row. The >> requirement is that extra columns would be ignored and missing columns >> filled with NULL. The client wanting this has wrestled with some >> preprocessors to try to get what they want, but they would feel happier >> with this built in. This isn't the first time I have received this >> request since we implemented CSV import. People have complained on >> numerous occasions about the strictness of the import routines w.r.t. >> the number of columns. > > Hmm. Accepting too few columns and filling with nulls isn't any > different than what INSERT has always done. But ignoring extra columns > seems like a different ballgame. Can you talk your client out of that > one? It just seems like a bad idea. I agree that ignoring extra columns is a bad idea, but I don't even like the idea of ignoring missing columns. It doesn't seem like a good idea to take a spreadsheet and feed it into COPY without doing any validation anyway, and this is the kind of thing that is trivial to clean up with a thin layer of Perl or your scripting language of choice. ...Robert
Robert Haas wrote: > I agree that ignoring extra columns is a bad idea, but I don't even > like the idea of ignoring missing columns. It doesn't seem like a > good idea to take a spreadsheet and feed it into COPY without doing > any validation anyway, and this is the kind of thing that is trivial > to clean up with a thin layer of Perl or your scripting language of > choice. > > If it's an optional feature then I don't see why there is a problem. What skin is it off anyone else's nose but those who choose this behaviour? I am perfectly familiar with Perl and so is the client that requested this feature. They are quite technically savvy. They are using a scripting solution now but they find it cumbersome. As for general validation, the requestor's application in fact loads the spreadsheet into a temp table of text columns and then subjects it to a large variety of complex business rule checking before adding the data to the main tables. It is a whole lot faster and cleaner to do it that way than to try to do it before the data is loaded at all. That's why they aren't concerned about missing columns. cheers andrew
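A minimal sketch of that staging approach (the table and column names are invented for illustration):

CREATE TEMP TABLE staging (cust_no text, cust_name text, amount text);

COPY staging FROM '/path/to/upload.csv' WITH CSV HEADER;

-- business-rule checks run against the staged text, then the real load
INSERT INTO accounts (cust_no, cust_name, amount)
SELECT cust_no, cust_name, amount::numeric
FROM staging
WHERE cust_no ~ '^[0-9]+$'
  AND amount ~ '^-?[0-9]+([.][0-9]+)?$';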
On Wed, Sep 9, 2009 at 7:41 PM, Andrew Dunstan<andrew@dunslane.net> wrote: > > > Robert Haas wrote: >> >> I agree that ignoring extra columns is a bad idea, but I don't even >> like the idea of ignoring missing columns. It doesn't seem like a >> good idea to take a spreadsheet and feed it into COPY without doing >> any validation anyway, and this is the kind of thing that is trivial >> to clean up with a thin layer of Perl or your scripting language of >> choice. >> >> > > If it's an optional feature then I don't see why there is a problem. What > skin is it off anyone else's nose but those whose choose this behaviour? I have to admit I'm usually an advocate of that way of thinking, so maybe I am all wet. I suppose it comes down to whether you think this particular feature is something with broad applicability, or whether there are 100 other equally plausible features. I wonder whether it would be appropriate to do something like implement a method by which copy could return text[] and then one could write wrappers around that functionality to do this as well as other things. For example, suppose you wanted to have rows of the form: A,B,C,X1,Y1,X2,Y2,X3,Y3 ...which gets transformed into an insert of (A,B,C) into a main table and (A,X1,Y1), (A,X2,Y2), (A,X3,Y3) into a side table. (I have actual knowledge of a widely-deployed system produced by a large company that outputs data in a format similar to this, though the actual format is considerably more complex.) ...Robert
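A sketch of the main/side split being described, assuming the raw rows have already been staged somewhere and that main_table and side_table exist (all names are invented for illustration):

CREATE TEMP TABLE staged (
    a text, b text, c text,
    x1 text, y1 text,
    x2 text, y2 text,
    x3 text, y3 text
);

INSERT INTO main_table (a, b, c)
SELECT a, b, c FROM staged;

INSERT INTO side_table (a, x, y)
SELECT a, x1, y1 FROM staged
UNION ALL SELECT a, x2, y2 FROM staged
UNION ALL SELECT a, x3, y3 FROM staged;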
Robert Haas wrote: > I wonder whether it would be appropriate to do something like > implement a method by which copy could return text[] and then one > could write wrappers around that functionality to do this as well as > other things. > It's not a function ... writing a copy() function returning text[] (or setof text[], which would be better) in plperlu would be trivial. But it couldn't read from stdin or be able to be fed data from a client in copy-in mode. And even if we have it what does the sql look like to insert the values into some table with n columns? Anything along these lines in C is likely to be far larger than what I had in mind, which was a fairly minor enhancement. cheers andrew
On Wed, Sep 9, 2009 at 10:40 PM, Andrew Dunstan<andrew@dunslane.net> wrote: > Robert Haas wrote: >> >> I wonder whether it would be appropriate to do something like >> implement a method by which copy could return text[] and then one >> could write wrappers around that functionality to do this as well as >> other things. > > It's not a function ... writing a copy() function returning text[] (or setof > text[], which would be better) in plperlu would be trivial. But it couldn't > read from stdin or be able to be fed data from a client in copy-in mode. Yes, setof text[] is what I meant. Making it be able to read from stdin or be fed data from a client in copy-in mode would be the whole point of the feature, I suppose. > And even if we have it what does the sql look like to insert the values into > some table with n columns? INSERT INTO table SELECT t[1], t[2], t[3], t[4], ..., t[n] FROM (...however we get the copy results...) ...although I'm not entirely sure that's compatible with the client/server protocol. > Anything along these lines in C is likely to be far larger than what I had > in mind, which was a fairly minor enhancement. It wouldn't be the first time that someone was told that a particular enhancement didn't have enough value and that they had to do something larger if they wanted it in core, but on the other hand, I am only expressing my opinion, which is not binding on you or anyone else, nor even a fully-fleshed-out proposal. All I'm saying is that it seems to me that there is value in being able to use the CSV (or other) parsing code, but have some way to make modifications to how/where the data is actually inserted. However, I'm skeptical about whether the specific thing you want to do after parsing (namely, drop excess columns, null-fill missing ones) is sufficiently common to warrant a feature to do only that. YMMV, of course. ...Robert
Andrew Dunstan <andrew@dunslane.net> writes: > Robert Haas wrote: >> I wonder whether it would be appropriate to do something like >> implement a method by which copy could return text[] and then one >> could write wrappers around that functionality to do this as well as >> other things. > Anything along these lines in C is likely to be far larger than what I > had in mind, which was a fairly minor enhancement. I think Robert's got a point though. What you're talking about is a fairly specialized single-purpose feature, which nonetheless is going to require a lot of infrastructure (for example, teaching psql's \copy about it). Perhaps, for approximately the same amount of overhead, we could provide something more general. I don't agree with the idea of injecting something "behind" copy though. The thought that comes to mind for me is something "in front of" copy, that is, give it the text of each line and let it do a text-to-text transformation before COPY chews on it. Any of this is getting into territory we had previously agreed not to let COPY venture into, ie general purpose data transformation. I'm not sure I want to cross that bridge and only get "ignore extra columns" out of it. regards, tom lane
Robert Haas wrote: > However, I'm skeptical about whether the specific > thing you want to do after parsing (namely, drop excess columns, > null-fill missing ones) is sufficiently common to warrant a feature to > do only that. YMMV, of course. > > > So might my experience. I can tell you that I have been asked about this dozens of times since implementing CSV import, in various forums. From day one people have wanted a way of handling CSVs which were not strictly regular. Pretty much every time the request has been something fairly close to this. As for the feature not being large enough, we have a regrettable (IMNSHO) tendency in this project to take simple proposals and embellish them with many bells and whistles. Personally, I'm much more of a fan of incremental development. cheers andrew
On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> Robert Haas wrote: >>> I wonder whether it would be appropriate to do something like >>> implement a method by which copy could return text[] and then one >>> could write wrappers around that functionality to do this as well as >>> other things. > >> Anything along these lines in C is likely to be far larger than what I >> had in mind, which was a fairly minor enhancement. > > I think Robert's got a point though. What you're talking about is a > fairly specialized single-purpose feature, which nonetheless is going to > require a lot of infrastructure (for example, teaching psql's \copy > about it). Perhaps, for approximately the same amount of overhead, > we could provide something more general. > > I don't agree with the idea of injecting something "behind" copy though. > The thought that comes to mind for me is something "in front of" copy, > that is, give it the text of each line and let it do a text-to-text > transformation before COPY chews on it. That seems to me to be a whole lot less useful. As I see it, the whole point of any enhancement in this area is to reuse the parsing code. If I have a CSV file (or some other format COPY understands), I don't want to have to write my own parser for that format in order to do some simple data transformation (like dropping columns >n). Previous agreements notwithstanding, I think letting COPY do general transformations is a great idea. But I'm a lot more skeptical about one specific transformation without some framework of which that case is the first instance. ...Robert
Tom Lane wrote: > What you're talking about is a > fairly specialized single-purpose feature, which nonetheless is going to > require a lot of infrastructure (for example, teaching psql's \copy > about it). Well, that's hardly a lot. > Perhaps, for approximately the same amount of overhead, > we could provide something more general. > Ok. > I don't agree with the idea of injecting something "behind" copy though. > The thought that comes to mind for me is something "in front of" copy, > that is, give it the text of each line and let it do a text-to-text > transformation before COPY chews on it. > you mean some sort of filter mechanism? COPY .... FILTER function_name ( args) ... ? That might work. Then we could provide a couple builtin and people could write others in C or PL/dujour. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > you mean some sort of filter mechanism? > COPY .... FILTER function_name ( args) ... ? > That might work. Then we could provide a couple builtin and people could > write others in C or PL/dujour. Yeah, that's pretty much what I was thinking, although exactly where the filter happens and what its arguments/results are is open to debate. regards, tom lane
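To make the filter idea concrete, here is a naive sketch of the sort of function a user might plug in under such a scheme: a text-to-text transform that pads or truncates each incoming line to exactly n comma-separated fields. (The FILTER syntax itself is only a proposal at this point, the function name is invented, and a plain split on ',' obviously ignores CSV quoting rules.)

CREATE OR REPLACE FUNCTION pad_or_truncate(line text, n int)
RETURNS text
LANGUAGE plpgsql AS $$
DECLARE
    fields text[] := coalesce(string_to_array(line, ','), ARRAY[]::text[]);
BEGIN
    -- too few fields: pad with empty strings up to n
    WHILE coalesce(array_length(fields, 1), 0) < n LOOP
        fields := fields || '';
    END LOOP;
    -- too many fields: keep only the first n
    RETURN array_to_string(fields[1:n], ',');
END;
$$;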
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >> The thought that comes to mind for me is something "in front of" copy, >> that is, give it the text of each line and let it do a text-to-text >> transformation before COPY chews on it. > That seems to me to be a whole lot less useful. As I see it, the > whole point of any enhancement in this area is to reuse the parsing > code. If I have a CSV file (or some other format COPY understands), > I don't want to have to write my own parser for that format in order > to do some simple data transformation (like dropping columns >n). I'm unconvinced. A significant fraction of the COPY customization requests that I can remember had to do with cleaning up input that didn't entirely match COPY's idea of what valid CSV is. If it's close enough that COPY will successfully parse it as N columns, you can stick it into an N-column text table and do your transformations afterwards. Thinking a bit more generally, it seems to me that as COPY IN is currently constituted, there are really four distinguishable bits of functionality in a pipeline: 1. Reading from FE/BE protocol (or file) and conversion to database character set. 2. Parsing the text into a set of de-escaped field values. 3. Converting the field values to internal Datum form (requires knowledge of a set of target data types). 4. Bulk insertion into the target table. It is not hard to come up with problems that are most easily solved by injecting some sort of filtering between any pair of these steps. You're suggesting a filter between 2 and 3, I suggested one between 1 and 2, and the well-known trick of inserting into a temp table and then filtering to a final table amounts to a filter between 3 and 4. We've had proposals before to come up with a form of COPY that can be used as a data source in an INSERT/SELECT context. That would automatically provide a way to filter between steps 3 and 4 --- or maybe better, insist that the COPY-thingy produces just text fields, and leave both steps 3 and 4 to be done by the INSERT/SELECT. With that, the only missing functionality would be a textual filter between steps 1 and 2. regards, tom lane
On Wed, 2009-09-09 at 19:18 -0400, Robert Haas wrote: > I agree that ignoring extra columns is a bad idea, but I don't even > like the idea of ignoring missing columns. It doesn't seem like a > good idea to take a spreadsheet and feed it into COPY without doing > any validation anyway, and this is the kind of thing that is trivial > to clean up with a thin layer of Perl or your scripting language of > choice. I would think that a spreadsheet application has some kind of control over its export format, too. Perhaps you can just tell it to export only the first N columns?
Hi, > the two most > useful are to read in only some of the defined columns, and to output > to > a separate disk file any rows which failed to match the expected > format. > The latter would not cause the copy to fail unless the count of such > rows exceeded a user-specified threshold. +1 Allowing the capability to handle rows that might get discarded due to constraint violations, bad column inputs etc. sounds like a big help while doing large copy operations. Another capability would be to transform the input columns via some SQL expressions before loading them into the table. Given the way UPDATE works, this could avoid the bloat from subsequent updates needed to fine-tune some of the columns. Regards, Nikhils -- http://www.enterprisedb.com
Tom Lane <tgl@sss.pgh.pa.us> writes: > Andrew Dunstan <andrew@dunslane.net> writes: >> you mean some sort of filter mechanism? > >> COPY .... FILTER function_name ( args) ... ? > >> That might work. Then we could provide a couple builtin and people could >> write others in C or PL/dujour. > > Yeah, that's pretty much what I was thinking, although exactly where the > filter happens and what its arguments/results are is open to debate. Last time we opened this can of worms, here's what it led us to: http://archives.postgresql.org/pgsql-hackers/2008-04/msg00222.php That is, COPY support in the FROM clause. Now the example gets rewritten as:

SELECT * FROM function_name(a, b, c) LATERAL (COPY ... FROM ...);

Or with some more nesting if we don't have LATERAL by the time we get COPY support in the FROM clause :) That allows us to simplify some things: COPY in FROM always returns text datums, the function_name can be any callable function (pl/whatever), and we could easily import data whose representation has to be adapted to a PostgreSQL data type's input function (such as timestamps etc.):

INSERT INTO t SELECT myparse(a) FROM (COPY ...);

As already said by Andrew on IRC, though, this does nothing to help him address his direct concern: COPY would still choke on input lines that contain too few or too many columns... Regards, -- dim
Peter Eisentraut wrote: > On Wed, 2009-09-09 at 19:18 -0400, Robert Haas wrote: > >> I agree that ignoring extra columns is a bad idea, but I don't even >> like the idea of ignoring missing columns. It doesn't seem like a >> good idea to take a spreadsheet and feed it into COPY without doing >> any validation anyway, and this is the kind of thing that is trivial >> to clean up with a thin layer of Perl or your scripting language of >> choice. >> > > I would think that a spreadsheet application has some kind of control > over its export format, too. Perhaps you can just tell it to export > only the first N columns? > > Not to my knowledge. In any case, this requires more user work and much more user education, and in the situation I am dealing with neither is likely. cheers andrew
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> you mean some sort of filter mechanism? >> > > >> COPY .... FILTER function_name ( args) ... ? >> > > >> That might work. Then we could provide a couple builtin and people could >> write others in C or PL/dujour. >> > > Yeah, that's pretty much what I was thinking, although exactly where the > filter happens and what its arguments/results are is open to debate. > > > > I have slept on it, and while this is a nice idea, I think your suggestion that this would take about the same effort as what I had in mind (which I could just about have coded by now) is out by at least an order of magnitude. I think it would eat up all my available time between now and Nov 15, and I'm not prepared to spend my time budget that way. I will talk to the requestor and see if we can do something along the lines of Robert's suggestion using a plperlu function and a temp file, which I think I could do within the original time budget. Consider the suggestion withdrawn. cheers andrew
Andrew Dunstan wrote: > As for the feature not being large enough, we have a regrettable > (IMNSHO) tendency in this project to take simple proposals and > embellish them with many bells and whistles. Personally, I'm much > more of a fan of incremental development. +1 -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Sep 9, 2009 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>> The thought that comes to mind for me is something "in front of" copy, >>> that is, give it the text of each line and let it do a text-to-text >>> transformation before COPY chews on it. > >> That seems to me to be a whole lot less useful. As I see it, the >> whole point of any enhancement in this area is to reuse the parsing >> code. If I have a CSV file (or some other format COPY understands), >> I don't want to have to write my own parser for that format in order >> to do some simple data transformation (like dropping columns >n). > > I'm unconvinced. A significant fraction of the COPY customization > requests that I can remember had to do with cleaning up input that > didn't entirely match COPY's idea of what valid CSV is. If it's close > enough that COPY will successfully parse it as N columns, you can stick > it into an N-column text table and do your transformations afterwards. > > Thinking a bit more generally, it seems to me that as COPY IN is > currently constituted, there are really four distinguishable bits > of functionality in a pipeline: > > 1. Reading from FE/BE protocol (or file) and conversion to database > character set. > > 2. Parsing the text into a set of de-escaped field values. > > 3. Converting the field values to internal Datum form (requires > knowledge of a set of target data types). > > 4. Bulk insertion into the target table. > > It is not hard to come up with problems that are most easily solved > by injecting some sort of filtering between any pair of these steps. > You're suggesting a filter between 2 and 3, I suggested one between > 1 and 2, and the well-known trick of inserting into a temp table > and then filtering to a final table amounts to a filter between > 3 and 4. > > We've had proposals before to come up with a form of COPY that can be > used as a data source in an INSERT/SELECT context. That would > automatically provide a way to filter between steps 3 and 4 --- or maybe > better, insist that the COPY-thingy produces just text fields, and leave > both steps 3 and 4 to be done by the INSERT/SELECT. With that, the only > missing functionality would be a textual filter between steps 1 and 2. That would be swell. Being able to use COPY as an expression returning text[] would, I think, cater to an extremely wide variety of use cases, although it sounds like Andrew is not interested in implementing it at this point. If we want to add on a filter between steps 1 and 2, that sounds potentially useful as well. I am fuzzy on the implementation details for making COPY act as a data source for INSERT/SELECT though. I had thought to make EXPLAIN a data source, but it turned out not to be possible (as far as I could tell) without making EXPLAIN a fully-reserved word, which you vetoed. It seems likely that COPY will present similar issues, though I haven't tried. I am also wondering what happens when someone embeds multiple COPY statements in a single query, or sticks one inside of a CTE or on the inner side of a left join. I am inclined to think that we should cause COPY to materialize its input (so rescans will return the same data already read) and throw an error if more than one is present in the same query (unless you have a better idea?). ...Robert
Dimitri Fontaine <dfontaine@hi-media.com> writes: > Last time we opened this can of worms here's what it lead us to: > http://archives.postgresql.org/pgsql-hackers/2008-04/msg00222.php Right, that's the discussion I was remembering. > As already said by Andrew on IRC, though, this goes nowhere to help him > address his direct concern, COPY would still choke on input lines that > are containing too few or too many columns... An easy fix for that is to allow this form of COPY to return text[] instead of a fixed number of text columns. regards, tom lane
On Thu, Sep 10, 2009 at 9:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Dimitri Fontaine <dfontaine@hi-media.com> writes: >> Last time we opened this can of worms here's what it lead us to: >> http://archives.postgresql.org/pgsql-hackers/2008-04/msg00222.php > > Right, that's the discussion I was remembering. > >> As already said by Andrew on IRC, though, this goes nowhere to help him >> address his direct concern, COPY would still choke on input lines that >> are containing too few or too many columns... > > An easy fix for that is to allow this form of COPY to return text[] > instead of a fixed number of text columns. Hmm... in fact, what if we just implemented COPY returning text[] for a first cut, and forget making it act like a table source for now. Then you could COPY to a temp table (with a text[] column) and select from there. There's really two separate features here: COPY producing text[] rather than a record type, and COPY being embeddable in some larger query. ...Robert
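For what it's worth, something close to that is possible today for files with no quoted commas and no backslash escapes: load each raw line into a single text column using a delimiter byte that never occurs in the data, then split it. Out-of-range array subscripts read back as NULL, which gives the null-filling for short rows for free. (The file path is invented; foo is the thread's running example target.)

CREATE TEMP TABLE raw_lines (line text);

-- E'\x01' is just a delimiter assumed never to appear in the data,
-- so each whole line lands in the single column unsplit
COPY raw_lines FROM '/tmp/ragged.csv' WITH DELIMITER E'\x01';

CREATE TEMP TABLE staged AS
SELECT string_to_array(line, ',') AS t
FROM raw_lines;

INSERT INTO foo (x, y, z)
SELECT t[1], t[2], t[3]   -- extra fields are ignored, missing ones come back NULL
FROM staged;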
Robert Haas <robertmhaas@gmail.com> writes: > I am fuzzy on the implementation details for making COPY act as a data > source for INSERT/SELECT though. I had thought to make EXPLAIN a data > source, but it turned out not to be possible (as far as I could tell) > without making EXPLAIN a fully-reserved word, which you vetoed. It > seems likely that COPY will present similar issues, though I haven't > tried. IIRC the previous discussion touched on making it look like a set-returning function, although this would be a shade less convenient for option parsing etc. > I am also wondering what happens when someone embeds multiple COPY > statements in a single query, or sticks one inside of a CTE or on the > inner side of a left join. Yeah, it would need to be restricted somehow. A straight SRF function would materialize its result, but I doubt we want that to happen for COPY. (This brings up the whole question of performance impact, which would have to be thought about and minimized.) regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > Hmm... in fact, what if we just implemented COPY returning text[] for > a first cut, and forget making it act like a table source for now. > Then you could COPY to a temp table (with a text[] column) and select > from there. There's really two separate features here: COPY producing > text[] rather than a record type, and COPY being embeddable in some > larger query. That might be a reasonable way-station ... regards, tom lane
On Thu, 2009-09-10 at 10:09 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I am fuzzy on the implementation details for making COPY act as a data > > source for INSERT/SELECT though. I had thought to make EXPLAIN a data > > source, but it turned out not to be possible (as far as I could tell) > > without making EXPLAIN a fully-reserved word, which you vetoed. It > > seems likely that COPY will present similar issues, though I haven't > > tried. > > IIRC the previous discussion touched on making it look like a > set-returning function, although this would be a shade less convenient > for option parsing etc. > > > I am also wondering what happens when someone embeds multiple COPY > > statements in a single query As long as they COPY from different input files it should be perfectly OK. Though this already goes out into SQL/MED land. > , or sticks one inside of a CTE or on the > > inner side of a left join. > > Yeah, it would need to be restricted somehow. A straight SRF function > would materialize its result, but I doubt we want that to happen for > COPY. Making SRFs streamable is another thing that would make PostgreSQL a lot more powerful for all kinds of computing. > (This brings up the whole question of performance impact, which would > have to be thought about and minimized.) Completely agree - nobody wants a slow COPY command. > regards, tom lane -- Hannu Krosing http://www.2ndQuadrant.com PostgreSQL Scalability and Availability Services, Consulting and Training
Tom Lane wrote: >> As already said by Andrew on IRC, though, this goes nowhere to help him >> address his direct concern, COPY would still choke on input lines that >> are containing too few or too many columns... >> > > An easy fix for that is to allow this form of COPY to return text[] > instead of a fixed number of text columns. > > > I'd like to see some concrete syntax suggestions for such an animal. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> An easy fix for that is to allow this form of COPY to return text[] >> instead of a fixed number of text columns. > I'd like to see some concrete syntax suggestions for such an animal. I think there were some in the previous discussion --- it came down to using a SRF with arguments that specify the copy source and options. Setting the result type as text[] instead of some has-to-be-specified composite type should actually make it a bit easier. regards, tom lane
Hannu Krosing <hannu@2ndQuadrant.com> writes: >> Robert Haas <robertmhaas@gmail.com> writes: >>> I am also wondering what happens when someone embeds multiple COPY >>> statements in a single query > As long as they COPY from different input files it should be perfectly > OK. Though this already goes out into SQL/MED land. Multiple copy from file is okay, copy from stdin not so much. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> An easy fix for that is to allow this form of COPY to return text[] >>> instead of a fixed number of text columns. >>> > > >> I'd like to see some concrete syntax suggestions for such an animal. >> > > I think there were some in the previous discussion --- it came down > to using a SRF with arguments that specify the copy source and options. > Oh, so I guess not stdin. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> I think there were some in the previous discussion --- it came down >> to using a SRF with arguments that specify the copy source and options. > Oh, so I guess not stdin. Well, it'd be nice if stdin were allowed, but we'd have to look into how to prevent multiple reads or re-reads then. regards, tom lane
* Andrew Dunstan (andrew@dunslane.net) wrote: > Consider the suggestion withdrawn. Let's not throw it out completely. The proposal to have COPY return a text[] in some fashion is interesting enough that others (ah, such as myself..) might be willing to put effort into it. Andrew, could you put your thoughts plus some example files onto a wiki page, at least? Then Robert, Tom, myself, etc, could update that to nail down the specification and then it's just an implementation detail, as it were. Thanks! Stephen
Stephen Frost wrote: > * Andrew Dunstan (andrew@dunslane.net) wrote: > >> Consider the suggestion withdrawn. >> > > Let's not throw it out completely. The proposal to have COPY return a > text[] in some fashion is interesting enough that others (ah, such as > myself..) might be willing to put effort into it. Andrew, could you put > your thoughts plus some example files onto a wiki page, at least? Then > Robert, Tom, myself, etc, could update that to nail down the > specification and then it's just an implementation detail, as it were. > > > I don't mind discussing the idea a bit. I don't have any samples readily to hand, but really anything that's not strictly rectangular meets my original case, like:

a,b,c
1,2,3
4,5,6,7
8,9
10,11,12

I do like the idea of COPY returning a SETOF text[], but I am not at all clear on the mechanics of feeding STDIN to an SRF. ISTM that something like a RETURNING clause on COPY and the ability to use it in a FROM clause or something similar might work better. I understand the difficulties, but maybe we could place some restrictions on where it could be used so as to obviate at least some of those. One of the things I like about a SETOF text[] is that it lets you reorder the columns, or cherry-pick which ones you want. In fact, it might be argued that the hacky FORCE NOT NULL option (which has always pained me somewhat, even if it was my idea ;-)) might no longer be needed. I'd love to be able to do something like:

INSERT into foo (x,y,z) select t[3],t[2],t[57] from (COPY RETURNING t FROM stdin CSV);

The only thing that's been seriously on the table that isn't accounted for by something like this is the suggestion of making the header line have some semantic significance, and I'm far from sure that's a good idea. If time were not short on getting features presented I might attempt to do it, but I have one very large monkey (and a few small ones) on my back that I am determined to get rid of by the November CF, and there is not a hope in the world I could get two large features done, even if we had the details of this all sorted out and agreed on. That's why I said "Consider the suggestion withdrawn". cheers andrew
Andrew, * Andrew Dunstan (andrew@dunslane.net) wrote: > I don't mind discussing the idea a bit. Great! Thanks for this. I'll put it up on a wiki and start looking at implementation details and what the semantics should be. Thanks again, Stephen
Hi, Andrew Dunstan <andrew@dunslane.net> writes: > I do like the idea of COPY returning a SETOF text[], but I am not at all > clear on the mechanics of feeding STDIN to an SRF. ISTM that something like > a RETURNING clause on COPY and the ability to use it in a FROM clause or > something similar might work better. I envisioned COPY "just" returning all of what it reads (minus the extra discarded columns, once your proposal gets implemented), in the FROM clause, and the user sorting out what he wants in the SELECT clause. > I understand the difficulties, but > maybe we could place some restrictions on where it could be used so as to > obviate at least some of those. Maybe instead of opening FROM for COPY, having it accepted in WITH would be better, the same way (from the user point of view) that DML RETURNING is being worked on. > INSERT into foo (x,y,z) select t[3],t[2],t[57] from (COPY RETURNING > t FROM stdin CSV); Would become:

WITH csv AS (
    COPY t FROM stdin CSV
)
INSERT INTO foo(x, y, z)
SELECT t[3], t[2], mytimestamptz(t[5], t[6], t[7])
FROM csv;

Now the text[] has a strange feeling, without it it'd be:

WITH csv AS (
    COPY t(a, b, c, d, e, f, g)
    FROM stdin
    CSV IGNORING EXTRA COLUMNS        -- random nice syntax
        MISSING COLUMNS DEFAULTS NULL -- that needs some reality check
)
INSERT INTO foo(x, y, z)
SELECT c, b, mytimestamptz(e, f, g)
FROM csv;

The function mytimestamptz(date text, time text, timezone text) will accept input that PostgreSQL input types would have errored out on... so you can process in one go strange formats from other products. > The only thing that's been seriously on the table that isn't accounted for > by something like this is the suggestion of making the header line have some > semantic significance, and I'm far from sure that's a good idea. Yeah, and it seems only useful when you don't have any way to play with what COPY returns before it goes to a table (magic column reordering or ignoring). Regards, -- dim
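For illustration, a mytimestamptz of that shape can already be written as an ordinary SQL function; this sketch assumes the incoming fields look like '25/12/2009', '13:45:00' and an IANA zone name, and it ignores DST corner cases — the point is only that the parsing rule lives in a user function rather than in the type's input routine:

CREATE FUNCTION mytimestamptz(d text, t text, tz text)
RETURNS timestamptz LANGUAGE sql AS $$
    SELECT (to_timestamp($1 || ' ' || $2, 'DD/MM/YYYY HH24:MI:SS')
            AT TIME ZONE current_setting('TimeZone'))  -- recover the parsed wall-clock value
           AT TIME ZONE $3                             -- and interpret it in the supplied zone
$$;

SELECT mytimestamptz('25/12/2009', '13:45:00', 'Europe/Paris');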
I wrote: > I'd love to be able to do something like > > INSERT into foo (x,y,z) select t[3],t[2],t[57] from (COPY RETURNING > t FROM stdin CSV); > > Some IRC discussion suggested ways we could do better than that syntax. I think my current preferred candidate is something like:

COPY foo (a,b,c)
FROM stdin
CSV
AS t USING (t[3],t[2],t[57]);

I'm not sure how we could plug a filter into that. Maybe a WHERE clause? My immediate need at least doesn't actually involve filtering anything - we load every line in the CSV into a temp table and then filter what we load into the main tables after applying the business rules. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > I wrote: >> I'd love to be able to do something like >> >> INSERT into foo (x,y,z) select t[3],t[2],t[57] from (COPY RETURNING >> t FROM stdin CSV); > Some IRC discussion suggested ways we could do better than that syntax. > I think my current preferred candidate is something like > COPY foo (a,b,c) > FROM stdin > CSV > AS t USING (t[3],t[2],t[57]); [ scratches head... ] What happened to seeing COPY as a data source in a larger command? If that syntax has anything at all to recommend it, I'm not seeing what. It's not extensible and it would require lots of code duplication to implement the impoverished feature set it does have. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> I wrote: >> >>> I'd love to be able to do something like >>> >>> INSERT into foo (x,y,z) select t[3],t[2],t[57] from (COPY RETURNING >>> t FROM stdin CSV); >>> > > >> Some IRC discussion suggested ways we could do better than that syntax. >> I think my current preferred candidate is something like >> > > >> COPY foo (a,b,c) >> FROM stdin >> CSV >> AS t USING (t[3],t[2],t[57]); >> > > [ scratches head... ] What happened to seeing COPY as a data source in > a larger command? If that syntax has anything at all to recommend it, > I'm not seeing what. It's not extensible and it would require lots of > code duplication to implement the impoverished feature set it does have. > > > Well, I think the objection was that it would slow COPY down to have to go through the executor in the copy-as-source scenario. But maybe that would happen anyway, and maybe we don't care, we'd just accept that it wouldn't be nearly as fast as a raw copy. cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > Well, I think the objection was that it would slow COPY down to have to > go though the executor in the copy-as-source scenario. But maybe that > would happen anyway, and maybe we don't care, we'd just accept that it > wouldn't be nearly as fast as a raw copy. I haven't heard complaints about the COPY (query) syntax, which is the same thing in the opposite direction. You can't expect that flexibility costs zero. regards, tom lane
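For reference, the existing outbound form being compared to — real syntax since 8.2, with foo and its columns standing in as the thread's running example:

COPY (SELECT x, y, z FROM foo WHERE z IS NOT NULL) TO STDOUT WITH CSV;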
On Fri, Sep 11, 2009 at 10:27:06AM +0200, Dimitri Fontaine wrote: > Maybe instead of opening FROM for COPY, having it accepted in WITH would > be better, the same way (from the user point of view) that DML RETURNING > is being worked on. ... > WITH csv AS ( > COPY t FROM stdin CSV > ) > INSERT INTO foo(x, y, z) > SELECT t[3], t[2], mytimestamptz(t[5], t[6], t[7]) > FROM csv; > > Now the text[] has a strange feeling, without it it'd be: > > WITH csv AS ( > COPY t(a, b, c, d, e, f, g) > FROM stdin > CSV IGNORING EXTRA COLUMNS -- random nice syntax > MISSING COLUMNS DEFAULTS NULL -- that needs some reality check > ) > INSERT INTO foo(x, y, z) > SELECT c, b, mytimestamptz(e, f, g) > FROM csv; > > The function mytimestamptz(date text, time text, timezone text) will > accept input that PostgreSQL input types would have errored out on... so > you can process in one go strange formats from other products. +1 -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.