Thread: COPY formatting
Hi,

in the TODO there is the item: "* Allow dump/load of CSV format". I don't
think it's a clean idea. Why CSV, and why not something else? :-)

And why not allow users full control of the format with their own
function? It means something like:

    COPY tablename [ ( column [, ...] ) ]
        TO { 'filename' | STDOUT }
        [ [ WITH ]
            [ BINARY ]
            [ OIDS ]
            [ DELIMITER [ AS ] 'delimiter' ]
            [ NULL [ AS ] 'null string' ]
            [ FORMAT funcname ] ]
              ^^^^^^^^^^^^^^^^

The formatting function API can be pretty simple:

    text *my_copy_format(text *attrdata, int direction,
                         int nattrs, int attr, oid attrtype, oid relation)

-- it's pseudocode of course; it should use the standard fmgr interface.

It's probably interesting for the non-binary COPY version.

Comments?

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
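As a rough illustration of what one call of the proposed hook might do, here is a toy pipe-delimited formatter written as plain C rather than real fmgr code. The function name, the direction constant, and the use of plain char pointers instead of text datums are all simplifications invented for this sketch:

```c
#include <stdlib.h>
#include <string.h>

#define COPY_FORMAT_OUT 0   /* hypothetical direction flag for COPY TO */

/*
 * Toy stand-in for the proposed per-attribute formatting hook: given one
 * attribute's textual value, return the string to emit for it.  Here the
 * "format" is simply pipe-delimited output: every attribute except the
 * last gets a trailing '|'.  A real formatter would be an fmgr-callable
 * function working with text datums, and could use the attrtype and
 * relation arguments to make type-aware decisions.
 */
char *format_attr(const char *attrdata, int direction,
                  int nattrs, int attr)
{
    size_t len = strlen(attrdata);
    char *out = malloc(len + 2);        /* value + optional '|' + NUL */

    (void) direction;                   /* only COPY_FORMAT_OUT sketched */
    memcpy(out, attrdata, len);
    if (attr < nattrs - 1)
        out[len++] = '|';
    out[len] = '\0';
    return out;
}
```

The caller would invoke it once per attribute per row; for a 3-column row the last column gets no trailing delimiter.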
Karel Zak wrote:
> in TODO is item: "* Allow dump/load of CSV format". I don't think
> it's clean idea. Why CSV and why not something other? :-)
>
> A why not allow to users full control of the format by they own
> function. [...]
>
> It's probably interesting for non-binary COPY version.

Karel,

This seems to be an excellent idea.
People have already asked for many different formats.
Usually I recommend that they use "psql -c "COPY ..." dbname | awk ...".
Since Windows will be supported soon, it will be hard to pipe data to a
useful program (awk, sed, ...). Maybe this feature would help a lot in
that case.

    Regards,

        Hans

--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at
Hans-Jürgen Schönig wrote:
> Karel Zak wrote:
>> A why not allow to users full control of the format by they own
>> function. [...]
>
> This seems to be an excellent idea.
> People have already asked for many different formats.
> [...]

Could a CSV-generating function be provided with the distribution then?

Regards,
Fernando
Karel Zak <zakkr@zf.jcu.cz> writes:
> The formatting function API can be pretty simple:
>     text *my_copy_format(text *attrdata, int direction,
>                          int nattrs, int attr, oid attrtype, oid relation)

This seems like it could only reasonably be implemented as a C function.
I can't really imagine the average user of COPY wanting to write C in
preference to, say, an external perl script.  What's the real use-case
for the feature?

			regards, tom lane
Karel Zak wrote:
> in TODO is item: "* Allow dump/load of CSV format". I don't think
> it's clean idea. Why CSV and why not something other? :-)
> [...]
> It's probably interesting for non-binary COPY version.

Interesting ... The alternative might be an external program to munge
CSVs and whatever other formats people want to support, and then call
the existing COPY -- either in bin or contrib. I have seen lots of
people wanting to import CSVs, and that's even before we get a Windows
port.

cheers

andrew
Tom Lane wrote:
> This seems like it could only reasonably be implemented as a C function.
> I can't really imagine the average user of COPY wanting to write C in
> preference to, say, an external perl script.  What's the real use-case
> for the feature?

That is why I suggested providing a pre-written/pre-compiled/installed
function for CSV (call it CSV?).  Advanced users could still write
their own, as people can write many other things if they know their ways.
Fernando Nasser wrote:
> That is why I suggested providing a pre-written/pre-compiled/installed
> function for CSV (call it CSV?).  Advanced users could still write
> their own, as people can write many other things if they know their ways.

As someone who just went through a whole truckload of crap getting
delimited files parsed from MSSQL to PostgreSQL, I believe this would be
a great thing. We ended up using plPython with the CSV module.

Sincerely,

Joshua Drake

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL
I deal with this daily in a cygwin environment. I wrote a simple C++
program where I hardcoded the input file name/location and output file
name/location. I strip out the quotation marks where they are used to
identify text fields, and change the commas used as CSV separators to
pipes. I use a combination of bash scripting to execute the C++ program
and then Perl to execute a stored procedure. I am new to Perl, so I have
not yet considered migrating it all into Perl. The dos2unix tools in
cygwin always messed up the first character of the first line.

I thought the real issue with the COPY function and CSVs was that it did
not like the use of quotations around fields to identify text fields.

For a true Windows port, handling MS Excel files in their native format
would be a goal I would hope for. If your API could handle that, then I
would agree with your method.

On Wed, 2004-03-17 at 22:14, Joshua D. Drake wrote:
> As someone who just went through a whole truckload of crap getting
> delimited files parsed from MSSQL to PostgreSQL, I believe this would
> be a great thing. We ended up using plPython with the CSV module.
On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
> Karel Zak <zakkr@zf.jcu.cz> writes:
>> The formatting function API can be pretty simple:
>>     text *my_copy_format(text *attrdata, int direction,
>>                          int nattrs, int attr, oid attrtype, oid relation)
>
> This seems like it could only reasonably be implemented as a C function.

Why? I said it's pseudocode. It should use the standard fmgr API like
every other PostgreSQL function -- or is that a problem and am I
overlooking something? It must support arbitrary programming languages,
not only C.

> I can't really imagine the average user of COPY wanting to write C in
> preference to, say, an external perl script.  What's the real use-case
> for the feature?

Don't hardcode any format into PostgreSQL; be open to other formats.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
On Wed, Mar 17, 2004 at 10:11:48AM -0500, Fernando Nasser wrote:
> Could a CSV-generating function be provided with the distribution then?

I think yes. Any other format suggestions?

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
On Thu, Mar 18, 2004 at 07:48:40AM +0100, Karel Zak wrote:
> Why? I said it's pseudocode. It should use the standard fmgr API like
> every other PostgreSQL function [...] It must support arbitrary
> programming languages, not only C.

Well, I have looked over the COPY code, and it will be best to start
with a hardcoded version, but make it modular in the code; if all goes
right, we can think about some interface for defining other formats.
OK?

It's a pity that the main idea of the current COPY is based on separate
lines, and that it is not a more general interface for streaming data
between FE and BE.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
To be honest this idea strikes me as overkill -- over-engineering.
While there is a clear need for proper CSV import (i.e. just setting
DELIMITER to ',' doesn't work due to ','s in strings), I cannot see how
this would prove useful, or who would use it.

While I have done a lot of messing around reading/writing the binary
format (and been stung by changes in that format), if you are using this
format then you're 99% likely to be in control of the incoming/outgoing
data and thus able to format it to your wishes outwith COPY.

Something else in the TODO regarding COPY is XML import/export, and for
this to be supported in your proposed implementation the function would
need to be passed a heap more information.

L.

Karel Zak writes:
> A why not allow to users full control of the format by they own
> function. [...]
>
> The formatting function API can be pretty simple:
>     text *my_copy_format(text *attrdata, int direction,
>                          int nattrs, int attr, oid attrtype, oid relation)
On Thu, Mar 18, 2004 at 09:29:03AM +0000, Lee Kindness wrote:
> To be honest this idea strikes me as overkill - over
> engineering.

It was a suggestion; maybe you're right :-)

> While i have done a lot of messing around reading/writing the binary
> format (and been stung by changes in that format) if you are using
> this format then you're 99% likely to be in control of the
> incoming/outgoing data and thus able to format to your wishes outwith
> COPY.

I partly agree. But... it is possible to write the final file directly
from the backend, without transferring the data to the client. If we
want to support this feature, we need the server to control the output
format.

And... I can imagine a format that is used for BE/FE data transfer only
and not for some final data presentation -- for example, compression of
the data stream from/to the BE without a PostgreSQL protocol change.

> Something else in the TODO regarding COPY is XML import/export, and
> for this to be supported in your proposed implementation the function
> would need to be passed in a heap more information.

Yes, very probably some struct with all the COPY information and
format-specific stuff. Tom was right that in this case it will be C
functions only. As I said, I will try to implement it without a
user-defined function call for the format conversion, but I will do it
in a modular way, and in the future we can create some interface for
user-defined formats.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
Lee Kindness wrote:
> To be honest this idea strikes me as overkill - over
> engineering. While there is a clear need for proper CSV import
> (i.e. just setting DELIMITER to ',' doesn't work due to ','s in
> strings) I cannot see how this would prove useful, or who would use
> it?

I agree. My modest proposal for handling CSVs would be to extend the
DELIMITER parameter to allow up to 3 characters -- separator, quote and
escape. Escape would default to the quote char, and the quote char would
default to unspecified. This would involve no grammar changes and fairly
isolated and small code changes, I believe. In the most common CSV cases
you would just use $$,"$$ or $$,'$$. :-)

COPY is basically line/tuple oriented, and that alone would exclude many
file formats (e.g. imagine wanting to import a spreadsheet where each
worksheet was the table name and the first row on each worksheet was the
field names -- I have seen such beasts more than once). If we want a
general facility for loading and exporting foreign file formats, I
really believe that is the province of a utility program rather than a
database engine function.

The reason in my mind for making CSV a special case is that it is very
easy to do and so often asked for.

(I used to set parsing CSVs as a basic programming exercise -- it is
amazing how many ways people find to get it wrong.)

cheers

andrew
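To make the separator/quote/escape idea concrete, here is a sketch of how COPY OUT might emit one field under the proposed three-character DELIMITER. This is not actual backend code; the function name and the buffer-based interface are invented for illustration, and escaping of a bare escape character when it differs from the quote is left out for brevity:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Write one output field into dst according to a (separator, quote,
 * escape) triple, as in the extended-DELIMITER proposal.  The value is
 * quoted only when it contains the separator, the quote character, or a
 * newline; embedded quote characters are preceded by the escape
 * character.  dst must be large enough (2*strlen(val) + 3 is safe).
 * Returns the number of bytes written, excluding the trailing NUL.
 */
size_t csv_field(char *dst, const char *val, char sep, char quote, char esc)
{
    bool need_quote = strchr(val, sep) != NULL ||
                      strchr(val, quote) != NULL ||
                      strchr(val, '\n') != NULL;
    size_t n = 0;

    if (!need_quote)
    {
        strcpy(dst, val);               /* plain values pass through */
        return strlen(val);
    }

    dst[n++] = quote;
    for (const char *p = val; *p; p++)
    {
        if (*p == quote)
            dst[n++] = esc;             /* "" or \" style escaping */
        dst[n++] = *p;
    }
    dst[n++] = quote;
    dst[n] = '\0';
    return n;
}
```

With a DELIMITER of $$,"$$ (escape defaulting to the quote), `he said "hi"` comes out as `"he said ""hi"""`, while values without special characters are emitted untouched.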
Andrew Dunstan wrote:
> I agree. My modest proposal for handling CSVs would be to extend the
> DELIMITER parameter to allow up to 3 characters - separator, quote and
> escape. Escape would default to the quote char and the quote char would
> default to unspecified. This would involve no grammar changes and fairly
> isolated and small code changes, I believe. In the most common CSV cases
> you would just use $$,"$$ or $$,'$$. :-)
> [...]
> (I used to set parsing CSVs as a basic programming exercise - it is
> amazing how many ways people find to get it wrong).

I like the separator, quote, and escape idea.  It allows variety without
requiring folks to code in C.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Karel Zak <zakkr@zf.jcu.cz> writes:
>> This seems like it could only reasonably be implemented as a C function.
>
> Why? I said it's pseudo code. It should use standard fmgr API like
> every other PostgreSQL function or is it problem and I overlook
> something? It must to support arbitrary programming language and not
> C only.

Sure, but the question is whether the *stuff it has to do* can
reasonably be coded in anything but C.  Why are you passing in a
relation OID, if not for lookups in relcache entries that are simply not
accessible above the C level?  (Don't tell me you want the function to
do a bunch of actual SELECTs from system catalogs for every line of the
copy...)

Passing in a relation OID is probably a bad idea anyway, as it ties this
API to the assumption that COPY is only for complete relations.  There's
been talk before of allowing a SELECT result to be presented via the
COPY protocol, for instance.  What might be a more usable API is

	COPY OUT:
		function formatter_out(text[]) returns text
	COPY IN:
		function formatter_in(text) returns text[]

where the text array is either the results of or the input to the
per-column datatype I/O routines.  This makes it explicit that the
formatter's job is solely to determine the column-level wrapping and
unwrapping of the data.  I'm assuming here that there is no good reason
for the formatter to care about the specific datatypes involved; can you
give a counterexample?

> It's pity that main idea of current COPY is based on separated lines
> and it is not more common interface for streaming data between FE and BE.

Yeah, that was another concern I had.  This API would let the formatter
control line-level layout, but it would not eliminate the hard-wired
significance of newline.  What's worse, there isn't any clean way to
deal with reading quoted newlines --- the formatter can't really replace
the default quoting rules if the low-level code is going to decide
whether a newline is quoted or not.

We could possibly solve that by specifying that the text output or input
(respectively) is the complete line sent to or from the client,
including newline or whatever other line-level formatting you are using.

This still leaves the problem of how the low-level COPY IN code knows
what is a complete line to pass off to the formatter_in routine.  We
could possibly fix this by adding a second input-control routine

	function formatter_linelength(text) returns integer

which is defined to return -1 if the input isn't a complete line yet
(i.e., read some more data, append to the buffer, and try again), or >= 0
to indicate that the first N bytes of the buffer represent a complete
line to be passed off to formatter_in.  I don't see a way to combine
formatter_in and formatter_linelength into a single function without
relying on "out" parameters, which would again confine the feature to
format functions written in C.  It's a tad annoying that we need two
functions for input.

One way that we could still keep the COPY option syntax to be just
	FORMAT csv
is to create an arbitrary difference in the signatures of the input
functions.  Then we could have coexisting functions
	csv(text[]) returns text
	csv(text) returns text[]
	csv(text, ...) returns int
that are referenced by "FORMAT csv".

			regards, tom lane
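For the CSV case specifically, a formatter_linelength implementation could be little more than a quote-aware scan. This is only a sketch in plain C of the contract described above (a real version would be an fmgr function over a text datum, and would need escape handling to match whatever DELIMITER rules are in force):

```c
#include <stdbool.h>

/*
 * Return the number of bytes in the first complete CSV record in buf
 * (including its terminating newline), or -1 if no complete record is
 * present yet -- the formatter_linelength contract.  Newlines inside
 * double-quoted fields are treated as data.  A doubled quote ("") inside
 * a quoted field toggles the state twice, leaving it unchanged, which is
 * exactly the behavior we want.
 */
int csv_line_length(const char *buf, int len)
{
    bool in_quotes = false;
    int i;

    for (i = 0; i < len; i++)
    {
        if (buf[i] == '"')
            in_quotes = !in_quotes;
        else if (buf[i] == '\n' && !in_quotes)
            return i + 1;               /* complete record, keep the \n */
    }
    return -1;                          /* need more input */
}
```

The low-level COPY IN loop would call this on its buffer after each network read, hand the first N bytes to formatter_in when N >= 0, and shift the remainder down.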
Andrew Dunstan <andrew@dunslane.net> writes:
> Lee Kindness wrote:
>> To be honest this idea strikes me as overkill - over
>> engineering.
>
> I agree. My modest proposal for handling CSVs would be to extend the
> DELIMITER parameter to allow up to 3 characters - separator, quote and
> escape. Escape would default to the quote char and the quote char would
> default to unspecified.

I could go with that too.  The question here is do we have any popular
use-cases that aren't solved by that extension, but could be solved by
simple user-level data formatting functions?  I'm not real eager to add
such a feature as an "if we build it they will come" speculation, but if
anyone can point to solid use-cases besides handling CSV, then it
probably is worth doing.

			regards, tom lane
Tom Lane wrote:
> I could go with that too.  The question here is do we have any popular
> use-cases that aren't solved by that extension, but could be solved by
> simple user-level data formatting functions?  I'm not real eager to add
> such a feature as an "if we build it they will come" speculation, but
> if anyone can point to solid use-cases besides handling CSV, then it
> probably is worth doing.

The thing I liked about Andrew's idea is that it even covers escape
quoting for CSV, which might change from implementation to
implementation, and it is flexible without requiring C coding.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I could go with that too.  The question here is do we have any popular
> use-cases that aren't solved by that extension, but could be solved by
> simple user-level data formatting functions?  I'm not real eager to add
> such a feature as an "if we build it they will come" speculation, but
> if anyone can point to solid use-cases besides handling CSV, then it
> probably is worth doing.

(I can't believe I'm saying this, but) It seems like xml output would be
the use case you're looking for.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I could go with that too.  The question here is do we have any popular
>> use-cases that aren't solved by that extension, but could be solved by
>> simple user-level data formatting functions?
>
> (I can't believe I'm saying this, but) It seems like xml output would
> be the use case you're looking for.

Does that fall into the category of stuff that could be supported by the
kind of API we're talking about?  I should think that XML would need a
much more global view of the data, not just line-by-line reformatting.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Does that fall into the category of stuff that could be supported by the
> kind of API we're talking about?  I should think that XML would need a
> much more global view of the data, not just line-by-line reformatting.

I saw it the other way around.  The only reason I saw for writing code
in COPY directly was if it needed a more global view of the data.  Any
line-by-line reformatting could just be done with a perl one-liner.  If
it can't do that, what's the advantage?  Speed?

--
greg
On Thu, Mar 18, 2004 at 09:36:17AM -0500, Andrew Dunstan wrote:
> I agree. My modest proposal for handling CSVs would be to extend the
> DELIMITER parameter to allow up to 3 characters - separator, quote and
> escape. Escape would default to the quote char and the quote char would
> default to unspecified. This would involve no grammar changes and fairly
> isolated and small code changes, I believe. In the most common CSV cases
> you would just use $$,"$$ or $$,'$$. :-)

I have looked at the backend COPY code, and it would be better to clean
up this "party of global static values" before adding anything to it.
(My private opinion, of course.)

The problem with CSV is that it will work correctly with the new
protocol only, because old versions of clients are newline sensitive,
and CSV can contain newlines inside attributes delimited by quotation
marks:

   "John", "Smith", "The White House
   1600 Pennsylvania Avenue NW
   Washington, DC 20500", "male", "open source software office"

That is one record.

It's difficult to say it :-), but your DELIMITER idea is better than my
suggested API. Andrew, go ahead. I thought about some data streaming,
but COPY is probably a bad place for it.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
On Thu, Mar 18, 2004 at 10:16:36AM -0500, Tom Lane wrote:
> Passing in a relation OID is probably a bad idea anyway, as it ties this
> API to the assumption that COPY is only for complete relations.  There's
> been talk before of allowing a SELECT result to be presented via the
> COPY protocol, for instance.  What might be a more usable API is
>
> 	COPY OUT:
> 		function formatter_out(text[]) returns text
> 	COPY IN:
> 		function formatter_in(text) returns text[]
>
> where the text array is either the results of or the input to the
> per-column datatype I/O routines.  [...] I'm assuming here that there
> is no good reason for the formatter to care about the specific
> datatypes involved; can you give a counterexample?

The idea was to pass the maximum information about the tuple to the
formatter; what the formatter does with this information is the
formatter's problem.

>> It's pity that main idea of current COPY is based on separated lines
>> and it is not more common interface for streaming data between FE and BE.
>
> Yeah, that was another concern I had.  This API would let the formatter
> control line-level layout but it would not eliminate the hard-wired
> significance of newline.  What's worse, there isn't any clean way to
> deal with reading quoted newlines --- the formatter can't really replace
> the default quoting rules if the low-level code is going to decide
> whether a newline is quoted or not.

I think the latest protocol version works with blocks of data and not
with lines, and the client's PQputCopyData() sends a block -- only the
docs say that it is a row of the table.

> This still leaves the problem of how the low-level COPY IN code knows
> what is a complete line to pass off to the formatter_in routine.  We
> could possibly fix this by adding a second input-control routine
>
> 	function formatter_linelength(text) returns integer
>
> which is defined to return -1 if the input isn't a complete line yet

But formatter_linelength() will need some context information, I think
-- in other words, some struct with the formatter's internal data. And
for more difficult formats like XML you need some other context data
(parser data) too.

Maybe there can be some global exported struct (like for triggers) that
functions written in C can use. It means that for simple formats like
CSV you can use non-C functions, and for formats like XML you can use C
functions. And if it is interesting for PL developers, they can add
support for access to these structs to their languages.

> (i.e., read some more data, append to the buffer, and try again), or
> >= 0 to indicate that the first N bytes of the buffer represent a
> complete line to be passed off to formatter_in. [...] It's a tad
> annoying that we need two functions for input.  One way that we could
> still keep the COPY option syntax to be just
> 	FORMAT csv
> is to create an arbitrary difference in the signatures of the input
> functions.  Then we could have coexisting functions
> 	csv(text[]) returns text
> 	csv(text) returns text[]
> 	csv(text, ...) returns int
> that are referenced by "FORMAT csv".

It sounds good, but I think neither of us is fully sure about it now,
right? CSV support will probably be better added by the DELIMITER
extension.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
andrew@dunslane.net (Andrew Dunstan) writes:
> Interesting ... The alternative might be an external program to munge
> CSVs and whatever other format people want to support and then call
> the existing COPY - either in bin or contrib. I have seen lots of
> people wanting to import CSVs, and that's even before we get a Windows
> port.

I know Jan Wieck has been working on something like this, with a bit of
further smarts...

- By having a table definition alongside, the table can be created
  concurrently;

- A set of mapping functions can be used, so that if, for instance, the
  program generating the data was Excel, and you have a field with
  values like 37985, 38045, or 38061, they can respectively be mapped to
  '2004-01-01', '2004-03-01', and '2004-03-17';

- It can load whatever data is loadable, and use Ethernet-like backoffs
  when it encounters bad records, so that it loads all the data that is
  good and leaves a bundle of 'crud' that is left over.

He had been prototyping it in Tcl; I'm not sure how far a port to C has
gotten.

It looked pretty neat; it sure seems better to put the "cleverness" in
userspace than to try to increase the complexity of the postmaster...
--
output = ("cbbrowne" "@" "cbbrowne.com")
http://cbbrowne.com/info/linuxxian.html
Have you heard of the new Macsyma processor?  It has three instructions
-- LOAD, STORE, and SKIP IF INTEGRABLE.
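The Excel serial-date mapping in the second bullet above is mechanical enough to sketch. Excel stores dates as day counts from its 1900 epoch; under the commonly used convention that serial 25569 is 1970-01-01 (equivalently, an epoch of 1899-12-30 for serials past February 1900), the conversion is pure integer arithmetic. Note this epoch choice is an assumption on my part -- the exact serial/date pairs quoted in the message may reflect a slightly different convention:

```c
/*
 * Convert days since 1970-01-01 into year/month/day using the standard
 * civil-from-days algorithm (proleptic Gregorian calendar).
 */
static void civil_from_days(long z, int *y, int *m, int *d)
{
    z += 719468;                        /* shift epoch to 0000-03-01 */
    long era = (z >= 0 ? z : z - 146096) / 146097;
    unsigned doe = (unsigned) (z - era * 146097);          /* [0,146096] */
    unsigned yoe = (doe - doe/1460 + doe/36524 - doe/146096) / 365;
    long yr  = (long) yoe + era * 400;
    unsigned doy = doe - (365*yoe + yoe/4 - yoe/100);      /* [0,365] */
    unsigned mp  = (5*doy + 2) / 153;                      /* [0,11]  */

    *d = (int) (doy - (153*mp + 2)/5 + 1);
    *m = (int) (mp < 10 ? mp + 3 : mp - 9);
    *y = (int) (yr + (*m <= 2));
}

/*
 * Map an Excel date serial to y/m/d.  25569 is the Excel serial of
 * 1970-01-01; valid for serials after 1900-02-28 (Excel's fictitious
 * 1900-02-29 makes earlier serials off by one).
 */
void excel_serial_to_date(long serial, int *y, int *m, int *d)
{
    civil_from_days(serial - 25569, y, m, d);
}
```

A mapping function in Jan's loader could wrap something like this to turn a numeric spreadsheet column into a date column on the way in.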
Karel Zak wrote:
> The problem with CSV is that it will work correctly with the new
> protocol only, because old versions of clients are newline sensitive,
> and CSV can contain newlines inside attributes delimited by quotation
> marks:
>
>    "John", "Smith", "The White House
>    1600 Pennsylvania Avenue NW
>    Washington, DC 20500", "male", "open source software office"
>
> That is one record.

(Long Live President Smith!)

I have never seen such a beast, and if I had been asked off the cuff I
would have said that it was probably illegal, except that I know of no
standard actually defining the format. Perhaps others (Joshua Drake or
Josh Berkus?) have wider experience. I think in any case we should
ignore those for now and handle the straightforward case.

I *have* seen monstrosities like fields that do not begin with the quote
character but then break into a quote, e.g.:

   1,2,a,123"abc""def",6,7,8

> It's difficult to say it :-), but your DELIMITER idea is better than my
> suggested API. Andrew, go ahead. I thought about some data streaming,
> but COPY is probably a bad place for it.

Just to be clear, I have no current plans for doing this. I have about 3
pg-related things I am working on, in addition to doing paid work so I
can eat! If I do it, it will not be for quite a while.

There are some wrinkles, though, concerning the interaction of CSV's
notion of escaping and COPY's notion of escaping. If someone wants to
undertake this I can flesh those out in a further email.

Someone was asking the other day about a newbie or student project --
this might be a good one (impact relatively isolated, very easy to test)
although possibly not as sexy as some might like.

cheers

andrew
Karel Zak <zakkr@zf.jcu.cz> writes:
>>> It's a pity that the main idea of current COPY is based on separated
>>> lines and is not a more general interface for streaming data between
>>> FE and BE.
>>
>> Yeah, that was another concern I had.  This API would let the formatter
>> control line-level layout but it would not eliminate the hard-wired
>> significance of newline.  What's worse, there isn't any clean way to
>> deal with reading quoted newlines --- the formatter can't really replace
>> the default quoting rules if the low-level code is going to decide
>> whether a newline is quoted or not.

> I think the latest protocol version works with blocks of data and not
> with lines, and the client's PQputCopyData() sends a block -- only the
> docs say that it is a row of the table.

But you can't assume that the client will send blocks that are
semantically significant.  For instance, if psql is reading a file to
send with \copy, how's it going to know how the file is formatted?
It's just gonna send disk-block-sized messages, and the backend has to
discover the semantic boundaries for itself.

> It sounds good, but I think we are both not fully sure about it now,
> right?  CSV support will probably be better added via the DELIMITER
> extension.

Yeah, without people beating on our door for such a hook, it seems like
Andrew's DELIMITER idea is the best thing to do for now.

			regards, tom lane
Andrew Dunstan wrote:
> Karel Zak wrote:
>> The problem with CSV is that it will correctly work with the new
>> protocol only, because old versions of clients are newline sensitive.
>> And CSV can contain newlines inside quoted attributes:
>>
>> "John", "Smith", "The White House
>> 1600 Pennsylvania Avenue NW
>> Washington, DC 20500", "male", "open source software office"
>>
>> It is one record.
>
> (Long Live President Smith!)
>
> I have never seen such a beast,

Export from a spreadsheet where people have formatted the cell with the
address in it.

Regards,
Fernando
On Fri, Mar 19, 2004 at 09:39:58AM -0500, Tom Lane wrote:
> Karel Zak <zakkr@zf.jcu.cz> writes:
>>>> It's a pity that the main idea of current COPY is based on separated
>>>> lines and is not a more general interface for streaming data between
>>>> FE and BE.
>>>
>>> Yeah, that was another concern I had.  This API would let the formatter
>>> control line-level layout but it would not eliminate the hard-wired
>>> significance of newline.  What's worse, there isn't any clean way to
>>> deal with reading quoted newlines --- the formatter can't really replace
>>> the default quoting rules if the low-level code is going to decide
>>> whether a newline is quoted or not.
>
>> I think the latest protocol version works with blocks of data and not
>> with lines, and the client's PQputCopyData() sends a block -- only the
>> docs say that it is a row of the table.
>
> But you can't assume that the client will send blocks that are
> semantically significant.  For instance, if psql is reading a file to
> send with \copy, how's it going to know how the file is formatted?

And what about \n in attribute data in CSV?  I think the CSV format
doesn't use any escape for the newline char.  It means psql with \copy
cannot be sure with CSV.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
Karel Zak <zakkr@zf.jcu.cz> writes:
> The problem with CSV is that it will correctly work with the new
> protocol only, because old versions of clients are newline sensitive.

Why?  The client-side code doesn't have any real say over the meaning
of the data, at least not in psql-class clients.  I suppose a client
app that tries to interpret the data could get confused, but psql sure
doesn't do that.

			regards, tom lane
Fernando Nasser wrote:
> Andrew Dunstan wrote:
>> Karel Zak wrote:
>>> The problem with CSV is that it will correctly work with the new
>>> protocol only, because old versions of clients are newline sensitive.
>>> And CSV can contain newlines inside quoted attributes:
>>>
>>> "John", "Smith", "The White House
>>> 1600 Pennsylvania Avenue NW
>>> Washington, DC 20500", "male", "open source software office"
>>>
>>> It is one record.
>>
>> (Long Live President Smith!)
>>
>> I have never seen such a beast,
>
> Export from a spreadsheet where people have formatted the cell with
> the address in it.

Well, I just tried with OpenOffice on my RH9 box, and it translated the
line breaks in the cell into spaces in the CSV.  When I replaced them
with line breaks in a text editor and reloaded, it treated them as
separate rows.

I don't have a Windows box handy on which I can test Excel's behaviour.

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> There are some wrinkles, though, concerning the interaction of CSV's
> notion of escaping and COPY's notion of escaping.  If someone wants to
> undertake this I can flesh those out in a further email.

Please do that, so that the info is in the archives in case someone
else wants to tackle the project.

			regards, tom lane
I wrote:
> Fernando Nasser wrote:
>>> [snip re multiline CSV fields]
>>>
>>> I have never seen such a beast,
>>
>> Export from a spreadsheet where people have formatted the cell with
>> the address in it.
>
> Well, I just tried with OpenOffice on my RH9 box, and it translated
> the line breaks in the cell into spaces in the CSV.  When I replaced
> them with line breaks in a text editor and reloaded, it treated them
> as separate rows.
>
> I don't have a Windows box handy on which I can test Excel's behaviour.

Clearly my experience is out of date - I just found a web ref to Excel
doing just this.

Oh, well, it doesn't seem to me an insurmountable problem.

cheers

andrew
Karel Zak <zakkr@zf.jcu.cz> writes:
> And what about \n in attribute data in CSV?  I think the CSV format
> doesn't use any escape for the newline char.  It means psql with \copy
> cannot be sure with CSV.

I think CSV only allows embedded newlines that are either escaped, or
inside quotes.  COPY doesn't currently have the notion of a quote
character, but that was part of Andrew's proposal ...

			regards, tom lane
<quote who="Andrew Dunstan">
> Karel Zak wrote:
>> The problem with CSV is that it will correctly work with the new
>> protocol only, because old versions of clients are newline sensitive.
>> And CSV can contain newlines inside quoted attributes:
>>
>> "John", "Smith", "The White House
>> 1600 Pennsylvania Avenue NW
>> Washington, DC 20500", "male", "open source software office"
>>
>> It is one record.
>
> (Long Live President Smith!)
>
> I have never seen such a beast, and if I had been asked off the cuff
> would have said that it was probably illegal, except that I know of no
> standard actually defining the format.  Perhaps others (Joshua Drake
> or Josh Berkus?) have wider experience.  I think in any case we should
> ignore those for now and handle the straightforward case.
>
> I *have* seen monstrosities like fields that do not begin with the
> quote character but then break into a quote, e.g.:
>
> 1,2,a,123"abc""def",6,7,8

I have dealt with both, frequently.  The CSV format allows you to begin
a block of text with the quote.  Newlines are included in that quoted
space.  If quotes are included in the field, then they are doubled to
denote that they are not the end of the quoted space.

Also, the following is valid:

1,2,,,"",,,""

"" is empty.

1,2,3,"""",,,,"",

The 4 quotes denote a single double quote.

Simple CSV converters that just explode on commas and newlines will
miss these oddities.  Try exporting an Access table with a Memo field
(containing multiple lines) to CSV.
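Thomas's examples can be checked mechanically. The sketch below (an illustration using Python's stdlib csv module, not part of the original mail) shows how a conforming parser reads them:

```python
import csv

# ,, and "" both come back as the empty string; a parser that only
# splits on commas would mishandle the quoted cases.
row = next(csv.reader(['1,2,,,"",,,""']))
print(row)   # ['1', '2', '', '', '', '', '', '']

# """" is a quoted field holding one doubled quote, i.e. a single ".
# The trailing comma adds a final empty field.
row2 = next(csv.reader(['1,2,3,"""",,,,"",']))
print(row2[3], len(row2))
```

One wrinkle relevant to COPY: a default reader gives back the same empty string for `,,` and `""`, so distinguishing NULL from an empty string needs a convention on top of plain CSV parsing.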
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> There are some wrinkles, though, concerning the interaction of CSV's
>> notion of escaping and COPY's notion of escaping.  If someone wants
>> to undertake this I can flesh those out in a further email.
>
> Please do that, so that the info is in the archives in case someone
> else wants to tackle the project.

briefly:

According to my understanding, in a CSV file backslash has no magical
meaning unless it is the escape character, in which case we only expect
to find it prefacing either itself or the quote character inside a
quoted field.  Otherwise, it is just another character.

One way of handling this might be to have 2 modes of COPY processing:

. if a quote delimiter is specified, turn off all of COPY's usual
backslash processing, and make the default NULL marker the empty string

. if no quote delimiter is specified, act as now.

OTOH it might be a good idea to be able to turn off backslash
processing even without a quote delimiter, e.g. in a CSV-like file
using tab as the delimiter and no quote escaping, so maybe another
switch on COPY would be a better way to go.

Another issue I wondered about is how to specify nicely that TAB is the
field delimiter - I hate putting a literal semantic tab in files, and
consider its magical use in places like Makefiles and syslog.conf files
some of the worst decisions in computing ever made ;-).  I'd like a
nicer *visible* way of specifying it, either with \t or ^I maybe.

I'm sure other issues will arise - that's all that's in my head for the
moment :-)

cheers

andrew
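The two modes Andrew sketches roughly correspond to dialect options that existing CSV libraries already expose. As a rough illustration (Python's stdlib csv module standing in for COPY; the sample values are made up), the same fields serialize quite differently depending on whether quotes are doubled or a backslash escape is in force:

```python
import csv
import io

fields = ['a"b', 'c\\d', '']

# Mode 1: quote delimiter in use, embedded quotes escaped by doubling;
# backslash is just another character.
buf1 = io.StringIO()
csv.writer(buf1, doublequote=True, lineterminator='\n').writerow(fields)
out1 = buf1.getvalue()
print(out1)   # "a""b",c\d,

# Mode 2: no quoting at all; backslash acts as the escape character,
# closer to COPY's traditional text-format escaping.
buf2 = io.StringIO()
csv.writer(buf2, quoting=csv.QUOTE_NONE, escapechar='\\',
           lineterminator='\n').writerow(fields)
out2 = buf2.getvalue()
print(out2)
```

In mode 1 the backslash passes through untouched; in mode 2 it is doubled, which is exactly the interaction between the two escaping notions that makes mixing them in one COPY mode awkward.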
Thomas Swan wrote:
>> I have never seen such a beast, and if I had been asked off the cuff
>> would have said that it was probably illegal, except that I know of no
>> standard actually defining the format.  Perhaps others (Joshua Drake
>> or Josh Berkus?) have wider experience.  I think in any case we should
>> ignore those for now and handle the straightforward case.
>>
>> I *have* seen monstrosities like fields that do not begin with the
>> quote character but then break into a quote, e.g.:
>>
>> 1,2,a,123"abc""def",6,7,8
>
> I have dealt with both, frequently.  The CSV format allows you to
> begin a block of text with the quote.  Newlines are included in that
> quoted space.  If quotes are included in the field, then they are
> doubled to denote that they are not part of the quoted space.
>
> Also, the following is valid:
>
> 1,2,,,"",,,""
>
> "" is empty.
>
> 1,2,3,"""",,,,"",
>
> The 4 quotes denote a single double quote.
>
> Simple CSV converters that just explode on commas and newlines will
> miss these oddities.

OK, so for the separator, quote, and escape options:

	separator is ,
	quote is "
	escape is "

so if the quote and escape are the same, then a double denotes a
single?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
> OK, so for the separator, quote, and escape options:
>
> 	separator is ,
> 	quote is "
> 	escape is "
>
> so if the quote and escape are the same, then a double denotes a
> single?

Yes.  i.e. with the above settings

"abc""def" -> abc"def

cheers

andrew
On Fri, Mar 19, 2004 at 09:54:37AM -0500, Tom Lane wrote:
> Karel Zak <zakkr@zf.jcu.cz> writes:
>> The problem with CSV is that it will correctly work with the new
>> protocol only, because old versions of clients are newline sensitive.
>
> Why?  The client-side code doesn't have any real say over the meaning
> of the data, at least not in psql-class clients.  I suppose a client
> app that tries to interpret the data could get confused, but psql sure
> doesn't do that.

In libpq, fe-protocol2.c, pqGetCopyData2() and pqGetline2() cut the
string behind '\n'.

    Karel

--
 Karel Zak  <zakkr@zf.jcu.cz>
 http://home.zf.jcu.cz/~zakkr/
Karel Zak <zakkr@zf.jcu.cz> writes:
> On Fri, Mar 19, 2004 at 09:54:37AM -0500, Tom Lane wrote:
>> Why?  The client-side code doesn't have any real say over the meaning
>> of the data, at least not in psql-class clients.  I suppose a client
>> app that tries to interpret the data could get confused, but psql
>> sure doesn't do that.

> libpq, fe-protocol2.c, pqGetCopyData2() and pqGetline2() cut the
> string behind '\n'.

Sure, but that doesn't invalidate the data stream as a whole; it's just
a bufferload boundary choice that won't be very helpful for clients not
using a newline-based data layout.  In any case, the v2 protocol is
obsolete and needn't limit our thoughts about what to do in future.

			regards, tom lane