Thread: COPY formatting

COPY formatting

From
Karel Zak
Date:
Hi,
in TODO is item: "* Allow dump/load of CSV format". I don't thinkit's clean idea. Why CSV and why not something other?
:-)
 
A why not allow to users full control of the format by they ownfunction. It means something like:COPY tablename [ (
column[, ...] ) ]    TO { 'filename' | STDOUT }    [ [ WITH ]          [ BINARY ]         [ OIDS ]         [ DELIMITER
[AS ] 'delimiter' ]         [ NULL [ AS ] 'null string' ]         [ FORMAT funcname ] ]          ^^^^^^^^^^^^^^^^
                                        The formatting function API can be pretty simple:
 
text *my_copy_format(text *attrdata, int direction,             int nattrs, int attr, oid attrtype, oid relation)
-- it's pseudocode of course, it should be use standard fmgrinterface.It's probably interesting for non-binary COPY
version.Comments?
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: COPY formatting

From
Hans-Jürgen Schönig
Date:
Karel Zak wrote:
>  Hi,
> 
>  in TODO is item: "* Allow dump/load of CSV format". I don't think
>  it's clean idea. Why CSV and why not something other? :-) 
> 
>  A why not allow to users full control of the format by they own
>  function. It means something like:
>  
>  COPY tablename [ ( column [, ...] ) ]
>      TO { 'filename' | STDOUT }
>      [ [ WITH ] 
>           [ BINARY ]
>           [ OIDS ]
>           [ DELIMITER [ AS ] 'delimiter' ]
>           [ NULL [ AS ] 'null string' ]
>           [ FORMAT funcname ] ]
>            ^^^^^^^^^^^^^^^^
>                                                  
>  The formatting function API can be pretty simple:
> 
>  text *my_copy_format(text *attrdata, int direction, 
>              int nattrs, int attr, oid attrtype, oid relation)
> 
>  -- it's pseudocode of course, it should be use standard fmgr
>  interface.
>  
>  It's probably interesting for non-binary COPY version.
>  
>  Comments?
> 
>     Karel
> 


Karel,

This seems to be an excellent idea.
People have already asked for many different formats.
Usually I recommend them to use "psql -c "COPY ..." dbname | awk ...".
Since Windows will be supported soon, it will be hard to pipe data to a 
useful program (awk, sed, ...). Maybe this feature would help a lot in 
this case.
Regards,
    Hans




-- 
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at



Re: COPY formatting

From
Fernando Nasser
Date:
Hans-Jürgen Schönig wrote:
> Karel Zak wrote:
> 
>>  Hi,
>>
>>  in TODO is item: "* Allow dump/load of CSV format". I don't think
>>  it's clean idea. Why CSV and why not something other? :-)
>>  A why not allow to users full control of the format by they own
>>  function. It means something like:
>>  
>>  COPY tablename [ ( column [, ...] ) ]
>>      TO { 'filename' | STDOUT }
>>      [ [ WITH ]           [ BINARY ]
>>           [ OIDS ]
>>           [ DELIMITER [ AS ] 'delimiter' ]
>>           [ NULL [ AS ] 'null string' ]
>>           [ FORMAT funcname ] ]
>>            ^^^^^^^^^^^^^^^^
>>                                                   The formatting 
>> function API can be pretty simple:
>>
>>  text *my_copy_format(text *attrdata, int direction,              int 
>> nattrs, int attr, oid attrtype, oid relation)
>>
>>  -- it's pseudocode of course, it should be use standard fmgr
>>  interface.
>>  
>>  It's probably interesting for non-binary COPY version.
>>  
>>  Comments?
>>
>>     Karel
>>
> 
> 
> Karel,
> 
> This seems to be an excellent idea.
> People have already asked for many different formats.
> Usually I recommend them to use "psql -c "COPY ..." dbname | awk ...".
> Since Windows will be supported soon, it will be hard to pipe data to a 
> useful program (awk, sed, ...). Maybe this feature would help a lot in 
> this case.
> 

Could a CSV-generating function be provided with the distribution then?

Regards,
Fernando


Re: COPY formatting

From
Tom Lane
Date:
Karel Zak <zakkr@zf.jcu.cz> writes:
>  The formatting function API can be pretty simple:
>  text *my_copy_format(text *attrdata, int direction, 
>              int nattrs, int attr, oid attrtype, oid relation)

This seems like it could only reasonably be implemented as a C function.
I can't really imagine the average user of COPY wanting to write C in
preference to, say, an external perl script.  What's the real use-case
for the feature?
        regards, tom lane


Re: COPY formatting

From
Andrew Dunstan
Date:
Karel Zak wrote:

> Hi,
>
> in TODO is item: "* Allow dump/load of CSV format". I don't think
> it's clean idea. Why CSV and why not something other? :-) 
>
> A why not allow to users full control of the format by they own
> function. It means something like:
> 
> COPY tablename [ ( column [, ...] ) ]
>     TO { 'filename' | STDOUT }
>     [ [ WITH ] 
>          [ BINARY ]
>          [ OIDS ]
>          [ DELIMITER [ AS ] 'delimiter' ]
>          [ NULL [ AS ] 'null string' ]
>          [ FORMAT funcname ] ]
>           ^^^^^^^^^^^^^^^^
>                                                 
> The formatting function API can be pretty simple:
>
> text *my_copy_format(text *attrdata, int direction, 
>             int nattrs, int attr, oid attrtype, oid relation)
>
> -- it's pseudocode of course, it should be use standard fmgr
> interface.
> 
> It's probably interesting for non-binary COPY version.
> 
>  
>

Interesting ... The alternative might be an external program to munge 
CSVs and whatever other format people want to support and then call the 
exisiting COPY- either in bin or contrib. I have seen lots of people 
wanting to import CSVs, and that's even before we get a Windows port.

cheers

andrew


Re: COPY formatting

From
Fernando Nasser
Date:
Tom Lane wrote:
> Karel Zak <zakkr@zf.jcu.cz> writes:
> 
>> The formatting function API can be pretty simple:
>> text *my_copy_format(text *attrdata, int direction, 
>>             int nattrs, int attr, oid attrtype, oid relation)
> 
> 
> This seems like it could only reasonably be implemented as a C function.
> I can't really imagine the average user of COPY wanting to write C in
> preference to, say, an external perl script.  What's the real use-case
> for the feature?
> 

That is why I suggested providing a pre-written/pre-compiled/installed 
function for CSV (call it CSV?).  Advanced users could still write their 
own as people can write many other things if they know their ways.


Re: COPY formatting

From
"Joshua D. Drake"
Date:
>
>
>>
>
> That is why I suggested providing a pre-written/pre-compiled/installed 
> function for CSV (call it CSV?).  Advanced users could still write 
> their own as people can write many other things if they know their ways.
>

As someone who just went through a whole truckload of crap getting 
delimited files parsed from MSSQL to PostgreSQL. I believe yes this 
would be  great thing. We ended up using plPython with the CSV module.

Sincerely,

Joshua Drake



> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings



-- 
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL



Re: COPY formatting

From
mike g
Date:
I deal with this daily in a cygwin environment.  I wrote a simple c++
program where I hardcoded the input file name/location and output file
name/location.  I strip the quotation marks out where they are used for
identifying text fields and change the comma's used as CSV's to pipes.

I use a combination of bash scripting to execute the c++ program and
then Perl to execute a stored procedure.  I am new to Perl so I have not
yet considered migrating it all into Perl.

The dos2unix tools in cygwin always messed up the first character of the
first line.

I thought the real issue with the copy function and CSVs was that it did
not like the use of quotations around the fields to identify text
fields.

For a true Windows port handling MS Excel files in their native format
would be a goal I would hope. If your api could handle that then I would
agree with your method.



On Wed, 2004-03-17 at 22:14, Joshua D. Drake wrote:
> >
> >
> >>
> >
> > That is why I suggested providing a pre-written/pre-compiled/installed
> > function for CSV (call it CSV?).  Advanced users could still write
> > their own as people can write many other things if they know their ways.
> >
>
> As someone who just went through a whole truckload of crap getting
> delimited files parsed from MSSQL to PostgreSQL. I believe yes this
> would be  great thing. We ended up using plPython with the CSV module.
>
> Sincerely,
>
> Joshua Drake
>
>
>
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: don't forget to increase your free space map settings
>
>

Re: COPY formatting

From
Karel Zak
Date:
On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
> Karel Zak <zakkr@zf.jcu.cz> writes:
> >  The formatting function API can be pretty simple:
> >  text *my_copy_format(text *attrdata, int direction, 
> >              int nattrs, int attr, oid attrtype, oid relation)
> 
> This seems like it could only reasonably be implemented as a C function.
Why? I said it's pseudo code. It should use standard fmgr API likeevery other PostgreSQL function or is it problem and
Ioverlooksomething? It must to support arbitrary programming language and notC only.
 

> I can't really imagine the average user of COPY wanting to write C in
> preference to, say, an external perl script.  What's the real use-case
> for the feature?
Don't hardcode any format to PostgreSQL, be open for others formats.
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: COPY formatting

From
Karel Zak
Date:
On Wed, Mar 17, 2004 at 10:11:48AM -0500, Fernando Nasser wrote:
> Could a CSV-generating function be provided with the distribution then?
I think yes. Any other format suggestion?
   Karel
-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: COPY formatting

From
Karel Zak
Date:
On Thu, Mar 18, 2004 at 07:48:40AM +0100, Karel Zak wrote:
> On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
> > Karel Zak <zakkr@zf.jcu.cz> writes:
> > >  The formatting function API can be pretty simple:
> > >  text *my_copy_format(text *attrdata, int direction, 
> > >              int nattrs, int attr, oid attrtype, oid relation)
> > 
> > This seems like it could only reasonably be implemented as a C function.
> 
>  Why? I said it's pseudo code. It should use standard fmgr API like
>  every other PostgreSQL function or is it problem and I overlook
>  something? It must to support arbitrary programming language and not
>  C only.
Well, I  look over  the COPY  code and best  will start  with hardcodedversion, but make it modular in code and if all
willright we can thinkabout some interface for others formats definition. OK?
 
It's pity  that main idea of  current COPY is based  on separated linesand it is not more common interface for
streamingdata between FE and BE.
 
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


COPY formatting

From
Lee Kindness
Date:
To be honest this idea strikes me as overkill - over
engineering. While there is a clear need for proper CSV import
(i.e. just setting DELIMITER to ',' doesn't work due to ','s in
strings) I cannot see how this would prove useful, or who would use
it?

While i have done a lot of messing around reading/writing the binary
format (and been stung by changes in that format) if you are using
this format then you're 99% likely to be in control of the
incoming/outgoing data and thus able to format to your wishes outwith
COPY.

Something else in the TODO regarding COPY is XML import/export, and
for this to be supported in your proposed implementation the function
would need to be passed in a heap more information.

L.

Karel Zak writes:> >  Hi,> >  in TODO is item: "* Allow dump/load of CSV format". I don't think>  it's clean idea. Why
CSVand why not something other? :-) > >  A why not allow to users full control of the format by they own>  function. It
meanssomething like:>  >  COPY tablename [ ( column [, ...] ) ]>      TO { 'filename' | STDOUT }>      [ [ WITH ] >
     [ BINARY ]>           [ OIDS ]>           [ DELIMITER [ AS ] 'delimiter' ]>           [ NULL [ AS ] 'null string'
]>          [ FORMAT funcname ] ]>            ^^^^^^^^^^^^^^^^>                                                  >  The
formattingfunction API can be pretty simple:> >  text *my_copy_format(text *attrdata, int direction, >              int
nattrs,int attr, oid attrtype, oid relation)> >  -- it's pseudocode of course, it should be use standard fmgr>
interface.> >  It's probably interesting for non-binary COPY version.>  >  Comments?> >     Karel
 


Re: COPY formatting

From
Karel Zak
Date:
On Thu, Mar 18, 2004 at 09:29:03AM +0000, Lee Kindness wrote:
> To be honest this idea strikes me as overkill - over
> engineering. 
It was suggestion, maybe you're right :-)

> While i have done a lot of messing around reading/writing the binary
> format (and been stung by changes in that format) if you are using
> this format then you're 99% likely to be in control of the
> incoming/outgoing data and thus able to format to your wishes outwith
> COPY.
I partly agree.But.. there  is possible write  directly final file by  backend withoutdata transfer  to client. If we
wantto  support this feature  we needcontrol output format by server...
 
And.. I can image  format that is use for BE/FE  data transfer only andnot for some  final data presentation. For
example compression of datastream from/to BE without PostgreSQL protocol change.
 

> Something else in the TODO regarding COPY is XML import/export, and
> for this to be supported in your proposed implementation the function
> would need to be passed in a heap more information.
Yes, very  probably some  struct with all  COPY information  and formatspecific stuff. Tom  was right that  in this
caseit will  C functionsonly. As I said  I will try implement it without  user defined functioncall for format
conversion, but I will do it modular  and in future wecan create some interface for user defined formats.
 
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: COPY formatting

From
Andrew Dunstan
Date:
Lee Kindness wrote:

>To be honest this idea strikes me as overkill - over
>engineering. While there is a clear need for proper CSV import
>(i.e. just setting DELIMITER to ',' doesn't work due to ','s in
>strings) I cannot see how this would prove useful, or who would use
>it?
>
>  
>
I agree. My modest proposal for handling CSVs would be to extend the 
DELIMITER parameter to allow up to 3 characters - separator, quote and 
escape. Escape would default to the quote char and the quote char would 
default to unspecified. This would involve no grammar changes and fairly 
isolated and small code changes, I believe. In the most common CSV cases 
you would just use $$,"$$ or $$,'$$. :-)

COPY is basically line/tuple oriented, and that alone would exclude many 
file formats (e.g. imagine wanting to import a spreadsheet where each 
worksheet was the table name and the first row on each worksheet was the 
field names - I have seen such beasts more than once). If we want a 
general facility for loading and exporting foreign file formats, I 
really believe that is the province of a utility program rather than a 
database engine function.

The reason in my mind for making CSV a special case is that it is very 
easy to do and so often asked for.

(I used to set parsing CSVs as a basic programming exercise - it is 
amazing how many way people find to get it wrong).

cheers

andrew


Re: COPY formatting

From
Bruce Momjian
Date:
Andrew Dunstan wrote:
> Lee Kindness wrote:
> 
> >To be honest this idea strikes me as overkill - over
> >engineering. While there is a clear need for proper CSV import
> >(i.e. just setting DELIMITER to ',' doesn't work due to ','s in
> >strings) I cannot see how this would prove useful, or who would use
> >it?
> >
> >  
> >
> I agree. My modest proposal for handling CSVs would be to extend the 
> DELIMITER parameter to allow up to 3 characters - separator, quote and 
> escape. Escape would default to the quote char and the quote char would 
> default to unspecified. This would involve no grammar changes and fairly 
> isolated and small code changes, I believe. In the most common CSV cases 
> you would just use $$,"$$ or $$,'$$. :-)
> 
> COPY is basically line/tuple oriented, and that alone would exclude many 
> file formats (e.g. imagine wanting to import a spreadsheet where each 
> worksheet was the table name and the first row on each worksheet was the 
> field names - I have seen such beasts more than once). If we want a 
> general facility for loading and exporting foreign file formats, I 
> really believe that is the province of a utility program rather than a 
> database engine function.
> 
> The reason in my mind for making CSV a special case is that it is very 
> easy to do and so often asked for.
> 
> (I used to set parsing CSVs as a basic programming exercise - it is 
> amazing how many way people find to get it wrong).

I like the separator, quote, and escape idea.  It allows variety without
requiring folks to code in C.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: COPY formatting

From
Tom Lane
Date:
Karel Zak <zakkr@zf.jcu.cz> writes:
>> On Wed, Mar 17, 2004 at 11:02:38AM -0500, Tom Lane wrote:
>>> Karel Zak <zakkr@zf.jcu.cz> writes:
>>>> This seems like it could only reasonably be implemented as a C function.
>> 
>> Why? I said it's pseudo code. It should use standard fmgr API like
>> every other PostgreSQL function or is it problem and I overlook
>> something? It must to support arbitrary programming language and not
>> C only.

Sure, but the question is whether the *stuff it has to do* can
reasonably be coded in anything but C.  Why are you passing in a
relation OID, if not for lookups in relcache entries that are simply
not accessible above the C level?  (Don't tell me you want the function
to do a bunch of actual SELECTs from system catalogs for every line
of the copy...)

Passing in a relation OID is probably a bad idea anyway, as it ties this
API to the assumption that COPY is only for complete relations.  There's
been talk before of allowing a SELECT result to be presented via the
COPY protocol, for instance.  What might be a more usable API is

COPY OUT:    function formatter_out(text[]) returns text
COPY IN:    function formatter_in(text) returns text[]

where the text array is either the results of or the input to the
per-column datatype I/O routines.  This makes it explicit that the
formatter's job is solely to determine the column-level wrapping and
unwrapping of the data.  I'm assuming here that there is no good reason
for the formatter to care about the specific datatypes involved; can you
give a counterexample?

>  It's pity  that main idea of  current COPY is based  on separated lines
>  and it is not more common interface for streaming data between FE and BE.

Yeah, that was another concern I had.  This API would let the formatter
control line-level layout but it would not eliminate the hard-wired
significance of newline.  What's worse, there isn't any clean way to
deal with reading quoted newlines --- the formatter can't really replace
the default quoting rules if the low-level code is going to decide
whether a newline is quoted or not.

We could possibly solve that by specifying that the text output or input
(respectively) is the complete line sent to or from the client,
including newline or whatever other line-level formatting you are using.
This still leaves the problem of how the low-level COPY IN code knows
what is a complete line to pass off to the formatter_in routine.  We
could possibly fix this by adding a second input-control routine
function formatter_linelength(text) returns integer

which is defined to return -1 if the input isn't a complete line yet
(i.e., read some more data, append to the buffer, and try again), or
>= 0 to indicate that the first N bytes of the buffer represent a
complete line to be passed off to formatter_in.  I don't see a way to
combine formatter_in and formatter_linelength into a single function
without relying on "out" parameters, which would again confine the
feature to format functions written in C.

It's a tad annoying that we need two functions for input.  One way that
we could still keep the COPY option syntax to be justFORMAT csv
is to create an arbitrary difference in the signatures of the input
functions.  Then we could have coexisting functionscsv(text[]) returns textcsv(text) returns text[]csv(text, ...)
returnsint
 
that are referenced by "FORMAT csv".
        regards, tom lane


Re: COPY formatting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Lee Kindness wrote:
>> To be honest this idea strikes me as overkill - over
>> engineering.
>> 
> I agree. My modest proposal for handling CSVs would be to extend the 
> DELIMITER parameter to allow up to 3 characters - separator, quote and 
> escape. Escape would default to the quote char and the quote char would 
> default to unspecified.

I could go with that too.  The question here is do we have any popular
use-cases that aren't solved by that extension, but could be solved by
simple user-level data formatting functions?  I'm not real eager to add
such a feature as an "if we build it they will come" speculation, but
if anyone can point to solid use-cases besides handling CSV, then it
probably is worth doing.
        regards, tom lane


Re: COPY formatting

From
Bruce Momjian
Date:
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > Lee Kindness wrote:
> >> To be honest this idea strikes me as overkill - over
> >> engineering.
> >> 
> > I agree. My modest proposal for handling CSVs would be to extend the 
> > DELIMITER parameter to allow up to 3 characters - separator, quote and 
> > escape. Escape would default to the quote char and the quote char would 
> > default to unspecified.
> 
> I could go with that too.  The question here is do we have any popular
> use-cases that aren't solved by that extension, but could be solved by
> simple user-level data formatting functions?  I'm not real eager to add
> such a feature as an "if we build it they will come" speculation, but
> if anyone can point to solid use-cases besides handling CSV, then it
> probably is worth doing.

The thing I liked about Andrew's idea is that it even covers escape
quoting for CVS, which might change from implementation to implentation,
and it is flexible without requiring C coding.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: COPY formatting

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I could go with that too.  The question here is do we have any popular
> use-cases that aren't solved by that extension, but could be solved by
> simple user-level data formatting functions?  I'm not real eager to add
> such a feature as an "if we build it they will come" speculation, but
> if anyone can point to solid use-cases besides handling CSV, then it
> probably is worth doing.

(I can't believe I'm saying this, but) It seems like xml output would be the
use case you're looking for.

-- 
greg



Re: COPY formatting

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I could go with that too.  The question here is do we have any popular
>> use-cases that aren't solved by that extension, but could be solved by
>> simple user-level data formatting functions?

> (I can't believe I'm saying this, but) It seems like xml output would be the
> use case you're looking for.

Does that fall into the category of stuff that could be supported by the
kind of API we're talking about?  I should think that XML would need a
much more global view of the data, not just line-by-line reformatting.
        regards, tom lane


Re: COPY formatting

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Does that fall into the category of stuff that could be supported by the
> kind of API we're talking about?  I should think that XML would need a
> much more global view of the data, not just line-by-line reformatting.

I saw it the other way around. The only reason I saw for writing code in COPY
directly was if it needed a more global view of the data. Any line-by-line
reformatting could just be done with a perl one-liner. 

If it can't do that what's the advantage? Speed?

-- 
greg



Re: COPY formatting

From
Karel Zak
Date:
On Thu, Mar 18, 2004 at 09:36:17AM -0500, Andrew Dunstan wrote:
> Lee Kindness wrote:
> 
> >To be honest this idea strikes me as overkill - over
> >engineering. While there is a clear need for proper CSV import
> >(i.e. just setting DELIMITER to ',' doesn't work due to ','s in
> >strings) I cannot see how this would prove useful, or who would use
> >it?
> >
> > 
> >
> I agree. My modest proposal for handling CSVs would be to extend the 
> DELIMITER parameter to allow up to 3 characters - separator, quote and 
> escape. Escape would default to the quote char and the quote char would 
> default to unspecified. This would involve no grammar changes and fairly 
> isolated and small code changes, I believe. In the most common CSV cases 
> you would just use $$,"$$ or $$,'$$. :-)
I look at backend COPY code and  it will better clean up this "party ofglobal static  values" before add  something to
this code. (My priviteopinion of course).
 
The problem with  CSV is that it will correctly  work with new protocolonly. Because old  versions of  clients are
newline sensitive. And CSVcan contains newline in by quotation marks defined attributes:
 
"John", "Smith", "The White House1600 Pennsylvania Avenue NWWashington, DC 20500", "male", "open source software
office"Itis one record.  It's difficult to say it :-), but your DELIMITER idea is better than mysuggested API. Andrew,
goahead. I thought about  some data streaming,but COPY is probably bad place for it.
 
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: COPY formatting

From
Karel Zak
Date:
On Thu, Mar 18, 2004 at 10:16:36AM -0500, Tom Lane wrote:
> Passing in a relation OID is probably a bad idea anyway, as it ties this
> API to the assumption that COPY is only for complete relations.  There's
> been talk before of allowing a SELECT result to be presented via the
> COPY protocol, for instance.  What might be a more usable API is
> 
> COPY OUT:
>         function formatter_out(text[]) returns text
> COPY IN:
>         function formatter_in(text) returns text[]
> 
> where the text array is either the results of or the input to the
> per-column datatype I/O routines.  This makes it explicit that the
> formatter's job is solely to determine the column-level wrapping and
> unwrapping of the data.  I'm assuming here that there is no good reason
> for the formatter to care about the specific datatypes involved; can you
> give a counterexample?
The idea was put maximum information about tuple to formatter, and whatwill formatter do with this information is a
formatterproblem.
 

> >  It's pity  that main idea of  current COPY is based  on separated lines
> >  and it is not more common interface for streaming data between FE and BE.
> 
> Yeah, that was another concern I had.  This API would let the formatter
> control line-level layout but it would not eliminate the hard-wired
> significance of newline.  What's worse, there isn't any clean way to
> deal with reading quoted newlines --- the formatter can't really replace
> the default quoting rules if the low-level code is going to decide
> whether a newline is quoted or not.
I think latest  protocol version works with blocks of  data and no withlines and client PQputCopyData() returns a block
--only docs says thatit is row of table.
 

> We could possibly solve that by specifying that the text output or input
> (respectively) is the complete line sent to or from the client,
> including newline or whatever other line-level formatting you are using.
> This still leaves the problem of how the low-level COPY IN code knows
> what is a complete line to pass off to the formatter_in routine.  We
> could possibly fix this by adding a second input-control routine
> 
>     function formatter_linelength(text) returns integer
> 
> which is defined to return -1 if the input isn't a complete line yet
But  formatter_linelength()  will  need   some  context  information  Ithink. The others  words some  struct with
formatter specific internaldata. And  for more  difficult formats  like XML  you need  some otherscontext data (parser
data)too.
 
Maybe there can be some global  exported struct (like for triggers) andfunctions that is written in C  can use it. It
meansfor simple formatslike CSV you can  use non-C functions and for formats  like XML you canuse C functions. And  if
itwill intereting for PL  developers they canadd support for access to this structs to their languages.
 

> (i.e., read some more data, append to the buffer, and try again), or
> >= 0 to indicate that the first N bytes of the buffer represent a
> complete line to be passed off to formatter_in.  I don't see a way to
> combine formatter_in and formatter_linelength into a single function
> without relying on "out" parameters, which would again confine the
> feature to format functions written in C.

> It's a tad annoying that we need two functions for input.  One way that
> we could still keep the COPY option syntax to be just
>     FORMAT csv
> is to create an arbitrary difference in the signatures of the input
> functions.  Then we could have coexisting functions
>     csv(text[]) returns text
>     csv(text) returns text[]
>     csv(text, ...) returns int
> that are referenced by "FORMAT csv".
It sounds good, but I think we  both not full sure about it now, right?CSV support will probably better add by
DELIMITERextension.
 
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: COPY formatting

From
Chris Browne
Date:
andrew@dunslane.net (Andrew Dunstan) writes:
> Karel Zak wrote:
>
>> Hi,
>>
>> in TODO is item: "* Allow dump/load of CSV format". I don't think
>> it's clean idea. Why CSV and why not something other? :-)
>>
>> A why not allow to users full control of the format by they own
>> function. It means something like:
>> COPY tablename [ ( column [, ...] ) ]
>>     TO { 'filename' | STDOUT }
>>     [ [ WITH ]          [ BINARY ]
>>          [ OIDS ]
>>          [ DELIMITER [ AS ] 'delimiter' ]
>>          [ NULL [ AS ] 'null string' ]
>>          [ FORMAT funcname ] ]
>>           ^^^^^^^^^^^^^^^^
>>                                                 The formatting
>> function API can be pretty simple:
>>
>> text *my_copy_format(text *attrdata, int direction,             int
>> nattrs, int attr, oid attrtype, oid relation)
>>
>> -- it's pseudocode of course, it should be use standard fmgr
>> interface.
>> It's probably interesting for non-binary COPY version.
>
> Interesting ... The alternative might be an external program to munge
> CSVs and whatever other format people want to support and then call
> the exisiting COPY- either in bin or contrib. I have seen lots of
> people wanting to import CSVs, and that's even before we get a Windows
> port.

I know Jan Wieck has been working on something like this, with a bit
of further smarts...
- By having, alongside, a table definition, the table can be created   concurrently;
- A set of mapping functions can be used, so that if, for instance,  the program generating the data was Excel, and you
havea field with  values like 37985, 38045, or 38061, they can respectively be mapped  to '2004-01-01', '2004-03-01',
and'2004-03-17';
 
- It can load whatever data is loadable, and use Ethernet-like  backoffs when it encounters bad records so that it
loadsall the data  that is good, and leaves a bundle of 'crud' that is left over.
 

He had been prototyping it in Tcl; I'm not sure how far a port to C
has gotten.  It looked pretty neat; it sure seems better to put the
"cleverness" in userspace than to try to increase the complexity of
the postmaster...
-- 
output = ("cbbrowne" "@" "cbbrowne.com")
http://cbbrowne.com/info/linuxxian.html
Have you heard of the new Macsyma processor?  It has three instructions --
LOAD, STORE, and SKIP IF INTEGRABLE.


Re: COPY formatting

From
Andrew Dunstan
Date:

Karel Zak wrote:

> The problem with  CSV is that it will correctly  work with new protocol
> only. Because old  versions of  clients are newline  sensitive. And CSV
> can contains newline in by quotation marks defined attributes:
>
> "John", "Smith", "The White House
> 1600 Pennsylvania Avenue NW
> Washington, DC 20500", "male", "open source software office"
> 
> It is one record. 
>

(Long Live President Smith!)

I have never seen such a beast, and if I had been asked off the cuff 
would have said that it was probably illegal, except that I know of no 
standard actually defining the format. Perhaps others (Joshua Drake or 
Josh Berkus?) have wider experience. I think in any case we should 
ignore those for now and handle the straightforward case.

I *have* seen monstrosities like fields that do not begin with the quote 
character but then break into a quote, e.g.:

1,2,a,123"abc""def",6,7,8

>  
> It's difficult to say it :-), but your DELIMITER idea is better than my
> suggested API. Andrew,  go ahead. I thought about  some data streaming,
> but COPY is probably bad place for it.
> 
>  
>


Just to be clear, I have no current plans for doing this. I have about 3 
pg related things I am working on, in addition to doing paid work so I 
can eat! If I do it it will not be for quite a while.

There are some wrinkles, though, concerning the interaction of CSV's 
notion of escaping and  COPY's notion of escaping. If someone want to 
undertake this I can flesh those out in a further email.

Someone was asking the other day about a newbie or student project - 
this might be a good one (impact relatively isolated, very easy to test) 
although possibly not as sexy as some might like.

cheers

andrew



Re: COPY formatting

From
Tom Lane
Date:
Karel Zak <zakkr@zf.jcu.cz> writes:
>>> It's pity  that main idea of  current COPY is based  on separated lines
>>> and it is not more common interface for streaming data between FE and BE.
>> 
>> Yeah, that was another concern I had.  This API would let the formatter
>> control line-level layout but it would not eliminate the hard-wired
>> significance of newline.  What's worse, there isn't any clean way to
>> deal with reading quoted newlines --- the formatter can't really replace
>> the default quoting rules if the low-level code is going to decide
>> whether a newline is quoted or not.

>  I think latest  protocol version works with blocks of  data and no with
>  lines and client PQputCopyData() returns a block -- only docs says that
>  it is row of table.

But you can't assume that the client will send blocks that are
semantically significant.  For instance, if psql is reading a file to
send with \copy, how's it going to know how the file is formatted?
It's just gonna send disk-block-sized messages, and the backend has to
discover the semantic boundaries for itself.

>  It sounds good, but I think we  both not full sure about it now, right?
>  CSV support will probably better add by DELIMITER extension.

Yeah, without people beating on our door for such a hook, it seems like
Andrew's DELIMITER idea is the best thing to do for now.
        regards, tom lane


Re: COPY formatting

From
Fernando Nasser
Date:
Andrew Dunstan wrote:
> 
> 
> Karel Zak wrote:
> 
>> The problem with  CSV is that it will correctly  work with new protocol
>> only. Because old  versions of  clients are newline  sensitive. And CSV
>> can contains newline in by quotation marks defined attributes:
>>
>> "John", "Smith", "The White House
>> 1600 Pennsylvania Avenue NW
>> Washington, DC 20500", "male", "open source software office"
>>
>> It is one record.
> 
> 
> (Long Live President Smith!)
> 
> I have never seen such a beast,

Export from a spreadsheet where people have formated the cell with the 
address on it.

Regards,
Fernando


Re: COPY formatting

From
Karel Zak
Date:
On Fri, Mar 19, 2004 at 09:39:58AM -0500, Tom Lane wrote:
> Karel Zak <zakkr@zf.jcu.cz> writes:
> >>> It's pity  that main idea of  current COPY is based  on separated lines
> >>> and it is not more common interface for streaming data between FE and BE.
> >> 
> >> Yeah, that was another concern I had.  This API would let the formatter
> >> control line-level layout but it would not eliminate the hard-wired
> >> significance of newline.  What's worse, there isn't any clean way to
> >> deal with reading quoted newlines --- the formatter can't really replace
> >> the default quoting rules if the low-level code is going to decide
> >> whether a newline is quoted or not.
> 
> >  I think latest  protocol version works with blocks of  data and no with
> >  lines and client PQputCopyData() returns a block -- only docs says that
> >  it is row of table.
> 
> But you can't assume that the client will send blocks that are
> semantically significant.  For instance, if psql is reading a file to
> send with \copy, how's it going to know how the file is formatted?
And what  \n in attibutes data  in CSV? I think CSV  format doesn't usesome escape for  newline char. It means psql
with\copy  cannot be surewith CSV.
 
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: COPY formatting

From
Tom Lane
Date:
Karel Zak <zakkr@zf.jcu.cz> writes:
>  The problem with  CSV is that it will correctly  work with new protocol
>  only. Because old  versions of  clients are newline  sensitive.

Why?  The client-side code doesn't have any real say over the meaning of
the data, at least not in psql-class clients.  I suppose a client app
that tries to interpret the data could get confused, but psql sure
doesn't do that.
        regards, tom lane


Re: COPY formatting

From
Andrew Dunstan
Date:
Fernando Nasser wrote:

> Andrew Dunstan wrote:
>
>>
>>
>> Karel Zak wrote:
>>
>>> The problem with CSV is that it will correctly work with new protocol
>>> only. Because old versions of clients are newline sensitive. And CSV
>>> can contains newline in by quotation marks defined attributes:
>>>
>>> "John", "Smith", "The White House
>>> 1600 Pennsylvania Avenue NW
>>> Washington, DC 20500", "male", "open source software office"
>>>
>>> It is one record.
>>
>>
>>
>> (Long Live President Smith!)
>>
>> I have never seen such a beast,
>
>
> Export from a spreadsheet where people have formated the cell with the 
> address on it.
>

Well, I just tried with OpenOffice on my RH9 box, and it translated the 
line breaks in the cell into spaces in the CSV. When I replaced them 
with line breaks in a text editor and reloaded it treated them as 
separate rows.

I don't have a Windows box handy on which I can test Excel's behaviour.

cheers

andrew


Re: COPY formatting

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> There are some wrinkles, though, concerning the interaction of CSV's 
> notion of escaping and  COPY's notion of escaping. If someone want to 
> undertake this I can flesh those out in a further email.

Please do that, so that the info is in the archives in case someone else
wants to tackle the project.
        regards, tom lane


Re: COPY formatting

From
Andrew Dunstan
Date:
I wrote:

> Fernando Nasser wrote:
>
>>> [snip re multiline CSV fields]
>>>
>>> I have never seen such a beast,
>>
>>
>>
>> Export from a spreadsheet where people have formated the cell with 
>> the address on it.
>>
>
> Well, I just tried with OpenOffice on my RH9 box, and it translated 
> the line breaks in the cell into spaces in the CSV. When I replaced 
> them with line breaks in a text editor and reloaded it treated them as 
> separate rows.
>
> I don't have a Windows box handy on which I can test Excel's behaviour.
>

Clearly my experience is out of date - I just found a web ref to Excel 
doing just this. Oh, well, it doesn't seem to me an insurmountable problem.

cheers

andrew


Re: COPY formatting

From
Tom Lane
Date:
Karel Zak <zakkr@zf.jcu.cz> writes:
>  And what  \n in attibutes data  in CSV? I think CSV  format doesn't use
>  some escape for  newline char. It means psql with \copy  cannot be sure
>  with CSV.

I think CSV only allows embedded newlines that are either escaped, or
inside quotes.  COPY doesn't currently have the notion of a quote
character, but that was part of Andrew's proposal ...
        regards, tom lane


Re: COPY formatting

From
"Thomas Swan"
Date:
<quote who="Andrew Dunstan">
>
>
> Karel Zak wrote:
>
>> The problem with  CSV is that it will correctly  work with new protocol
>> only. Because old  versions of  clients are newline  sensitive. And CSV
>> can contains newline in by quotation marks defined attributes:
>>
>> "John", "Smith", "The White House
>> 1600 Pennsylvania Avenue NW
>> Washington, DC 20500", "male", "open source software office"
>>
>> It is one record.
>>
>
> (Long Live President Smith!)
>
> I have never seen such a beast, and if I had been asked off the cuff
> would have said that it was probably illegal, except that I know of no
> standard actually defining the format. Perhaps others (Joshua Drake or
> Josh Berkus?) have wider experience. I think in any case we should
> ignore those for now and handle the straightforward case.
>
> I *have* seen monstrosities like fields that do not begin with the quote
> character but then break into a quote, e.g.:
>
> 1,2,a,123"abc""def",6,7,8
>

I have dealt with both, frequently.   The CSV format allows you to begin a
block of text with the quote.  Newlines are included in that quoted space. If qoutes are included in the field then the
quotesare double quotes to
 
denote they are not part of the quoted space.

Also, the following is valid.

1,2,,,"",,,""

"" is empty.

1,2,3,"""",,,,"",

The 4 quotes denote a single double quote.

Writing simple CSV converts that just explode on commas and newlines miss
these oddities.

Try exporting an Access table with a Memo field (containg multiple lines)
to CSV.




Re: COPY formatting

From
Andrew Dunstan
Date:
Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>  
>
>>There are some wrinkles, though, concerning the interaction of CSV's 
>>notion of escaping and  COPY's notion of escaping. If someone want to 
>>undertake this I can flesh those out in a further email.
>>    
>>
>
>Please do that, so that the info is in the archives in case someone else
>wants to tackle the project.
>
>  
>

briefly:

According to my understanding, in a CSV file backslash has no magical 
meaning unless it is the escape character, in which case we only expect 
to find it prefacing either itself or the quote character inside a 
quoted field. Otherwise, it is just another character.

One way of handling this might be to have 2 modes of COPY processing:
. if a quote delimiter is specified turn off all of COPY's usual 
backslash processing, and make the default NULL marker the empty string
. if no quote delimiter is specified, act as now.

OTOH it might be a good idea to be able to turn off backslash processing 
even without a quote delimiter, e.g. in a CSV-like file using tab as the 
delimiter and no quote escaping, so maybe another switch on COPY would 
be a better way to go.

Another issue I wondered about is how to specify nicely that TAB is the 
field delimiter - I hate putting a literal semantic tab in files, and 
consider its magical use in places like Makefiles and syslog.conf files 
some of the worst decisions in computing ever made ;-). I'd like a nicer 
*visible* way of specifying it, either with \t or ^I maybe.

I'm sure other issues will arise - that's all that's in my head for the 
moment :-)

cheers

andrew


Re: COPY formatting

From
Bruce Momjian
Date:
Thomas Swan wrote:
> > I have never seen such a beast, and if I had been asked off the cuff
> > would have said that it was probably illegal, except that I know of no
> > standard actually defining the format. Perhaps others (Joshua Drake or
> > Josh Berkus?) have wider experience. I think in any case we should
> > ignore those for now and handle the straightforward case.
> >
> > I *have* seen monstrosities like fields that do not begin with the quote
> > character but then break into a quote, e.g.:
> >
> > 1,2,a,123"abc""def",6,7,8
> >
> 
> I have dealt with both, frequently.   The CSV format allows you to begin a
> block of text with the quote.  Newlines are included in that quoted space.
>   If qoutes are included in the field then the quotes are double quotes to
> denote they are not part of the quoted space.
> 
> Also, the following is valid.
> 
> 1,2,,,"",,,""
> 
> "" is empty.
> 
> 1,2,3,"""",,,,"",
> 
> The 4 quotes denote a single double quote.
> 
> Writing simple CSV converts that just explode on commas and newlines miss
> these oddities.

OK, so for the separator, quote, and escape options:
separator is ,quote is "escape is "

so if the quote and escape are the same, then a double denotes a
single?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: COPY formatting

From
Andrew Dunstan
Date:
Bruce Momjian wrote:

>OK, so for the separator, quote, and escape options:
>
>    separator is ,
>    quote is "
>    escape is "
>
>so if the quote and escape are the same, then a double denotes a
>single?
>
>  
>

Yes. i.e. with the above settings "abc""def" -> abc"def

cheers

andrew



Re: COPY formatting

From
Karel Zak
Date:
On Fri, Mar 19, 2004 at 09:54:37AM -0500, Tom Lane wrote:
> Karel Zak <zakkr@zf.jcu.cz> writes:
> >  The problem with  CSV is that it will correctly  work with new protocol
> >  only. Because old  versions of  clients are newline  sensitive.
> 
> Why?  The client-side code doesn't have any real say over the meaning of
> the data, at least not in psql-class clients.  I suppose a client app
> that tries to interpret the data could get confused, but psql sure
> doesn't do that.
libpq, fe-protocol2.c, pqGetCopyData2() and pqGetline2() cut stringbehind '\n'.
   Karel

-- Karel Zak  <zakkr@zf.jcu.cz>http://home.zf.jcu.cz/~zakkr/


Re: COPY formatting

From
Tom Lane
Date:
Karel Zak <zakkr@zf.jcu.cz> writes:
> On Fri, Mar 19, 2004 at 09:54:37AM -0500, Tom Lane wrote:
>> Why?  The client-side code doesn't have any real say over the meaning of
>> the data, at least not in psql-class clients.  I suppose a client app
>> that tries to interpret the data could get confused, but psql sure
>> doesn't do that.

>  libpq, fe-protocol2.c, pqGetCopyData2() and pqGetline2() cut string
>  behind '\n'.

Sure, but that doesn't invalidate the data stream as a whole, it's just
a bufferload boundary choice that won't be very helpful for clients not
using a newline-based data layout.  In any case, v2 protocol is
obsolete and needn't limit our thoughts about what to do in future.
        regards, tom lane