Thread: Ragged CSV import

Ragged CSV import

From

Andrew Dunstan

Date:

09 September 2009, 17:00:15

I have received a requirement for the ability to import ragged CSV 
files, i.e. files that contain variable numbers of columns per row. The 
requirement is that extra columns would be ignored and missing columns 
filled with NULL. The client wanting this has wrestled with some 
preprocessors to try to get what they want, but they would feel happier 
with this built in. This isn't the first time I have received this 
request since we implemented CSV import. People have complained on 
numerous occasions about the strictness of the import routines w.r.t. 
the number of columns.

I don't think that providing a CSV import mode for reading these files 
will be very difficult or invasive. If it's not acceptable, I will 
provide it to the client as a patch, and I will be retrofitting it to 
8.4 anyway. But I think it's not an unreasonable request.

Thoughts?

cheers

andrew

Re: Ragged CSV import

From

Tom Lane

Date:

09 September 2009, 17:14:43

Andrew Dunstan <andrew@dunslane.net> writes:
> I have received a requirement for the ability to import ragged CSV 
> files, i.e. files that contain variable numbers of columns per row. The 
> requirement is that extra columns would be ignored and missing columns 
> filled with NULL. The client wanting this has wrestled with some 
> preprocessors to try to get what they want, but they would feel happier 
> with this built in. This isn't the first time I have received this 
> request since we implemented CSV import. People have complained on 
> numerous occasions about the strictness of the import routines w.r.t. 
> the number of columns.

Hmm.  Accepting too few columns and filling with nulls isn't any
different than what INSERT has always done.  But ignoring extra columns
seems like a different ballgame.  Can you talk your client out of that
one?  It just seems like a bad idea.

As for the "numerous occasions", maybe I've not been paying attention,
but I don't recall any ...
        regards, tom lane

Re: Ragged CSV import

From

Alvaro Herrera

Date:

09 September 2009, 17:25:23

Tom Lane wrote:

> As for the "numerous occasions", maybe I've not been paying attention,
> but I don't recall any ...

I don't know about numerous, but I've seen it in the spanish list; for
example:

http://archives.postgresql.org/pgsql-es-ayuda/2007-03/msg00901.php

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Ragged CSV import

From

Tom Lane

Date:

09 September 2009, 17:29:05

Andrew Dunstan <andrew@dunslane.net> writes:
>> I have received a requirement for the ability to import ragged CSV 
>> files, i.e. files that contain variable numbers of columns per row.

BTW, one other thought about this: I think the historical reason for
COPY being strict about the number of incoming columns was that it
provided a useful cross-check that the parsing hadn't gone off into
the weeds.  We have certainly seen enough examples where the reported
manifestation of, say, an escaping mistake was that COPY saw the row
as having too many or too few columns.  So being permissive about it
would lose some error detection capability.  I am not clear about
whether CSV format is sufficiently more robust than the traditional
COPY format to render this an acceptable loss.  Comments?

(One possible answer to this is to make the behavior optional, though
surely COPY has got too many options already :-()
        regards, tom lane

Re: Ragged CSV import

From

Alvaro Herrera

Date:

09 September 2009, 17:34:44

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> >> I have received a requirement for the ability to import ragged CSV 
> >> files, i.e. files that contain variable numbers of columns per row.
> 
> BTW, one other thought about this: I think the historical reason for
> COPY being strict about the number of incoming columns was that it
> provided a useful cross-check that the parsing hadn't gone off into
> the weeds.  We have certainly seen enough examples where the reported
> manifestation of, say, an escaping mistake was that COPY saw the row
> as having too many or too few columns.  So being permissive about it
> would lose some error detection capability.  I am not clear about
> whether CSV format is sufficiently more robust than the traditional
> COPY format to render this an acceptable loss.  Comments?

I think accepting less columns and filling with nulls should be
protected enough for this not to be a problem; if the parser goes nuts,
it will die eventually.  Silently dropping excessive trailing columns
does not seem acceptable though; you could lose entire rows and not
notice.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Ragged CSV import

From

Josh Berkus

Date:

09 September 2009, 17:37:49

On 9/9/09 12:59 PM, Andrew Dunstan wrote:
> 
> I have received a requirement for the ability to import ragged CSV
> files, i.e. files that contain variable numbers of columns per row. The
> requirement is that extra columns would be ignored and missing columns
> filled with NULL. The client wanting this has wrestled with some
> preprocessors to try to get what they want, but they would feel happier
> with this built in. This isn't the first time I have received this
> request since we implemented CSV import. People have complained on
> numerous occasions about the strictness of the import routines w.r.t.
> the number of columns.

Would this just work on columns on the end, or would it work on the
basis of parsing the CSV header and matching columns?

While the former functionality would be relatively simple, I think the
latter is what people really want.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

Re: Ragged CSV import

From

Andrew Dunstan

Date:

09 September 2009, 17:47:11

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> I have received a requirement for the ability to import ragged CSV 
>> files, i.e. files that contain variable numbers of columns per row. The 
>> requirement is that extra columns would be ignored and missing columns 
>> filled with NULL. The client wanting this has wrestled with some 
>> preprocessors to try to get what they want, but they would feel happier 
>> with this built in. This isn't the first time I have received this 
>> request since we implemented CSV import. People have complained on 
>> numerous occasions about the strictness of the import routines w.r.t. 
>> the number of columns.
>>     
>
> Hmm.  Accepting too few columns and filling with nulls isn't any
> different than what INSERT has always done.  But ignoring extra columns
> seems like a different ballgame.  Can you talk your client out of that
> one?  It just seems like a bad idea.
>   

No, that's critical. The application this is wanted for uploads data 
that users put in spreadsheets. The users apparently expect that they 
will be able to put comments on some rows off to the right of the data 
they want loaded, and have it ignored.

To answer your other point made later, my intention was to make this 
optional behaviour, not default behaviour.  I agree that it would be too 
slack for default behaviour. Yes, we have quite a few options, but 
that's not surprising in dealing with a format that is at best 
ill-defined and which we do not control.

> As for the "numerous occasions", maybe I've not been paying attention,
> but I don't recall any ...
>
>   

The requests have been made on IRC, at conferences, in private emails.

cheers

andrew

Re: Ragged CSV import

From

"Kevin Grittner"

Date:

09 September 2009, 17:48:10

Josh Berkus <josh@agliodbs.com> wrote:
> Would this just work on columns on the end, or would it work on the
> basis of parsing the CSV header and matching columns?
> 
> While the former functionality would be relatively simple, I think
the
> latter is what people really want.
It's been a while since I've had a need for something like this, but
of
the copy features not currently available in PostgreSQL, the two most
useful are to read in only some of the defined columns, and to output
to
a separate disk file any rows which failed to match the expected
format.
The latter would not cause the copy to fail unless the count of such
rows exceeded a user-specified threshold.
-Kevin

Re: Ragged CSV import

From

Andrew Dunstan

Date:

09 September 2009, 17:50:23


Josh Berkus wrote:
> On 9/9/09 12:59 PM, Andrew Dunstan wrote:
>   
>> I have received a requirement for the ability to import ragged CSV
>> files, i.e. files that contain variable numbers of columns per row. The
>> requirement is that extra columns would be ignored and missing columns
>> filled with NULL. The client wanting this has wrestled with some
>> preprocessors to try to get what they want, but they would feel happier
>> with this built in. This isn't the first time I have received this
>> request since we implemented CSV import. People have complained on
>> numerous occasions about the strictness of the import routines w.r.t.
>> the number of columns.
>>     
>
> Would this just work on columns on the end, or would it work on the
> basis of parsing the CSV header and matching columns?
>
> While the former functionality would be relatively simple, I think the
> latter is what people really want.
>
>   

No, it's the former, and that's exactly what the client requested. I'm 
not talking about parsing the header line - that's a much larger can of 
worms.

cheers

andrew

Re: Ragged CSV import

From

Hannu Krosing

Date:

09 September 2009, 17:56:22

On Wed, 2009-09-09 at 16:34 -0400, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Andrew Dunstan <andrew@dunslane.net> writes:
> > >> I have received a requirement for the ability to import ragged CSV 
> > >> files, i.e. files that contain variable numbers of columns per row.
> > 
> > BTW, one other thought about this: I think the historical reason for
> > COPY being strict about the number of incoming columns was that it
> > provided a useful cross-check that the parsing hadn't gone off into
> > the weeds.  We have certainly seen enough examples where the reported
> > manifestation of, say, an escaping mistake was that COPY saw the row
> > as having too many or too few columns.  So being permissive about it
> > would lose some error detection capability.  I am not clear about
> > whether CSV format is sufficiently more robust than the traditional
> > COPY format to render this an acceptable loss.  Comments?
> 
> I think accepting less columns and filling with nulls should be
> protected enough for this not to be a problem; if the parser goes nuts,
> it will die eventually.  Silently dropping excessive trailing columns
> does not seem acceptable though; you could lose entire rows and not
> notice.

Maybe we could put a catch-all "text" or even "text[]" column at as the
last one of the table and gather all extra columns there ?

> -- 
> Alvaro Herrera                                http://www.CommandPrompt.com/
> The PostgreSQL Company - Command Prompt, Inc.

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training

Re: Ragged CSV import

From

"Kevin Grittner"

Date:

09 September 2009, 18:03:40

"Dann Corbit" <DCorbit@connx.com> wrote: 
> Kevin Grittner
>> It's been a while since I've had a need for something like this,
>> but of the copy features not currently available in PostgreSQL,
>> the two most useful are to read in only some of the defined
>> columns, and to output to a separate disk file any rows which
>> failed to match the expected format.
>> The latter would not cause the copy to fail unless the count of
>> such rows exceeded a user-specified threshold.
> Perhaps something like SQL Server's BCP format files could be
> used.
> http://support.microsoft.com/kb/67409
> http://technet.microsoft.com/en-us/library/ms178129.aspx
> http://www.nigelrivett.net/SQLTsql/BCP_quoted_CSV_Format_file.html
> http://doc.ddart.net/mssql/sql70/impt_bcp_16.htm
That is what I was thinking of when I wrote the above, although in
the Sybase incarnation, before Microsoft split off on their own.  (I
see they haven't changed sybchar from the Sybase name yet....)
My reservation about referencing it is that it includes so many
bells and whistles that it's not as easy to use as it might be, even
with the "wizard" to generate the format description files.  The
other problem with it is that it was far and away the *least* stable
part of the DBMS.  You could count on it being broken in any version
until six months to a year into that version's life.  We eventually
moved away from it because of the instability -- we could write code
from scratch each time with more confidence of correct behavior.  I
think keeping it a little simpler might help with keeping it stable.
-Kevin

Re: Ragged CSV import

From

Andrew Dunstan

Date:

09 September 2009, 18:13:58


Dann Corbit wrote:
> Perhaps something like SQL Server's BCP format files could be used.
> http://support.microsoft.com/kb/67409
> http://technet.microsoft.com/en-us/library/ms178129.aspx
> http://www.nigelrivett.net/SQLTsql/BCP_quoted_CSV_Format_file.html
> http://doc.ddart.net/mssql/sql70/impt_bcp_16.htm
>   


This looks vastly more complex than anything I am interested in doing.

cheers

andrew

Re: Ragged CSV import

From

Robert Haas

Date:

09 September 2009, 20:18:25

On Wed, Sep 9, 2009 at 4:13 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I have received a requirement for the ability to import ragged CSV
>> files, i.e. files that contain variable numbers of columns per row. The
>> requirement is that extra columns would be ignored and missing columns
>> filled with NULL. The client wanting this has wrestled with some
>> preprocessors to try to get what they want, but they would feel happier
>> with this built in. This isn't the first time I have received this
>> request since we implemented CSV import. People have complained on
>> numerous occasions about the strictness of the import routines w.r.t.
>> the number of columns.
>
> Hmm.  Accepting too few columns and filling with nulls isn't any
> different than what INSERT has always done.  But ignoring extra columns
> seems like a different ballgame.  Can you talk your client out of that
> one?  It just seems like a bad idea.

I agree that ignoring extra columns is a bad idea, but I don't even
like the idea of ignoring missing columns.  It doesn't seem like a
good idea to take a spreadsheet and feed it into COPY without doing
any validation anyway, and this is the kind of thing that is trivial
to clean up with a thin layer of Perl or your scripting language of
choice.

...Robert

Re: Ragged CSV import

From

Andrew Dunstan

Date:

09 September 2009, 20:41:46

Robert Haas wrote:
> I agree that ignoring extra columns is a bad idea, but I don't even
> like the idea of ignoring missing columns.  It doesn't seem like a
> good idea to take a spreadsheet and feed it into COPY without doing
> any validation anyway, and this is the kind of thing that is trivial
> to clean up with a thin layer of Perl or your scripting language of
> choice.
>
>   

If it's an optional feature then I don't see why there is a problem. 
What skin is it off anyone else's nose but those whose choose this 
behaviour?

I am perfectly familiar with Perl and so is the client that requested 
this feature. They are quite technically savvy. They are using a 
scripting solution now but they find it cumbersome.

As for general validation, the requestor's application in fact loads the 
spreadsheet into a temp table of text columns and then subjects it to a 
large variety of complex business rule checking before adding the data 
to the main tables. It is a whole lot faster and cleaner to do it that 
way than before the data is loaded at all. That's why they aren't 
concerned about missing columns.

cheers

andrew

Re: Ragged CSV import

From

Robert Haas

Date:

09 September 2009, 21:43:27

On Wed, Sep 9, 2009 at 7:41 PM, Andrew Dunstan<andrew@dunslane.net> wrote:
>
>
> Robert Haas wrote:
>>
>> I agree that ignoring extra columns is a bad idea, but I don't even
>> like the idea of ignoring missing columns.  It doesn't seem like a
>> good idea to take a spreadsheet and feed it into COPY without doing
>> any validation anyway, and this is the kind of thing that is trivial
>> to clean up with a thin layer of Perl or your scripting language of
>> choice.
>>
>>
>
> If it's an optional feature then I don't see why there is a problem. What
> skin is it off anyone else's nose but those whose choose this behaviour?

I have to admit I'm usually an advocate of that way of thinking, so
maybe I am all wet.  I suppose it comes down to whether you think this
particular feature is something with broad applicability, or whether
there are 100 other equally plausible features.

I wonder whether it would be appropriate to do something like
implement a method by which copy could return text[] and then one
could write wrappers around that functionality to do this as well as
other things.  For example, suppose you wanted to have rows of the
form:

A,B,C,X1,Y1,X2,Y2,X3,Y3

...which gets transformed into an insert of (A,B,C) into a main table
and (A,X1,Y1), (A,X2,Y2), (A,X3,Y3) into a side table.  (I have actual
knowledge of a widely-deployed system produced by a large company that
outputs data in a format similar to this, though the actual format is
considerably more complex.)

...Robert

Re: Ragged CSV import

From

Andrew Dunstan

Date:

09 September 2009, 23:40:50

Robert Haas wrote:
> I wonder whether it would be appropriate to do something like
> implement a method by which copy could return text[] and then one
> could write wrappers around that functionality to do this as well as
> other things.  
>   

It's not a function ... writing a copy() function returning text[] (or 
setof text[], which would be better) in plperlu would be trivial. But it 
couldn't read from stdin or be able to be fed data from a client in 
copy-in mode.

And even if we have it what does the sql look like to insert the values 
into some table with n columns?

Anything along these lines in C is likely to be far larger than what I 
had in mind, which was a fairly minor enhancement.

cheers

andrew

Re: Ragged CSV import

From

Robert Haas

Date:

09 September 2009, 23:55:24

On Wed, Sep 9, 2009 at 10:40 PM, Andrew Dunstan<andrew@dunslane.net> wrote:
> Robert Haas wrote:
>>
>> I wonder whether it would be appropriate to do something like
>> implement a method by which copy could return text[] and then one
>> could write wrappers around that functionality to do this as well as
>> other things.
>
> It's not a function ... writing a copy() function returning text[] (or setof
> text[], which would be better) in plperlu would be trivial. But it couldn't
> read from stdin or be able to be fed data from a client in copy-in mode.

Yes, setof text[] is what I meant.  Making it be able to read from
stdin or be fed data from a client in copy-in mode would be the whole
point of the feature, I suppose.

> And even if we have it what does the sql look like to insert the values into
> some table with n columns?

INSERT INTO table
SELECT t[1], t[2], t[3], t[4], ..., t[n]
FROM (...however we get the copy results...)

...although I'm not entirely sure that's compatible with the
client/server protocol.

> Anything along these lines in C is likely to be far larger than what I had
> in mind, which was a fairly minor enhancement.

It wouldn't be the first time that someone was told that a particular
enhancement didn't have enough value and that they had to do something
larger if they wanted it in core, but on the other hand, I am only
expressing my opinion, which is not binding on you or anyone else, nor
even a fully-fleshed-out proposal.  All I'm saying is that it seems to
me that there is value in being able to use the CSV (or other) parsing
code, but have some way to make modifications to how/where the data is
actually inserted.  However, I'm skeptical about whether the specific
thing you want to do after parsing (namely, drop excess columns,
null-fill missing ones) is sufficiently common to warrant a feature to
do only that.  YMMV, of course.

...Robert

Re: Ragged CSV import

From

Tom Lane

Date:

10 September 2009, 00:01:26

Andrew Dunstan <andrew@dunslane.net> writes:
> Robert Haas wrote:
>> I wonder whether it would be appropriate to do something like
>> implement a method by which copy could return text[] and then one
>> could write wrappers around that functionality to do this as well as
>> other things.  

> Anything along these lines in C is likely to be far larger than what I 
> had in mind, which was a fairly minor enhancement.

I think Robert's got a point though.  What you're talking about is a
fairly specialized single-purpose feature, which nonetheless is going to
require a lot of infrastructure (for example, teaching psql's \copy
about it).  Perhaps, for approximately the same amount of overhead,
we could provide something more general.

I don't agree with the idea of injecting something "behind" copy though.
The thought that comes to mind for me is something "in front of" copy,
that is, give it the text of each line and let it do a text-to-text
transformation before COPY chews on it.

Any of this is getting into territory we had previously agreed not to
let COPY venture into, ie general purpose data transformation.  I'm not
sure I want to cross that bridge and only get "ignore extra columns" out
of it.
        regards, tom lane

Re: Ragged CSV import

From

Andrew Dunstan

Date:

10 September 2009, 00:13:56

Robert Haas wrote:
> However, I'm skeptical about whether the specific
> thing you want to do after parsing (namely, drop excess columns,
> null-fill missing ones) is sufficiently common to warrant a feature to
> do only that.  YMMV, of course.
>
>
>   

So might my experience. I can tell you that I have been asked about this 
dozens of times since implementing CSV import, in various forums. From 
day one people have wanted a way of handling CSVs which were not 
strictly regular. Pretty much every time the request has been something 
fairly close to this.

As for the feature not being large enough, we have a regrettable 
(IMNSHO) tendency in this project to take simple proposals and embelish 
them with many bells and whistles. Personally, I'm much more of a fan of 
incremental development.

cheers

andrew

Re: Ragged CSV import

From

Robert Haas

Date:

10 September 2009, 00:19:20

On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Robert Haas wrote:
>>> I wonder whether it would be appropriate to do something like
>>> implement a method by which copy could return text[] and then one
>>> could write wrappers around that functionality to do this as well as
>>> other things.
>
>> Anything along these lines in C is likely to be far larger than what I
>> had in mind, which was a fairly minor enhancement.
>
> I think Robert's got a point though.  What you're talking about is a
> fairly specialized single-purpose feature, which nonetheless is going to
> require a lot of infrastructure (for example, teaching psql's \copy
> about it).  Perhaps, for approximately the same amount of overhead,
> we could provide something more general.
>
> I don't agree with the idea of injecting something "behind" copy though.
> The thought that comes to mind for me is something "in front of" copy,
> that is, give it the text of each line and let it do a text-to-text
> transformation before COPY chews on it.

That seems to me to be a whole lot less useful.  As I see it, the
whole point of any enhancement in this area is to reuse the parsing
code.   If I have a CSV file (or some other format COPY understands),
I don't want to have to write my own parser for that format in order
to do some simple data transformation (like dropping columns >n).

Previous agreements nonwithstanding, I think letting COPY do general
transformations is a great idea.  But I'm a lot more skeptical about
one specific transformation without some framework of which that case
is the first instance.

...Robert

Re: Ragged CSV import

From

Andrew Dunstan

Date:

10 September 2009, 00:27:46


Tom Lane wrote:
> What you're talking about is a
> fairly specialized single-purpose feature, which nonetheless is going to
> require a lot of infrastructure (for example, teaching psql's \copy
> about it).

Well, that's hardly a lot.


>  Perhaps, for approximately the same amount of overhead,
> we could provide something more general.
>   


Ok.

> I don't agree with the idea of injecting something "behind" copy though.
> The thought that comes to mind for me is something "in front of" copy,
> that is, give it the text of each line and let it do a text-to-text
> transformation before COPY chews on it.
>   


you mean some sort of filter mechanism?

COPY .... FILTER function_name ( args) ... ?

That might work. Then we could provide a couple builtin and people could 
write others in C or PL/dujour.

cheers

andrew

Re: Ragged CSV import

From

Tom Lane

Date:

10 September 2009, 00:36:14

Andrew Dunstan <andrew@dunslane.net> writes:
> you mean some sort of filter mechanism?

> COPY .... FILTER function_name ( args) ... ?

> That might work. Then we could provide a couple builtin and people could 
> write others in C or PL/dujour.

Yeah, that's pretty much what I was thinking, although exactly where the
filter happens and what its arguments/results are is open to debate.
        regards, tom lane

Re: Ragged CSV import

From

Tom Lane

Date:

10 September 2009, 00:52:46

Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>> The thought that comes to mind for me is something "in front of" copy,
>> that is, give it the text of each line and let it do a text-to-text
>> transformation before COPY chews on it.

> That seems to me to be a whole lot less useful.  As I see it, the
> whole point of any enhancement in this area is to reuse the parsing
> code.   If I have a CSV file (or some other format COPY understands),
> I don't want to have to write my own parser for that format in order
> to do some simple data transformation (like dropping columns >n).

I'm unconvinced.  A significant fraction of the COPY customization
requests that I can remember had to do with cleaning up input that
didn't entirely match COPY's idea of what valid CSV is.  If it's close
enough that COPY will successfully parse it as N columns, you can stick
it into an N-column text table and do your transformations afterwards.

Thinking a bit more generally, it seems to me that as COPY IN is
currently constituted, there are really four distinguishable bits
of functionality in a pipeline:

1. Reading from FE/BE protocol (or file) and conversion to database
character set.

2. Parsing the text into a set of de-escaped field values.

3. Converting the field values to internal Datum form (requires
knowledge of a set of target data types).

4. Bulk insertion into the target table.

It is not hard to come up with problems that are most easily solved
by injecting some sort of filtering between any pair of these steps.
You're suggesting a filter between 2 and 3, I suggested one between
1 and 2, and the well-known trick of inserting into a temp table
and then filtering to a final table amounts to a filter between
3 and 4.

We've had proposals before to come up with a form of COPY that can be
used as a data source in an INSERT/SELECT context.  That would
automatically provide a way to filter between steps 3 and 4 --- or maybe
better, insist that the COPY-thingy produces just text fields, and leave
both steps 3 and 4 to be done by the INSERT/SELECT.  With that, the only
missing functionality would be a textual filter between steps 1 and 2.
        regards, tom lane

Re: Ragged CSV import

From

Peter Eisentraut

Date:

10 September 2009, 05:24:23

On Wed, 2009-09-09 at 19:18 -0400, Robert Haas wrote:
> I agree that ignoring extra columns is a bad idea, but I don't even
> like the idea of ignoring missing columns.  It doesn't seem like a
> good idea to take a spreadsheet and feed it into COPY without doing
> any validation anyway, and this is the kind of thing that is trivial
> to clean up with a thin layer of Perl or your scripting language of
> choice.

I would think that a spreadsheet application has some kind of control
over its export format, too.  Perhaps you can just tell it to export
only the first N columns?

Re: Ragged CSV import

From

Nikhil Sontakke

Date:

10 September 2009, 05:38:34

Hi,

> the two most
> useful are to read in only some of the defined columns, and to output
> to
> a separate disk file any rows which failed to match the expected
> format.
> The latter would not cause the copy to fail unless the count of such
> rows exceeded a user-specified threshold.
>

+1

Allowing the capability to handle rows that might get discarded due to
constraint violations, bad column inputs etc. sounds like a big help
while doing large copy operations.

Another capability would be to transform the input column via some sql
expressions before loading it into the table. Given the way update
works, this could avoid the unnecessary subsequent bloat to fine-tune
some of the columns.

Regards,
Nikhils
-- 
http://www.enterprisedb.com

Re: Ragged CSV import

From

Dimitri Fontaine

Date:

10 September 2009, 06:25:10

Tom Lane <tgl@sss.pgh.pa.us> writes:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> you mean some sort of filter mechanism?
>
>> COPY .... FILTER function_name ( args) ... ?
>
>> That might work. Then we could provide a couple builtin and people could 
>> write others in C or PL/dujour.
>
> Yeah, that's pretty much what I was thinking, although exactly where the
> filter happens and what its arguments/results are is open to debate.

Last time we opened this can of worms here's what it lead us to:
 http://archives.postgresql.org/pgsql-hackers/2008-04/msg00222.php

That is, COPY support in the FROM clause. Now the example gets rewritten
as:
 SELECT *   FROM function_name(a, b, c)        LATERAL (COPY ... FROM ...);

Or with some more nesting if we don't have LATERAL by the time we get
COPY support in the FROM clause :)   
That allows to simplify some things: COPY in FROM always returns text
datums, the function_name can be any callable function (pl/whatever),
and we could import easily data whose representation has to be adapted
to PostgreSQL data type input function (such as timestamps etc)
 INSERT INTO t SELECT myparse(a) FROM (COPY ...);

As already said by Andrew on IRC, though, this goes nowhere to help him
address his direct concern, COPY would still choke on input lines that
are containing too few or too many columns...

Regards,
-- 
dim

Re: Ragged CSV import

From

Andrew Dunstan

Date:

10 September 2009, 08:09:09


Peter Eisentraut wrote:
> On Wed, 2009-09-09 at 19:18 -0400, Robert Haas wrote:
>   
>> I agree that ignoring extra columns is a bad idea, but I don't even
>> like the idea of ignoring missing columns.  It doesn't seem like a
>> good idea to take a spreadsheet and feed it into COPY without doing
>> any validation anyway, and this is the kind of thing that is trivial
>> to clean up with a thin layer of Perl or your scripting language of
>> choice.
>>     
>
> I would think that a spreadsheet application has some kind of control
> over its export format, too.  Perhaps you can just tell it to export
> only the first N columns?
>
>   


Not to my knowledge. In any case, this requires more user work and much 
more user education, and in the situation I am dealing with neither is 
likely.

cheers

andrew

Re: Ragged CSV import

From

Andrew Dunstan

Date:

10 September 2009, 08:16:19

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> you mean some sort of filter mechanism?
>>     
>
>   
>> COPY .... FILTER function_name ( args) ... ?
>>     
>
>   
>> That might work. Then we could provide a couple builtin and people could 
>> write others in C or PL/dujour.
>>     
>
> Yeah, that's pretty much what I was thinking, although exactly where the
> filter happens and what its arguments/results are is open to debate.
>
>             
>
>   

I have slept on it, and while this is a nice idea, I think your 
suggestion that this would take about the same effort as what I had in 
mind (which I could just about have coded by now) is out by at least an 
order of magnitude. I think it would eat up all my available time 
between now and Nov 15, and I'm not prepared to spend my time budget 
that way.

I will talk to the requestor and see if we can do something along the 
lines of Robert's suggestion using a plperlu function and a temp file, 
which I think I could do within the original time budget.

Consider the suggestion withdrawn.

cheers

andrew

Re: Ragged CSV import

From

Alvaro Herrera

Date:

10 September 2009, 10:32:46

Andrew Dunstan escribió:

> As for the feature not being large enough, we have a regrettable
> (IMNSHO) tendency in this project to take simple proposals and
> embelish them with many bells and whistles. Personally, I'm much
> more of a fan of incremental development.

+1

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Ragged CSV import

From

Robert Haas

Date:

10 September 2009, 10:56:39

On Wed, Sep 9, 2009 at 11:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Sep 9, 2009 at 11:01 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>>> The thought that comes to mind for me is something "in front of" copy,
>>> that is, give it the text of each line and let it do a text-to-text
>>> transformation before COPY chews on it.
>
>> That seems to me to be a whole lot less useful.  As I see it, the
>> whole point of any enhancement in this area is to reuse the parsing
>> code.   If I have a CSV file (or some other format COPY understands),
>> I don't want to have to write my own parser for that format in order
>> to do some simple data transformation (like dropping columns >n).
>
> I'm unconvinced.  A significant fraction of the COPY customization
> requests that I can remember had to do with cleaning up input that
> didn't entirely match COPY's idea of what valid CSV is.  If it's close
> enough that COPY will successfully parse it as N columns, you can stick
> it into an N-column text table and do your transformations afterwards.
>
> Thinking a bit more generally, it seems to me that as COPY IN is
> currently constituted, there are really four distinguishable bits
> of functionality in a pipeline:
>
> 1. Reading from FE/BE protocol (or file) and conversion to database
> character set.
>
> 2. Parsing the text into a set of de-escaped field values.
>
> 3. Converting the field values to internal Datum form (requires
> knowledge of a set of target data types).
>
> 4. Bulk insertion into the target table.
>
> It is not hard to come up with problems that are most easily solved
> by injecting some sort of filtering between any pair of these steps.
> You're suggesting a filter between 2 and 3, I suggested one between
> 1 and 2, and the well-known trick of inserting into a temp table
> and then filtering to a final table amounts to a filter between
> 3 and 4.
>
> We've had proposals before to come up with a form of COPY that can be
> used as a data source in an INSERT/SELECT context.  That would
> automatically provide a way to filter between steps 3 and 4 --- or maybe
> better, insist that the COPY-thingy produces just text fields, and leave
> both steps 3 and 4 to be done by the INSERT/SELECT.  With that, the only
> missing functionality would be a textual filter between steps 1 and 2.

That would be swell.  Being able to use COPY as an expression
returning text[] would, I think, cater to an extremely wide variety of
use cases, although it sounds like Andrew is not interested in
implementing it at this point.  If we want to add on a filter between
steps 1 and 2, that sounds potentially useful as well.

I am fuzzy on the implementation details for making COPY act as a data
source for INSERT/SELECT though.  I had thought to make EXPLAIN a data
source, but it turned out not to be possible (as far as I could tell)
without making EXPLAIN a fully-reserved word, which you vetoed.  It
seems likely that COPY will present similar issues, though I haven't
tried.

I am also wondering what happens when someone embeds multiple COPY
statements in a single query, or sticks one inside of a CTE or on the
inner side of a left join.  I am inclined to think that we should
cause COPY to materialize its input (so rescans will return the same
data already read) and throw an error if more than one is present in
the same query (unless you have a better idea?).

...Robert

Re: Ragged CSV import

From

Tom Lane

Date:

10 September 2009, 10:57:59

Dimitri Fontaine <dfontaine@hi-media.com> writes:
> Last time we opened this can of worms here's what it lead us to:
>   http://archives.postgresql.org/pgsql-hackers/2008-04/msg00222.php

Right, that's the discussion I was remembering.

> As already said by Andrew on IRC, though, this goes nowhere to help him
> address his direct concern, COPY would still choke on input lines that
> are containing too few or too many columns...

An easy fix for that is to allow this form of COPY to return text[]
instead of a fixed number of text columns.
        regards, tom lane

Re: Ragged CSV import

From

Robert Haas

Date:

10 September 2009, 10:59:05

On Thu, Sep 10, 2009 at 9:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dimitri Fontaine <dfontaine@hi-media.com> writes:
>> Last time we opened this can of worms here's what it lead us to:
>>   http://archives.postgresql.org/pgsql-hackers/2008-04/msg00222.php
>
> Right, that's the discussion I was remembering.
>
>> As already said by Andrew on IRC, though, this goes nowhere to help him
>> address his direct concern, COPY would still choke on input lines that
>> are containing too few or too many columns...
>
> An easy fix for that is to allow this form of COPY to return text[]
> instead of a fixed number of text columns.

Hmm... in fact, what if we just implemented COPY returning text[] for
a first cut, and forget making it act like a table source for now.
Then you could COPY to a temp table (with a text[] column) and select
from there.  There's really two separate features here: COPY producing
text[] rather than a record type, and COPY being embeddable in some
larger query.

...Robert

Re: Ragged CSV import

From

Tom Lane

Date:

10 September 2009, 11:09:36

Robert Haas <robertmhaas@gmail.com> writes:
> I am fuzzy on the implementation details for making COPY act as a data
> source for INSERT/SELECT though.  I had thought to make EXPLAIN a data
> source, but it turned out not to be possible (as far as I could tell)
> without making EXPLAIN a fully-reserved word, which you vetoed.  It
> seems likely that COPY will present similar issues, though I haven't
> tried.

IIRC the previous discussion touched on making it look like a
set-returning function, although this would be a shade less convenient
for option parsing etc.

> I am also wondering what happens when someone embeds multiple COPY
> statements in a single query, or sticks one inside of a CTE or on the
> inner side of a left join.

Yeah, it would need to be restricted somehow.  A straight SRF function
would materialize its result, but I doubt we want that to happen for
COPY.

(This brings up the whole question of performance impact, which would
have to be thought about and minimized.)
        regards, tom lane

Re: Ragged CSV import

From

Tom Lane

Date:

10 September 2009, 11:12:11

Robert Haas <robertmhaas@gmail.com> writes:
> Hmm... in fact, what if we just implemented COPY returning text[] for
> a first cut, and forget making it act like a table source for now.
> Then you could COPY to a temp table (with a text[] column) and select
> from there.  There's really two separate features here: COPY producing
> text[] rather than a record type, and COPY being embeddable in some
> larger query.

That might be a reasonable way-station ...
        regards, tom lane

Re: Ragged CSV import

From

Hannu Krosing

Date:

10 September 2009, 15:21:49

On Thu, 2009-09-10 at 10:09 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I am fuzzy on the implementation details for making COPY act as a data
> > source for INSERT/SELECT though.  I had thought to make EXPLAIN a data
> > source, but it turned out not to be possible (as far as I could tell)
> > without making EXPLAIN a fully-reserved word, which you vetoed.  It
> > seems likely that COPY will present similar issues, though I haven't
> > tried.
> 
> IIRC the previous discussion touched on making it look like a
> set-returning function, although this would be a shade less convenient
> for option parsing etc.
> 
> > I am also wondering what happens when someone embeds multiple COPY
> > statements in a single query

As long as they COPY from different input files it should be perfectly
OK. Though this already goes out into SQL/MED land. 

> , or sticks one inside of a CTE or on the
> > inner side of a left join.
> 
> Yeah, it would need to be restricted somehow.  A straight SRF function
> would materialize its result, but I doubt we want that to happen for
> COPY.

Making SRF-s streamable is another thing that would make postgreSQL a
lot more powerful for all kinds of computing. 

> (This brings up the whole question of performance impact, which would
> have to be thought about and minimized.)

Completely agree - nobody wants a slow COPY command.

>             regards, tom lane

-- 
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training

Re: Ragged CSV import

From

Andrew Dunstan

Date:

10 September 2009, 15:26:55


Tom Lane wrote:
>> As already said by Andrew on IRC, though, this goes nowhere to help him
>> address his direct concern, COPY would still choke on input lines that
>> are containing too few or too many columns...
>>     
>
> An easy fix for that is to allow this form of COPY to return text[]
> instead of a fixed number of text columns.
>
>             
>   

I'd like to see some concrete syntax suggestions for such an animal.

cheers

andrew

Re: Ragged CSV import

From

Tom Lane

Date:

10 September 2009, 15:50:21

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> An easy fix for that is to allow this form of COPY to return text[]
>> instead of a fixed number of text columns.

> I'd like to see some concrete syntax suggestions for such an animal.

I think there were some in the previous discussion --- it came down
to using a SRF with arguments that specify the copy source and options.
Setting the result type as text[] instead of some has-to-be-specified
composite type should actually make it a bit easier.
        regards, tom lane

Re: Ragged CSV import

From

Tom Lane

Date:

10 September 2009, 15:51:27

Hannu Krosing <hannu@2ndQuadrant.com> writes:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> I am also wondering what happens when someone embeds multiple COPY
>>> statements in a single query

> As long as they COPY from different input files it should be perfectly
> OK. Though this already goes out into SQL/MED land. 

Multiple copy from file is okay, copy from stdin not so much.
        regards, tom lane

Re: Ragged CSV import

From

Andrew Dunstan

Date:

10 September 2009, 16:23:32


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Tom Lane wrote:
>>     
>>> An easy fix for that is to allow this form of COPY to return text[]
>>> instead of a fixed number of text columns.
>>>       
>
>   
>> I'd like to see some concrete syntax suggestions for such an animal.
>>     
>
> I think there were some in the previous discussion --- it came down
> to using a SRF with arguments that specify the copy source and options.
>   

Oh, so I guess not stdin.


cheers

andrew

Re: Ragged CSV import

From

Tom Lane

Date:

10 September 2009, 16:34:22

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> I think there were some in the previous discussion --- it came down
>> to using a SRF with arguments that specify the copy source and options.

> Oh, so I guess not stdin.

Well, it'd be nice if stdin were allowed, but we'd have to look into how
to prevent multiple reads or re-reads then.
        regards, tom lane

Re: Ragged CSV import

From

Stephen Frost

Date:

10 September 2009, 21:22:10

* Andrew Dunstan (andrew@dunslane.net) wrote:
> Consider the suggestion withdrawn.

Let's not throw it out completely.  The proposal to have COPY return a
text[] in some fashion is interesting enough that others (ah, such as
myself..) might be willing to put effort into it.  Andrew, could you put
your thoughts plus some example files onto a wiki page, at least?  Then
Robert, Tom, myself, etc, could update that to nail down the
specification and then it's just an implementation detail, as it were.
Thanks!
    Stephen

Re: Ragged CSV import

From

Andrew Dunstan

Date:

10 September 2009, 22:26:54

Stephen Frost wrote:
> * Andrew Dunstan (andrew@dunslane.net) wrote:
>   
>> Consider the suggestion withdrawn.
>>     
>
> Let's not throw it out completely.  The proposal to have COPY return a
> text[] in some fashion is interesting enough that others (ah, such as
> myself..) might be willing to put effort into it.  Andrew, could you put
> your thoughts plus some example files onto a wiki page, at least?  Then
> Robert, Tom, myself, etc, could update that to nail down the
> specification and then it's just an implementation detail, as it were.
>
>     
>   

I don't mind discussing the idea a bit.

I don't have any samples readily to hand, but really anything that's not 
strictly rectangular meets my original case, like
   a,b,c   1,2,3   4,5,6,7   8,9   10,11,12

I do like the idea of COPY returning a SETOF text[], but I am not at all 
clear on the mechanics of feeding STDIN to an SRF. ISTM that something 
like a RETURNING clause on COPY and the ability to use it in FROM clause 
or something similar might work better. I understand the difficulties, 
but maybe we could place some restrictions on where it could be used so 
as to obviate at least some of those.

One of the things I like about a SETOF text[] is that it lets you 
reorder the columns, or cherry pick which ones you want. In fact, it 
might be argued with that the hacky FORCE NOT NULL, which has always 
pained me somewhat, even if it was my idea ;-) might no longer be needed.

I'd love to be able to do something like
   INSERT into foo (x,y,z) select t[3],[t2],[t57] from (COPY RETURNING   t FROM stdin CSV);

The only thing that's been seriously on the table that isn't accounted 
for by something like this is the suggestion of making the header line 
have some semantic significance, and I'm far from sure that's a good idea.

If time were not short on getting features presented I might attempt to 
do it, but I have one very large monkey (and a few small ones) on my 
back that I am determined to get rid of by the November CF, and there is 
not a hope in the world I could get two large features done, even if we 
had the details of this all sorted out and agreed on. That's why I said 
"Consider the suggestion withdrawn".

cheers

andrew

Re: Ragged CSV import

From

Stephen Frost

Date:

10 September 2009, 23:35:01

Andrew,

* Andrew Dunstan (andrew@dunslane.net) wrote:
> I don't mind discussing the idea a bit.

Great!  Thanks for this.  I'll put it up on a wiki and start looking at
implementation details and what the semantics should be.
Thanks again,
    Stephen

Re: Ragged CSV import

From

Dimitri Fontaine

Date:

11 September 2009, 05:27:28

Hi,

Andrew Dunstan <andrew@dunslane.net> writes:
> I do like the idea of COPY returning a SETOF text[], but I am not at all
> clear on the mechanics of feeding STDIN to an SRF. ISTM that something like
> a RETURNING clause on COPY and the ability to use it in FROM clause or
> something similar might work better. 

I envisonned COPY "just" returning all what it reads (minus extra discarded
column as soon as your proposal gets implemented), in the FROM clause,
and the user sorting out what he wants in the SELECT clause.

> I understand the difficulties, but
> maybe we could place some restrictions on where it could be used so as to
> obviate at least some of those.

Maybe instead of opening FROM for COPY, having it accepted in WITH would
be better, the same way (from the user point of view) that DML returning
are worked on.

>    INSERT into foo (x,y,z) select t[3],[t2],[t57] from (COPY RETURNING
>    t FROM stdin CSV);

Would become:
WITH csv AS (  COPY t FROM stdin CSV)INSERT INTO foo(x, y, z) SELECT t[3], t[2], mytimestamptz([5], [6], [7])  FROM
csv;

Now the text[] has a strange feeling, without it it'd be:
WITH csv AS (  COPY t(a, b, c, d, e, f, g)   FROM stdin    CSV IGNORING EXTRA COLUMNS         -- random nice syntax
 MISSING COLUMNS DEFAULTS NULL  -- that needs some reality check)INSERT INTO foo(x, y, z) SELECT c, b, mytimestamptz(e,
f,g)  FROM csv;

The function mytimestamptz(date text, time text, timezone text) will
accept input that PostgreSQL input types would have errored out on... so
you can process in one go strange formats from other products.

> The only thing that's been seriously on the table that isn't accounted for
> by something like this is the suggestion of making the header line have some
> semantic significance, and I'm far from sure that's a good idea.

Yeah, and it seems only useful when you don't have any way to play with
what COPY returns before it goes to a table (magic column reordering or
ignoring).

Regards,
-- 
dim

Re: Ragged CSV import

From

Andrew Dunstan

Date:

11 September 2009, 12:00:03

I wrote:
> I'd love to be able to do something like
>
>    INSERT into foo (x,y,z) select t[3],[t2],[t57] from (COPY RETURNING
>    t FROM stdin CSV);
>
>

Some IRC discussion suggested ways we could do better than that syntax. 
I think my current preferred candidate is something like
   COPY foo (a,b,c)   FROM stdin   CSV   AS t USING (t[3],t[2],[t57]);

I'm not sure how we'd could plug a filter into that. Maybe a WHERE 
clause? My immediate need at least doesn't actually involve filtering 
anything - we load every line in the CSV into a temp table and then 
filter what we load into the main tables after applying the business rules.

cheers

andrew

Re: Ragged CSV import

From

Tom Lane

Date:

11 September 2009, 12:16:59

Andrew Dunstan <andrew@dunslane.net> writes:
> I wrote:
>> I'd love to be able to do something like
>> 
>> INSERT into foo (x,y,z) select t[3],[t2],[t57] from (COPY RETURNING
>> t FROM stdin CSV);

> Some IRC discussion suggested ways we could do better than that syntax. 
> I think my current preferred candidate is something like

>     COPY foo (a,b,c)
>     FROM stdin
>     CSV
>     AS t USING (t[3],t[2],[t57]);

[ scratches head... ]  What happened to seeing COPY as a data source in
a larger command?  If that syntax has anything at all to recommend it,
I'm not seeing what.  It's not extensible and it would require lots of
code duplication to implement the impoverished feature set it does have.
        regards, tom lane

Re: Ragged CSV import

From

Andrew Dunstan

Date:

11 September 2009, 12:29:32


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> I wrote:
>>     
>>> I'd love to be able to do something like
>>>
>>> INSERT into foo (x,y,z) select t[3],[t2],[t57] from (COPY RETURNING
>>> t FROM stdin CSV);
>>>       
>
>   
>> Some IRC discussion suggested ways we could do better than that syntax. 
>> I think my current preferred candidate is something like
>>     
>
>   
>>     COPY foo (a,b,c)
>>     FROM stdin
>>     CSV
>>     AS t USING (t[3],t[2],[t57]);
>>     
>
> [ scratches head... ]  What happened to seeing COPY as a data source in
> a larger command?  If that syntax has anything at all to recommend it,
> I'm not seeing what.  It's not extensible and it would require lots of
> code duplication to implement the impoverished feature set it does have.
>
>             
>   

Well, I think the objection was that it would slow COPY down to have to 
go though the executor in the copy-as-source scenario. But maybe that 
would happen anyway, and maybe we don't care, we'd just accept that it 
wouldn't be nearly as fast as a raw copy.

cheers

andrew

Re: Ragged CSV import

From

Tom Lane

Date:

11 September 2009, 12:36:56

Andrew Dunstan <andrew@dunslane.net> writes:
> Well, I think the objection was that it would slow COPY down to have to 
> go though the executor in the copy-as-source scenario. But maybe that 
> would happen anyway, and maybe we don't care, we'd just accept that it 
> wouldn't be nearly as fast as a raw copy.

I haven't heard complaints about the COPY (query) syntax, which is
the same thing in the opposite direction.  You can't expect that
flexibility costs zero.
        regards, tom lane

Re: Ragged CSV import

From

daveg

Date:

12 September 2009, 23:15:23

On Fri, Sep 11, 2009 at 10:27:06AM +0200, Dimitri Fontaine wrote:
> Maybe instead of opening FROM for COPY, having it accepted in WITH would
> be better, the same way (from the user point of view) that DML returning
> are worked on.
... 
>  WITH csv AS (
>    COPY t FROM stdin CSV
>  )
>  INSERT INTO foo(x, y, z) 
>  SELECT t[3], t[2], mytimestamptz([5], [6], [7])
>    FROM csv;
> 
> Now the text[] has a strange feeling, without it it'd be:
> 
>  WITH csv AS (
>    COPY t(a, b, c, d, e, f, g) 
>    FROM stdin 
>     CSV IGNORING EXTRA COLUMNS         -- random nice syntax
>         MISSING COLUMNS DEFAULTS NULL  -- that needs some reality check
>  )
>  INSERT INTO foo(x, y, z) 
>  SELECT c, b, mytimestamptz(e, f, g)
>    FROM csv;
> 
> The function mytimestamptz(date text, time text, timezone text) will
> accept input that PostgreSQL input types would have errored out on... so
> you can process in one go strange formats from other products.

+1

-dg

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.