Thread: Different length lines in COPY CSV

Different length lines in COPY CSV

From
Christopher Kings-Lynne
Date:
Hi,

Is there any way to force COPY to accept that there will be lines of 
different length in a data file?

I have a rather large file I'm trying to import.  It's in CSV format, 
however, they leave off trailing empty columns on most lines.

Any way to do this?  Should it be supported by CSV mode?

Chris



Re: Different length lines in COPY CSV

From
Bruce Momjian
Date:
Christopher Kings-Lynne wrote:
> Hi,
> 
> Is there any way to force COPY to accept that there will be lines of 
> different length in a data file?
> 
> I have a rather large file I'm trying to import.  It's in CSV format, 
> however, they leave off trailing empty columns on most lines.
> 
> Any way to do this?  Should it be supported by CSV mode?

No, there is not, partly so we can complain that the file is corrupt, and
partly because we are unsure what to make of the missing values: NULL?

I suppose we could have a TRAILINGNULL flag to COPY but because few ask
for this feature, it hardly seems worth it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Different length lines in COPY CSV

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Christopher Kings-Lynne wrote:
>> Is there any way to force COPY to accept that there will be lines of 
>> different length in a data file?

> I suppose we could have a TRAILINGNULL flag to COPY but because few ask
> for this feature, it hardly seems worth it.

There is no chance that we'll ever be able to cope with every insane
file format that some benighted program claims is CSV.  The harder we
try, the more we will lose the ability to detect data errors at all;
not to mention the likely negative consequences for the readability
and performance of the COPY code.  I think "fix it with a perl script"
is a very reasonable answer for cases like this one.
        regards, tom lane
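
For the archives, a minimal sketch of the perl-script approach Tom suggests
might look like the following. The script name, the column count, and the use
of the Text::CSV_XS module that comes up later in this thread are illustrative
assumptions, not anyone's actual code:

#!/usr/bin/perl
# pad_csv.pl - hypothetical helper: pad short rows in a CSV file so that
# every row has the column count COPY expects.
use strict;
use warnings;
use Text::CSV_XS;

my $ncols = 12;      # number of columns the target table expects (assumed)
my $csv   = Text::CSV_XS->new({ binary => 1, eol => "\n" });

while (my $row = $csv->getline(\*STDIN)) {
    push @$row, '' while @$row < $ncols;    # append empty trailing fields
    $csv->print(\*STDOUT, $row);
}

Run it as something like: perl pad_csv.pl < dodgy-data.csv > fixed.csv, then
load fixed.csv with COPY ... WITH CSV. With COPY's default CSV settings an
unquoted empty field is read as NULL, which is presumably what the omitted
trailing columns are meant to be.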


Re: Different length lines in COPY CSV

From
"Andrew Dunstan"
Date:
Tom Lane said:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Christopher Kings-Lynne wrote:
>>> Is there any way to force COPY to accept that there will be lines of
>>> different length in a data file?
>
>> I suppose we could have a TRAILINGNULL flag to COPY but because few
>> ask for this feature, it hardly seems worth it.
>
> There is no chance that we'll ever be able to cope with every insane
> file format that some benighted program claims is CSV.  The harder we
> try, the more we will lose the ability to detect data errors at all;
> not to mention the likely negative consequences for the readability and
> performance of the COPY code.  I think "fix it with a perl script" is a
> very reasonable answer for cases like this one.
>

I agree.

The COPY code is probably on the edge of maintainability now.

Our CSV routines accept a wide variety of import formats, but a fixed
number of columns is required. Maybe we need a pgfoundry project with some
general perl CSV munging utilities - this issue comes up often enough.

cheers

andrew





Re: Different length lines in COPY CSV

From
Tom Lane
Date:
"Andrew Dunstan" <andrew@dunslane.net> writes:
> The COPY code is probably on the edge of maintainability now.
> Our CSV routines accept a wide variety of import formats, but a fixed
> number of columns is required. Maybe we need a pgfoundry project with some
> general perl CSV munging utilities - this issue comes up often enough.

What's been suggested in the past is some sort of standalone
file-format-conversion utility, which could deal with this sort of stuff
without having to also deal with all the backend-internal considerations
that COPY must handle.  So (at least in theory) it'd be simpler and more
maintainable.  That still seems like a good idea to me --- in fact,
given my druthers I would rather have seen CSV support done in such an
external program.
        regards, tom lane


Re: Different length lines in COPY CSV

From
"Pollard, Mike"
Date:
Tom Lane wrote:
> What's been suggested in the past is some sort of standalone
> file-format-conversion utility, which could deal with this sort of stuff
> without having to also deal with all the backend-internal considerations
> that COPY must handle.  So (at least in theory) it'd be simpler and more
> maintainable.  That still seems like a good idea to me --- in fact,
> given my druthers I would rather have seen CSV support done in such an
> external program.

Why not add hooks into COPY to call the user's massage functions?  That
way you don't have to read and write the data, then read it again to
load it into the database.

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.




Re: Different length lines in COPY CSV

From
Andrew Dunstan
Date:

Tom Lane wrote:

>"Andrew Dunstan" <andrew@dunslane.net> writes:
>  
>
>>The COPY code is probably on the edge of maintainability now.
>>Our CSV routines accept a wide variety of imports formats, but a fixed
>>number of columns is required. Maybe we need a pgfoundry project with some
>>general perl CSV munging utilities - this issue comes up often enough.
>>    
>>
>
>What's been suggested in the past is some sort of standalone
>file-format-conversion utility, which could deal with this sort of stuff
>without having to also deal with all the backend-internal considerations
>that COPY must handle.  So (at least in theory) it'd be simpler and more
>maintainable.  That still seems like a good idea to me --- in fact,
>given my druthers I would rather have seen CSV support done in such an
>external program.
>
>    
>  
>

We debated the reasons at the time, and I am not convinced we were wrong
- huge bulk loads are a lot simpler if you don't have to call some
external program to munge the data first.

From time to time people thank me for things I have contributed to
PostgreSQL. The two that get the most thanks by far are CSV support and
dollar quoting.

Anyway, that's history now. Where would you want this file conversion 
utility? bin? contrib? pgfoundry?

cheers

andrew


Re: Different length lines in COPY CSV

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> What's been suggested in the past is some sort of standalone
>> file-format-conversion utility,

> Anyway, that's history now. Where would you want this file conversion 
> utility? bin? contrib? pgfoundry?

I'd say pgfoundry for starters --- there's no reason to tie it down
to server release cycles.  Maybe when the thing is fairly mature and
doesn't need frequent releases, we could think about whether it ought
to be brought into the core distro.

However, it likely won't ever be a candidate to become part of core
unless it's written in C, and offhand I would judge C to not be the
best choice of implementation language for such a thing.  This is surely
going to be mostly a string-pushing type of problem, so something like
perl might be a better bet.
        regards, tom lane


Re: Different length lines in COPY CSV

From
Martijn van Oosterhout
Date:
On Mon, Dec 12, 2005 at 10:15:03AM -0500, Pollard, Mike wrote:
> Tom Lane wrote:
> > What's been suggested in the past is some sort of standalone
> > file-format-conversion utility, which could deal with this sort of
> > stuff without having to also deal with all the backend-internal
> > considerations that COPY must handle.  So (at least in theory) it'd
> > be simpler and more maintainable.  That still seems like a good
> > idea to me --- in fact, given my druthers I would rather have seen
> > CSV support done in such an external program.
>
> Why not add hooks into COPY to call the user's massage functions?  That
> way you don't have to read and write the data, then read it again to
> load it into the database.

Well, it does make you wonder about supporting something like (perl
style):

\copy foo FROM 'myfilter dodgy-data.csv |'

or maybe

\copy foo from pipe 'myfilter dodgy-data.csv'

or possibly

\pipe foo from dodgy-data.csv using autodetecting-format-filter.pl

Which would cause psql to fork/exec the filter and pass the output data
to the server. We could then provide all sorts of parsers for
format-of-the-week. This would probably include the converse:

\pipe foo to table.xls using make-excel-spreadsheet.pl

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
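
In the absence of a \pipe command, roughly the same effect can already be had
by running the filter outside psql and feeding its output to \copy ... from
stdin. As an illustration of the kind of filter such a command might invoke,
here is a hypothetical sketch (the name and the input format are made up) that
turns tab-separated input into CSV that COPY will accept, again assuming the
Text::CSV_XS module:

#!/usr/bin/perl
# tab2csv.pl - hypothetical filter: read tab-separated data on stdin and
# write properly quoted CSV on stdout.
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

while (my $line = <STDIN>) {
    chomp $line;
    my @fields = split /\t/, $line, -1;   # -1 keeps trailing empty fields
    $csv->print(\*STDOUT, \@fields);      # handles quoting and escaping
}

Invocation would then be something along the lines of:
tab2csv.pl < dodgy-data.dat | psql -c '\copy foo from stdin with csv' mydb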

Re: Different length lines in COPY CSV

From
"Mike G."
Date:
I too have to deal with this issue daily since I deal with Windows daily.  It sounds like the csv files were
originally Excel files.  I have never been able to figure out how Excel determines to quit putting null values in
empty columns after X rows and resume again later on.

If the file has fewer than 65000 rows I would suggest using OpenOffice 2.0 instead of Excel.  OpenOffice does not stop
filling the empty columns, and with 2.0 it now supports the same maximum number of rows that Excel does.

I use Perl constantly to "reformat" files and import them as a csv using the COPY command.  I think the original poster
would prefer a php solution though...

While it is not a problem for me, I do have other less technical users who don't know perl, and this makes postgres
much more difficult for them to use.  Most of them come from a M$ Access background, which can import Excel files
directly and thus doesn't force them to deal with this issue.

A file conversion utility would be very helpful for supporting Postgres with Windows, especially if it could handle
Excel files in their native format.
 


Mike






Re: Different length lines in COPY CSV

From
Andrew Dunstan
Date:

Tom Lane wrote:

>>Where would you want this file conversion
>>utility? bin? contrib? pgfoundry?
>
>I'd say pgfoundry for starters --- there's no reason to tie it down
>to server release cycles.  Maybe when the thing is fairly mature and
>doesn't need frequent releases, we could think about whether it ought
>to be brought into the core distro.
>
>However, it likely won't ever be a candidate to become part of core
>unless it's written in C, and offhand I would judge C to not be the
>best choice of implementation language for such a thing.  This is surely
>going to be mostly a string-pushing type of problem, so something like
>perl might be a better bet.

You are probably right. The biggest wrinkle will be dealing with various 
encodings, I suspect. That at least is one thing that doing CSV within 
the backend bought us fairly painlessly. Perl's Text::CSV_XS module for 
example simply handles this by declaring that only [\x09\x20-\x7f] are 
valid in its non-binary mode, and in either mode appears to be MBCS 
unaware. We should try to do better than that.

cheers

andrew


Re: Different length lines in COPY CSV

From
Tino Wildenhain
Date:
On Monday, 12.12.2005, at 15:08 -0500, Andrew Dunstan wrote:
> 
> Tom Lane wrote:
...
> 
> You are probably right. The biggest wrinkle will be dealing with various 
> encodings, I suspect. That at least is one thing that doing CSV within 
> the backend bought us fairly painlessly. Perl's Text::CSV_XS module for 
> example simply handles this by declaring that only [\x09\x20-\x7f] are 
> valid in its non-binary mode, and in either mode appears to be MBCS 
> unaware. We should try to do better than that.

Are there any test datafiles available in a repository?
I could give it a shot I think.

If not maybe we could set up something like that.

Regards
Tino



Re: Different length lines in COPY CSV

From
Martijn van Oosterhout
Date:
On Mon, Dec 12, 2005 at 09:30:12PM +0100, Tino Wildenhain wrote:
> On Monday, 12.12.2005, at 15:08 -0500, Andrew Dunstan wrote:
> > You are probably right. The biggest wrinkle will be dealing with various
> > encodings, I suspect. That at least is one thing that doing CSV within
> > the backend bought us fairly painlessly. Perl's Text::CSV_XS module for
> > example simply handles this by declaring that only [\x09\x20-\x7f] are
> > valid in its non-binary mode, and in either mode appears to be MBCS
> > unaware. We should try to do better than that.
>
> Are there any test datafiles available in a repository?
> I could give it a shot I think.
>
> If not maybe we could set up something like that.

Note, recent versions of Perl allow you to specify the file encoding
when you open the file and will convert things to UTF-8 as appropriate.
So in theory it should be fairly simple to make a script that could
handle various encodings. The hardest part is always determining which
encoding a file is in in the first place...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
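
As an illustration of the I/O layers Martijn refers to, a small recoding
filter might look like the sketch below; the script name is made up and the
input is assumed to be known Latin-1, since, as he says, detecting the
encoding automatically is the hard part:

#!/usr/bin/perl
# recode_csv.pl - hypothetical sketch: read a file in a declared encoding
# (Latin-1 assumed here) and write it out as UTF-8 using perl's I/O layers.
use strict;
use warnings;

my $file = shift or die "usage: $0 file.csv\n";
open my $in, '<:encoding(iso-8859-1)', $file or die "cannot open $file: $!\n";
binmode STDOUT, ':encoding(UTF-8)';

print while <$in>;   # each line is decoded on read and re-encoded on write

In practice the encoding to put in that open call usually has to be supplied
by whoever produced the file.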

Re: Different length lines in COPY CSV

From
Christopher Kings-Lynne
Date:
> Anyway, that's history now. Where would you want this file conversion 
> utility? bin? contrib? pgfoundry?

How about a full SQL*Loader clone? :D