Thread: Different length lines in COPY CSV
Hi,

Is there any way to force COPY to accept that there will be lines of different length in a data file?

I have a rather large file I'm trying to import. It's in CSV format; however, they leave off trailing empty columns on most lines.

Any way to do this? Should it be supported by CSV mode?

Chris
Christopher Kings-Lynne wrote:
> Hi,
>
> Is there any way to force COPY to accept that there will be lines of
> different length in a data file?
>
> I have a rather large file I'm trying to import. It's in CSV format;
> however, they leave off trailing empty columns on most lines.
>
> Any way to do this? Should it be supported by CSV mode?

No, there is not, partly so we can complain that the file is corrupt, and partly because we are unsure what to make the missing values: NULL? I suppose we could have a TRAILINGNULL flag to COPY, but because few ask for this feature, it hardly seems worth it.

--
  Bruce Momjian                      |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us             |  (610) 359-1001
  + If your life is a hard drive,    |  13 Roberts Road
  + Christ can be your backup.       |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Christopher Kings-Lynne wrote:
>> Is there any way to force COPY to accept that there will be lines of
>> different length in a data file?

> I suppose we could have a TRAILINGNULL flag to COPY but because few ask
> for this feature, it hardly seems worth it.

There is no chance that we'll ever be able to cope with every insane file format that some benighted program claims is CSV. The harder we try, the more we will lose the ability to detect data errors at all; not to mention the likely negative consequences for the readability and performance of the COPY code. I think "fix it with a perl script" is a very reasonable answer for cases like this one.

			regards, tom lane
Tom Lane said:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Christopher Kings-Lynne wrote:
>>> Is there any way to force COPY to accept that there will be lines of
>>> different length in a data file?
>
>> I suppose we could have a TRAILINGNULL flag to COPY but because few
>> ask for this feature, it hardly seems worth it.
>
> There is no chance that we'll ever be able to cope with every insane
> file format that some benighted program claims is CSV. The harder we
> try, the more we will lose the ability to detect data errors at all;
> not to mention the likely negative consequences for the readability and
> performance of the COPY code. I think "fix it with a perl script" is a
> very reasonable answer for cases like this one.

I agree. The COPY code is probably on the edge of maintainability now.

Our CSV routines accept a wide variety of import formats, but a fixed number of columns is required. Maybe we need a pgfoundry project with some general perl CSV munging utilities - this issue comes up often enough.

cheers

andrew
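[A minimal sketch of the kind of one-off perl munging script being suggested here, for readers who hit this thread with the same problem: it pads ragged rows out to a fixed column count before the file is handed to COPY. The file names, the column count, and the choice of Text::CSV_XS (which Andrew mentions later in the thread) are illustrative assumptions, not something posted to the list.]

#!/usr/bin/perl
# Hypothetical pre-processing script (a sketch, not from the thread):
# pad short CSV rows with empty trailing fields so COPY ... WITH CSV
# sees a fixed number of columns. File names and $ncols are placeholders.
use strict;
use warnings;
use Text::CSV_XS;

my $ncols = 12;    # number of columns the target table expects
my $csv   = Text::CSV_XS->new({ binary => 1, eol => "\n" });

open my $in,  '<', 'dodgy-data.csv'  or die "cannot read input: $!";
open my $out, '>', 'padded-data.csv' or die "cannot write output: $!";

while (my $row = $csv->getline($in)) {
    push @$row, '' while @$row < $ncols;    # add the missing trailing columns
    $csv->print($out, $row);
}

close $in;
close $out or die "close failed: $!";

[The padded output file can then be loaded with COPY ... WITH CSV as usual.]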
"Andrew Dunstan" <andrew@dunslane.net> writes: > The COPY code is probably on the edge of maintainability now. > Our CSV routines accept a wide variety of imports formats, but a fixed > number of columns is required. Maybe we need a pgfoundry project with some > general perl CSV munging utilities - this issue comes up often enough. What's been suggested in the past is some sort of standalone file-format-conversion utility, which could deal with this sort of stuff without having to also deal with all the backend-internal considerations that COPY must handle. So (at least in theory) it'd be simpler and more maintainable. That still seems like a good idea to me --- in fact, given my druthers I would rather have seen CSV support done in such an external program. regards, tom lane
Tom Lane wrote:
> What's been suggested in the past is some sort of standalone
> file-format-conversion utility, which could deal with this sort of stuff
> without having to also deal with all the backend-internal considerations
> that COPY must handle. So (at least in theory) it'd be simpler and more
> maintainable. That still seems like a good idea to me --- in fact,
> given my druthers I would rather have seen CSV support done in such an
> external program.

Why not add hooks into COPY to call the user's massage functions? That way you don't have to read and write the data, then read it again to load it into the database.

Mike Pollard
SUPRA Server SQL Engineering and Support
Cincom Systems, Inc.
Tom Lane wrote: >"Andrew Dunstan" <andrew@dunslane.net> writes: > > >>The COPY code is probably on the edge of maintainability now. >>Our CSV routines accept a wide variety of imports formats, but a fixed >>number of columns is required. Maybe we need a pgfoundry project with some >>general perl CSV munging utilities - this issue comes up often enough. >> >> > >What's been suggested in the past is some sort of standalone >file-format-conversion utility, which could deal with this sort of stuff >without having to also deal with all the backend-internal considerations >that COPY must handle. So (at least in theory) it'd be simpler and more >maintainable. That still seems like a good idea to me --- in fact, >given my druthers I would rather have seen CSV support done in such an >external program. > > > > We debated the reasons at the time, and I am not convinced we were wrong - huge bulk loads are a lot simpler if you don't have to call some external program to munge the data first. From time to time people thank me for things I have contributed to in PostgreSQL. The two that get the most thanks by far are CSV support and dollar quoting. Anyway, that's history now. Where would you want this file conversion utility? bin? contrib? pgfoundry? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> What's been suggested in the past is some sort of standalone
>> file-format-conversion utility,

> Anyway, that's history now. Where would you want this file conversion
> utility? bin? contrib? pgfoundry?

I'd say pgfoundry for starters --- there's no reason to tie it down to server release cycles. Maybe when the thing is fairly mature and doesn't need frequent releases, we could think about whether it ought to be brought into the core distro.

However, it likely won't ever be a candidate to become part of core unless it's written in C, and offhand I would judge C to not be the best choice of implementation language for such a thing. This is surely going to be mostly a string-pushing type of problem, so something like perl might be a better bet.

			regards, tom lane
On Mon, Dec 12, 2005 at 10:15:03AM -0500, Pollard, Mike wrote:
> Tom Lane wrote:
> > What's been suggested in the past is some sort of standalone
> > file-format-conversion utility, which could deal with this sort of
> > stuff without having to also deal with all the backend-internal
> > considerations that COPY must handle. So (at least in theory) it'd
> > be simpler and more maintainable. That still seems like a good
> > idea to me --- in fact, given my druthers I would rather have seen
> > CSV support done in such an external program.
>
> Why not add hooks into COPY to call the user's massage functions? That
> way you don't have to read and write the data, then read it again to
> load it into the database.

Well, it does make you wonder about supporting something like (perl style):

    \copy foo FROM 'myfilter dodgy-data.csv |'

or maybe

    \copy foo from pipe 'myfilter dodgy-data.csv'

or possibly

    \pipe foo from dodgy-data.csv using autodetecting-format-filter.pl

Which would cause psql to fork/exec the filter and pass the output data to the server. We could then provide all sorts of parsers for format-of-the-week. This would probably include the converse:

    \pipe foo to table.xls using make-excel-spreadsheet.pl

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
I too have to deal with this issue daily since I deal with Windows daily. It sounds like the source csv files were originally Excel files. I have never been able to figure out how Excel determines to quit putting null values in empty columns after X rows and resume again later on.

If the file has less than 65000 rows I would suggest using OpenOffice 2.0 instead of Excel. OpenOffice does not stop filling the empty columns, and with 2.0 it now supports the same maximum number of rows that Excel does.

I use Perl constantly to "reformat" files and import them as a csv using the COPY command. I think the original poster would prefer a php solution though...

While it is not a problem for me, I do have other less technical users who don't know perl, and this makes postgres much more difficult for them to use. Most of them come from a M$ Access background, which can handle importing of Excel files directly, so they don't have to deal with this issue.

A file conversion utility would be very helpful for supporting Postgres with Windows, especially if it could handle Excel files in their native format.

Mike

On Mon, Dec 12, 2005 at 07:58:52PM +0100, Martijn van Oosterhout wrote:
> On Mon, Dec 12, 2005 at 10:15:03AM -0500, Pollard, Mike wrote:
> > Tom Lane wrote:
> > > What's been suggested in the past is some sort of standalone
> > > file-format-conversion utility, which could deal with this sort of
> > > stuff without having to also deal with all the backend-internal
> > > considerations that COPY must handle. So (at least in theory) it'd
> > > be simpler and more maintainable. That still seems like a good
> > > idea to me --- in fact, given my druthers I would rather have seen
> > > CSV support done in such an external program.
> >
> > Why not add hooks into COPY to call the user's massage functions? That
> > way you don't have to read and write the data, then read it again to
> > load it into the database.
>
> Well, it does make you wonder about supporting something like (perl style):
>
>     \copy foo FROM 'myfilter dodgy-data.csv |'
>
> or maybe
>
>     \copy foo from pipe 'myfilter dodgy-data.csv'
>
> or possibly
>
>     \pipe foo from dodgy-data.csv using autodetecting-format-filter.pl
>
> Which would cause psql to fork/exec the filter and pass the output data
> to the server. We could then provide all sorts of parsers for
> format-of-the-week. This would probably include the converse:
>
>     \pipe foo to table.xls using make-excel-spreadsheet.pl
>
> Have a nice day,
> --
> Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> > tool for doing 5% of the work and then sitting around waiting for someone
> > else to do the other 95% so you can sue them.
Tom Lane wrote:
>> Where would you want this file conversion
>> utility? bin? contrib? pgfoundry?
>
> I'd say pgfoundry for starters --- there's no reason to tie it down
> to server release cycles. Maybe when the thing is fairly mature and
> doesn't need frequent releases, we could think about whether it ought
> to be brought into the core distro.
>
> However, it likely won't ever be a candidate to become part of core
> unless it's written in C, and offhand I would judge C to not be the
> best choice of implementation language for such a thing. This is surely
> going to be mostly a string-pushing type of problem, so something like
> perl might be a better bet.

You are probably right. The biggest wrinkle will be dealing with various encodings, I suspect. That at least is one thing that doing CSV within the backend bought us fairly painlessly. Perl's Text::CSV_XS module for example simply handles this by declaring that only [\x09\x20-\x7f] are valid in its non-binary mode, and in either mode appears to be MBCS unaware. We should try to do better than that.

cheers

andrew
On Monday, December 12, 2005, at 15:08 -0500, Andrew Dunstan wrote:
> Tom Lane wrote:
...
> You are probably right. The biggest wrinkle will be dealing with various
> encodings, I suspect. That at least is one thing that doing CSV within
> the backend bought us fairly painlessly. Perl's Text::CSV_XS module for
> example simply handles this by declaring that only [\x09\x20-\x7f] are
> valid in its non-binary mode, and in either mode appears to be MBCS
> unaware. We should try to do better than that.

Are there any test datafiles available in a repository? I could give it a shot, I think.

If not, maybe we could set up something like that.

Regards
Tino
On Mon, Dec 12, 2005 at 09:30:12PM +0100, Tino Wildenhain wrote:
> On Monday, December 12, 2005, at 15:08 -0500, Andrew Dunstan wrote:
> > You are probably right. The biggest wrinkle will be dealing with various
> > encodings, I suspect. That at least is one thing that doing CSV within
> > the backend bought us fairly painlessly. Perl's Text::CSV_XS module for
> > example simply handles this by declaring that only [\x09\x20-\x7f] are
> > valid in its non-binary mode, and in either mode appears to be MBCS
> > unaware. We should try to do better than that.
>
> Are there any test datafiles available in a repository?
> I could give it a shot I think.
>
> If not maybe we could set up something like that.

Note, recent versions of Perl allow you to specify the file encoding when you open the file and will convert things to UTF-8 as appropriate. So in theory it should be fairly simple to make a script that could handle various encodings. The hardest part is always determining which encoding a file is in in the first place...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
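[A minimal sketch of the encoding handling Martijn describes: declare the input file's encoding when opening it and let perl decode it, then write the data back out as UTF-8. The 'latin1' encoding and the file names are assumptions for illustration; as noted above, the real source encoding still has to be known or guessed.]

#!/usr/bin/perl
# Hypothetical sketch (not from the thread): decode the input using a
# declared encoding layer and re-emit it as UTF-8 for loading with COPY.
# 'latin1' and the file names are placeholders.
use strict;
use warnings;

open my $in,  '<:encoding(latin1)', 'dodgy-latin1.csv' or die "open input: $!";
open my $out, '>:encoding(UTF-8)',  'clean-utf8.csv'   or die "open output: $!";

while (my $line = <$in>) {
    print {$out} $line;
}

close $in;
close $out;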
> Anyway, that's history now. Where would you want this file conversion
> utility? bin? contrib? pgfoundry?

How about a full SQL*Loader clone? :D