Thread: COPY and double-quote-separated fields

COPY and double-quote-separated fields

From: Ron Johnson
Hi,

PostgreSQL 7.3.3 on i386-pc-linux-gnu, compiled by GCC gcc
(GCC) 3.3 (Debian)

I must load a file in which each field in each row is wrapped
in double quotes, e.g.:
"1","foo"
"2","bar"
"3","puddle"

test1=# copy foo from '/usr/local/data/zip/foo.unl';
ERROR:  copy: line 1, pg_atoi: error in ""1","foo"": can't parse
""1","foo""

I have a very large dataset (14.3GiB in 22 files) that has such
a format, and would prefer not to sed each one of them, creating
new files in the process.

Any suggestions?

Many TIA
--
+-----------------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net             |
| Jefferson, LA  USA                                              |
|                                                                 |
| "I'm not a vegetarian because I love animals, I'm a vegetarian  |
|  because I hate vegetables!"                                    |
|    unknown                                                      |
+-----------------------------------------------------------------+



Re: COPY and double-quote-separated fields

From: Jason Godden
Hi Ron,

I have the same issue - the best thing is to run the data through sed and pipe
it straight into psql, so you never have to worry about creating intermediate
files.  I also always use a temporary import table rather than populating my
actual db tables, and I usually copy from stdin rather than from a file.  That
way you can do your imports without the data having to sit in a directory
readable by the pg process owner, e.g.:

cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g" \
  | psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"

The first sed will replace all instances of "," with a tab character
(I achieve tab in my shell scripts with a line like TAB=$(printf "\t"))

The second sed will remove the leading and trailing " on each line, which the
first one misses because it only matches ",".

The third sed will remove the Windows carriage returns (if any - it depends on
the platform the file was generated on).  Note that s/.$// blindly deletes the
last character of every line, so only keep that stage if the file really does
have CRLF line endings; with GNU sed, s/\r$// is safer since it only strips an
actual carriage return.

Note here that my file format never contains any " in a field so I can safely
run the second sed.  If your import file contains these then you will need to
change the regex or use awk.
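
For example, tracing one of the sample lines through all three stages, with
the invisible characters made visible (assuming GNU sed and coreutils; cat -A
shows a tab as ^I and end-of-line as $, so a leftover carriage return would
show up as ^M):

$ TAB=$(printf "\t")
$ printf '"1","foo"\r\n' | sed "s/\",\"/$TAB/g" | sed "s/\"//g" \
    | sed "s/.$//g" | cat -A
1^Ifoo$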

If you don't want to use sed at all however you can probably do:

cat unprocesseddata.txt | psql -dmydatabase -c "copy importtable from stdin
delimiter '\",\"';"

(you may have to investigate the escape characters around the " there - inside
the shell double quotes \" reaches psql as a plain ", so COPY is handed the
three-character delimiter '","', and COPY may well refuse anything longer than
a single character)

and then run:

update importtable set firstfield = trim(leading '"' from firstfield);
update importtable set lastfield = trim(trailing '"' from lastfield);

or if you're concerned about additional " in the import:

update importtable set firstfield =
substring(firstfield,2,length(firstfield)-1);
update importtable set lastfield =
substring(lastfield,1,length(lastfield)-1);

... within the same transaction.
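
The practical difference: trim strips every leading/trailing " it finds while
substring drops exactly one character, so on a value that happens to begin
with a doubled quote you'd see something like:

mydatabase=# select trim(leading '"' from '""1') as trimmed,
mydatabase-#        substring('""1', 2, length('""1') - 1) as substringed;
 trimmed | substringed
---------+-------------
 1       | "1
(1 row)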

hth,

Jason

On Sun, 3 Aug 2003 05:36 pm, Ron Johnson wrote:
> [snip]


Re: COPY and double-quote-separated fields

From: Ron Johnson
On Sun, 2003-08-03 at 06:08, Jason Godden wrote:
> [snip]
>
> cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g" \
>   | psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"

Great...

cat first10.unl | sed "s/\"//g" | \
        psql test1 -c "copy t_lane_tx2 from stdin delimiter ',';"

Then do some date-related transforms before inserting into the main
table.
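
Probably something along these lines for that second step (column names and
the date format are invented here, just to show the shape of it):

-- hypothetical column names; the real transform depends on the schema
insert into t_lane_tx (lane_id, tx_time)
select lane_id, to_timestamp(tx_time_txt, 'MM/DD/YYYY HH24:MI:SS')
from t_lane_tx2;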

[snip]
> Note here that my file format never contains any " in a field so I can safely
> run the second sed.  If your import file contains these then you will need to
> change the regex or use awk.
[snip]
> cat unprocesseddata.txt | psql -dmydatabase -c "copy importtable from stdin
> delimiter '\",\"';"

Wouldn't work, because the 1st field has its leading \", and the last
field has its trailing \".

Thanks

--
+-----------------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net             |
| Jefferson, LA  USA                                              |
|                                                                 |
| "I'm not a vegetarian because I love animals, I'm a vegetarian  |
|  because I hate vegetables!"                                    |
|    unknown                                                      |
+-----------------------------------------------------------------+