Thread: COPY and double-quote-separated fields
Hi,

PostgreSQL 7.3.3 on i386-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3 (Debian)

I must load a file in which each field in each row is contained
in double quotes, e.g.:

"1","foo"
"2","bar"
"3","puddle"

test1=# copy foo from '/usr/local/data/zip/foo.unl';
ERROR:  copy: line 1, pg_atoi: error in ""1","foo"": can't parse
""1","foo""

I have a very large dataset (14.3GiB in 22 files) in this format, and
would prefer not to run sed over each one, creating new files in the
process.

Any suggestions?

Many TIA

-- 
+-----------------------------------------------------------------+
| Ron Johnson, Jr.        Home: ron.l.johnson@cox.net             |
| Jefferson, LA  USA                                              |
|                                                                 |
| "I'm not a vegetarian because I love animals, I'm a vegetarian  |
|  because I hate vegetables!"                                    |
|     unknown                                                     |
+-----------------------------------------------------------------+
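For reference, the failure is easy to reproduce with a tiny sample in the same layout (the path below is illustrative, not the poster's real file). With COPY's default tab delimiter, each whole line arrives as a single field, quotes and all, so the integer column sees the literal text `"1","foo"` and pg_atoi rejects it:

```shell
# Build a three-row sample in the quoted, comma-separated format
# described in the post. /tmp/foo.unl is a hypothetical stand-in
# for the 14.3GiB originals.
cat > /tmp/foo.unl <<'EOF'
"1","foo"
"2","bar"
"3","puddle"
EOF

# Each line is one tab-delimited "record" as far as COPY is concerned;
# the quotes are just data.
head -n 1 /tmp/foo.unl
```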
Hi Ron,

I have the same issue - the best thing is to run the data through sed
and pipe it straight into the psql import; that way you never have to
worry about creating miscellaneous files. I also always use an
intermediate temporary import table rather than populating my actual db
tables, and I usually copy from stdin rather than from a file. That way
you can do your imports without the data having to sit in a directory
readable by the pg process owner, ie:

cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g" \
  | psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"

The first sed replaces all instances of "," with a tab character (I get
a tab into my shell scripts with a line like TAB=$(printf "\t")). The
second sed removes the leading and trailing " because the first one only
matches ",". The third sed removes the Windows carriage return, if any -
that depends on the platform the file was generated on.

Note that my file format never contains any " inside a field, so I can
safely run the second sed. If your import file does contain embedded
quotes, you will need to change the regex or use awk.

If you don't want to use sed at all, you can probably do:

cat unprocesseddata.txt | psql -dmydatabase -c "copy importtable from stdin delimiter '\",\"';"

(you may have to investigate the escape characters around the " there)
and then run:

update importtable set firstfield = trim(leading '"' from firstfield);
update importtable set lastfield = trim(trailing '"' from lastfield);

or, if you're concerned about additional " in the import:

update importtable set firstfield = substring(firstfield, 2, length(firstfield) - 1);
update importtable set lastfield = substring(lastfield, 1, length(lastfield) - 1);

... within the same transaction.

hth,

Jason

On Sun, 3 Aug 2003 05:36 pm, Ron Johnson wrote:
> Hi,
>
> PostgreSQL 7.3.3 on i386-pc-linux-gnu, compiled by GCC gcc
> (GCC) 3.3 (Debian)
>
> I must load a file in which each field in each row is contained
> in double quotes, i.e.
> "1","foo"
> "2","bar"
> "3","puddle"
>
> test1=# copy foo from '/usr/local/data/zip/foo.unl';
> ERROR: copy: line 1, pg_atoi: error in ""1","foo"": can't parse
> ""1","foo""
>
> I have a very large dataset (14.3GiB in 22 files) that has such
> a format, and would prefer not to sed each one of them, creating
> new files in the process.
>
> Any suggestions?
>
> Many TIA
On Sun, 2003-08-03 at 06:08, Jason Godden wrote:
> Hi Ron,
>
> I have the same issue - best thing is to run the data through sed but
> pipe it straight to the psql import table - that way you never have to
> worry about creating misc. files. I also always use an intermediate
> temporary import table rather than populate my actual db tables and
> usually copy from stdin rather than a file. That way you can do your
> imports without having the data in a directory readable by the pg
> process owner, ie:
>
> cat unprocesseddata.txt | sed "s/\",\"/$TAB/g" | sed "s/\"//g" | sed "s/.$//g"
>   | psql -dmydatabase -c "copy importtable from stdin delimiter '\t';"

Great...

cat first10.unl | sed "s/\"//g" | \
  psql test1 -c "copy t_lane_tx2 from stdin delimiter ',';"

Then I do some date-related transforms before inserting into the main
table.

[snip]

> Note here that my file format never contains any " in a field so I can
> safely run the second sed. If your import file contains these then you
> will need to change the regex or use awk.

[snip]

> cat unprocesseddata.txt | psql -dmydatabase -c "copy importtable from stdin
> delimiter '\",\"';"

Wouldn't work, because the 1st field has its leading \", and the last
field has its trailing \".

Thanks

[snip]
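Ron's simpler one-pass variant, sketched on sample data (first10.unl and t_lane_tx2 are his names; the psql stage is shown as a comment so the transform can be checked on its own). Stripping every quote and splitting on bare commas is fine for this dataset, but it would mis-split any field that itself contained a comma, so it trades Jason's tab-delimiter safety for brevity:

```shell
# Equivalent of: cat first10.unl | sed "s/\"//g" | psql ...
# using inline sample rows instead of the real file.
printf '"1","foo"\n"2","bar"\n' | sed 's/"//g'
#   | psql test1 -c "copy t_lane_tx2 from stdin delimiter ',';"
```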