Thread: copy from questions

copy from questions

From
Kirk Wythers
Date:
I am using version 9.1 and have a large number of files to insert. I am trying to use a simple COPY FROM command but have a couple of questions.

1. There are a small number of instances where duplicate records are being caught by the primary key (as it is supposed to do). However, the duplicates are artifacts of the recording system and can be ignored. I would like to either skip the duplicate or UPDATE the table with it… anything that allows the COPY FROM to proceed while adding only one of the duplicate records to the table.

2. Since I have several hundred files to perform a COPY FROM on, I'd like to automate the import in some way… sort of a "grab all files in the directory" approach:

COPY newtable FROM '/directory_of_files/*' WITH CSV HEADER DELIMITER AS ',' NULL AS 'NA';

Is this possible?

Thanks in advance

Re: copy from questions

From
Steve Crawford
Date:
On 12/19/2012 08:13 AM, Kirk Wythers wrote:
> I am using version 9.1 and have a large number of files to insert. I am trying to use a simple COPY FROM command but have a couple of questions.
>
> 1. There are a small number of instances where duplicate records are being caught by the primary key (as it is supposed to do). However, the duplicates are artifacts of the recording system and can be ignored. I would like to either skip the duplicate or UPDATE the table with it… anything that allows the COPY FROM to proceed while adding only one of the duplicate records to the table.
>
> 2. Since I have several hundred files to perform a COPY FROM on, I'd like to automate the import in some way… sort of a "grab all files in the directory" approach:
>
> COPY newtable FROM '/directory_of_files/*' WITH CSV HEADER DELIMITER AS ',' NULL AS 'NA';
>
I suppose you could use a trigger to check each record before inserting
but that is likely to be inefficient for bulk loads. A quick bash loop
is probably your best bet. Something along the lines of:

for inputfile in /infiledirectory/*.csv
do
    cat "$inputfile" | psql [connection-params] -c '\copy rawinput from stdin csv header...'
done

This imports everything into a "staging" table (I called it rawinput). From there you can create your final table with SELECT DISTINCT...

For speed make sure that you create your staging table as "unlogged".
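
To make that concrete, a rough sketch (assuming the final table is the newtable from your example, that it already exists with the columns you want, and that the duplicate rows are identical in every column) would be:

-- staging table: same columns as the target, but no constraints and no WAL logging
CREATE UNLOGGED TABLE rawinput AS SELECT * FROM newtable LIMIT 0;

-- after the bash loop has loaded all the files, keep one copy of each row
INSERT INTO newtable SELECT DISTINCT * FROM rawinput;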

Cheers,
Steve



Re: copy from questions

From
Kirk Wythers
Date:
Thanks for the reply, Steve. These suggestions are new to me, so I'd like to rephrase them back to you in order to make sure I understand the bits and details.



On Dec 19, 2012, at 10:48 AM, Steve Crawford <scrawford@pinpointresearch.com> wrote:

I suppose you could use a trigger to check each record before inserting but that is likely to be inefficient for bulk loads. A quick bash loop is probably your best bet. Something along the lines of:

for inputfile in /infiledirectory/*.csv
do
   cat "$inputfile" | psql [connection-params] -c '\copy rawinput from stdin csv header...'
done

I get this… except for the "header…" bit. Here is my interpretation of your code, including my specific connection parameters.

#!/bin/sh
for inputfile in '/Volumes/disk7/b4warmed3/export/60min2/*.txt'
do
   cat inputfile | psql -p 54321 -h localhost -c '\copy rawinput FROM stdin WITH CSV HEADER DELIMTER AS ',' NULL AS 'NA '
done

I added single quotes around the path to the input files. Correct, right?


This imports everything into a "staging" table (I called it rawinput). From there you can create your final table with SELECT DISTINCT…

This bit needs to be a separate step, right? (rowid is the primary key)

SELECT DISTINCT ON (rowid) *
FROM rawinput;

From here do I need another COPY FROM or some kind of INSERT statement? 


For speed make sure that you create your staging table as "unlogged".

I understand that I need to create the rawinput table first, but I am unfamiliar with the "UNLOGGED" option. I assume it makes things faster… Does it go something like:

CREATE TABLE UNLOGGED rawinput; 

Do I need to create all the variables (including types) in rawinput as well? If so, then I assume that I do not want rowid to have a primary key… or else I would be back where I started. 



Cheers,
Steve


Thanks again,

Kirk


Re: copy from questions

From
Steve Crawford
Date:
On 12/19/2012 01:06 PM, Kirk Wythers wrote:
Thanks for the reply, Steve. These suggestions are new to me, so I'd like to rephrase them back to you in order to make sure I understand the bits and details.



On Dec 19, 2012, at 10:48 AM, Steve Crawford <scrawford@pinpointresearch.com> wrote:

I suppose you could use a trigger to check each record before inserting but that is likely to be inefficient for bulk loads. A quick bash loop is probably your best bet. Something along the lines of:

for inputfile in /infiledirectory/*.csv
do
   cat "$inputfile" | psql [connection-params] -c '\copy rawinput from stdin csv header...'
done

I get this… except for the "header…" bit. Here is my interpretation of your code, including my specific connection parameters.

#!/bin/sh
for inputfile in '/Volumes/disk7/b4warmed3/export/60min2/*.txt'
do
   cat inputfile | psql -p 54321 -h localhost -c '\copy rawinput FROM stdin WITH CSV HEADER DELIMTER AS ',' NULL AS 'NA '
done
You have a mixture of bash and PostgreSQL/psql issues.

Bash first: You don't need to wrap the inputfile glob in quotes. Your filespec will be expanded to a list of the files matching your pattern. I'm assuming that you have not populated your directory with filenames that contain spaces. If so, either correct it or start Googling for how to deal with lists of space-containing file names.

Also, you will have problems with using single-quotes within a string delimited by single-quotes. Combine that with the psql problem where it appears you have to specify delimiter before CSV (though you probably don't need to specify delimiter at all if your delimiter is a comma) and things will blow up. Try:

.... -c "\copy rawinput FROM stdin with delimiter as ',' NULL AS 'NA ' CSV HEADER"
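
Putting the bash and psql pieces together, the whole loop might look roughly like this (only a sketch, reusing the path and port from your script; I'm assuming the NULL marker really is 'NA' with no trailing space, and you'll want to add whatever -d/-U options you normally pass to psql). Note the "$inputfile" with the dollar sign and double quotes; without the $ the loop would try to cat a literal file named inputfile.

#!/bin/sh
for inputfile in /Volumes/disk7/b4warmed3/export/60min2/*.txt
do
    cat "$inputfile" | psql -p 54321 -h localhost -c "\copy rawinput FROM stdin WITH DELIMITER AS ',' NULL AS 'NA' CSV HEADER"
done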


I added single quotes around the path to the input files. Correct, right?


This imports everything into a "staging" table (I called it rawinput). From there you can create your final table with SELECT DISTINCT…

This bit needs to be a separate step, right? (rowid is the primary key)

SELECT DISTINCT ON (rowid) *
FROM rawinput;

From here do I need another COPY FROM or some kind of INSERT statement?

I had assumed you had fully duplicated rows, in which case you could simply do:
INSERT INTO realtable SELECT DISTINCT * FROM rawinput;
(assuming realtable and rawinput have exactly the same structure and column ordering, of course)

If that is not the case, you will need to update your select statement to return one from among your duplicated IDs based on your desired criteria.
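
For example, if keeping an arbitrary one of the rows that share a rowid is good enough, something along these lines (just a sketch) should do it:

INSERT INTO realtable
SELECT DISTINCT ON (rowid) *
FROM rawinput
ORDER BY rowid;

Add more columns to the ORDER BY after rowid if you care which of the duplicates wins.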



For speed make sure that you create your staging table as "unlogged".

I understand that I need to create the rawinput table first, but I am unfamiliar with the "UNLOGGED" option. I assume it makes things faster… Does it go something like:

CREATE TABLE UNLOGGED rawinput;

Something like that, but not exactly. The syntax is CREATE UNLOGGED TABLE... Except for UNLOGGED coming before TABLE, it's exactly like a regular create-table statement.

BEWARE: Unlogged is *NOT* a magic go-fast directive that comes at no cost. In normal tables, data is written to the transaction log, allowing the database to recover in the event of an unplanned shutdown. Unlogged tables, as the name implies, skip the write-ahead log, gaining speed at the expense of data safety. For bulk-load operations that can be redone if they fail, this is a reasonable approach. For normal operations where you expect your data to be well protected, it is not. Depending on a variety of factors, the speed improvement you will see ranges from "minimal" to "substantial".


Do I need to create all the variables (including types) in rawinput as well? If so, then I assume that I do not want rowid to have a primary key… or else I would be back where I started.

I would make rawinput exactly the same as your real table, except without indexes or constraints. The easiest way is probably:
CREATE UNLOGGED TABLE rawinput as SELECT * FROM realtable limit 0;
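
That gives you a table with the same column names and types as realtable but none of its indexes or constraints (including the primary key), so the duplicated rows will load without complaint.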




Cheers,
Steve


Thanks again,

Kirk


Cheers,
Steve