Re: copy from questions - Mailing list pgsql-general

From Steve Crawford
Subject Re: copy from questions
Date
Msg-id 50D235C8.3010408@pinpointresearch.com
Whole thread Raw
In response to Re: copy from questions  (Kirk Wythers <kirk.wythers@gmail.com>)
List pgsql-general
On 12/19/2012 01:06 PM, Kirk Wythers wrote:
Thanks for the reply Steve. These suggestions are new to me, so I'd like to rephrase them back to you in order to make sure I understand the bits and details. 



On Dec 19, 2012, at 10:48 AM, Steve Crawford <scrawford@pinpointresearch.com> wrote:

I suppose you could use a trigger to check each record before inserting but that is likely to be inefficient for bulk loads. A quick bash loop is probably your best bet. Something along the lines of:

for inputfile in /infiledirectory/*.csv
do
   cat inputfile | psql [connection-params] -c '\copy rawinput from stdin csv header...'
done

I get this… If my except for the header… bit. Here is my interpretation of your code including my specific connection parameters. 

#!/bin/sh
for inputfile in '/Volumes/disk7/b4warmed3/export/60min2/*.txt'
do
   cat inputfile | psql -p 54321 -h localhost -c '\copy rawinput FROM stdin WITH CSV HEADER DELIMTER AS ',' NULL AS 'NA '
done
You have a mixture of bash and PostgreSQL/psql issues.

Bash first: You don't need to wrap the inputfile glob in quotes. Your filespec will be expanded to a list of the files matching your pattern. I'm assuming that you have not populated your directory with filenames that contain spaces. If so, either correct it or start Googling for how to deal with lists of space-containing file names.

Also, you will have problems with using single-quotes within a string delimited by single-quotes. Combine that with the psql problem where it appears you have to specify delimiter before CSV (though you probably don't need to specify delimiter at all if your delimiter is a comma) and things will blow up. Try:

.... -c "\copy rawinput FROM stdin with delimiter as ',' NULL AS 'NA ' CSV HEADER"


I added single quotes around the path to the input files. Correct right?


This imports everything into a "staging" table (I called it rawinput). From there you can create your final table with SELECT DISTINCT…

This bit needs to be as a separate step right? (rowid is the primary key)

SELECT DISTINCT ON (rowid) *
FROM rawinput;

From here do I need another COPY FROM or some kind of INSERT statement?

I had assumed you had fully duplicated rows in which you could simply do;
INSERT INTO realtable SELECT DISTINCT * FROM rawinput;
(assuming realtable and rawinput have exactly the same structure and column ordering, of course)

If that is not the case, you will need to update your select statement to return one from among your duplicated IDs based on your desired criteria.



For speed make sure that you create your staging table as "unlogged".

I understand that I need to create the rawinput table first, but I am unfamiliar with the "UNLOGGED" option. I assume it makes things faster… Does it go something like:

CREATE TABLE UNLOGGED rawinput;
Something like but not exactly. The syntax is CREATE UNLOGGED TABLE... Except for UNLOGGED before TABLE it's exactly like a regular create-table statement.

BEWARE: Unlogged is *NOT* a magic go-fast directive that comes at no cost. In normal tables data is written to the transaction log allowing the database to recover in the event of an unplanned shutdown. Unlogged tables, as the name implies, skip the writing of the write-ahead log gaining speed at the expense of data safety. For bulk-load operations that can be redone if they fail this is a reasonable approach. For normal operations where you expect your data to be well protected it is not. Depending on a variety of factors the speed improvement you will see ranges from "minimal" to "substantial"


Do I need to create all the variables (including types) in rawinput as well? If so, then I assume that I do not want rowid to have a primary key… or else I would be back where I started.
I would make rawinput exactly the same as your real table except without indexes or constraints. The easiest way is probably:
CREATE UNLOGGED TABLE rawinput as SELECT * FROM realtable limit 0;




Cheers,
Steve


Thanks again,

Kirk


Cheers,
Steve

pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: pg_xlog is getting bigger
Next
From: Adrian Klaver
Date:
Subject: Re: pg_xlog is getting bigger