Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY" - Mailing list pgsql-general

From Francisco Olarte
Subject Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
Date
Msg-id CA+bJJbwjOgyy5+RFb+CVNntWFrwZKekG-qhcRy65HAA7FyLJJQ@mail.gmail.com
Whole thread Raw
In response to How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"  (Arup Rakshit <aruprakshit@rocketmail.com>)
List pgsql-general
Hi Arup

On Sun, May 24, 2015 at 12:26 PM, Arup Rakshit
<aruprakshit@rocketmail.com> wrote:
> I am copying the data from a CSV file to a Table using "COPY" command. But one thing that I got stuck, is how to skip
duplicaterecords while copying from CSV to tables. By looking at the documentation, it seems, Postgresql don't have any
inbuilttoo to handle this with "copy" command. By doing Google I got below 1 idea to use temp table. 
> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
> I am also thinking what if I let the records get inserted, and then delete the duplicate records from table as this
postsuggested - http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com. 
> Both of the solution looks like doing double work. But I am not sure which is the best solution here. Can anybody
suggestwhich approach should I adopt ? Or if any better ideas you guys have on this task, please share. 

Without looking at your SO answers ( I suppose both consist on copying
to a temp table, then either doing an 'insert into main table select *
from temp where not exists...' or a delete from temp / insert into
main , which is what I would suggest ) I would point out copy is not
the tool for this kind of jobs. Copy is normally used for bulk loading
of correct data, and is great at this. If you need to preclean data,
you are not doing double work using one of those solutions. Many ways
of doing it have been pointed, the exact one depends on your data.
You've pointed there are duplicate rows, if this is true you can
easily do that using text tools and dumps or the temporary table.
There is also the solution of just turning your CSV with your favorite
text handiling tool into a set of insert lines conditioned of
inexistence of an appropiately indexed set of fields in the table (
which I've done but cannot quote exactly, it was an experiment and got
quite hairy ). From my experience, go for the temporary table plus
insert-select/delete+select route, it's easier, is normally faster (
supposing you have appropiate indexes ) and it really is no double
work. You have dirty data ( duplicates ), you have to clean it before
inserting.

Or, you could turn the file into a series of plain inserts and feed it
to psql in autocommit mode, if you have some unique condition on the
table, so it errors out on every duplicate. It's not pretty but should
work. I still recommend the temp table approach.

Francisco Olarte.


pgsql-general by date:

Previous
From: Arup Rakshit
Date:
Subject: Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
Next
From: Tom Lane
Date:
Subject: Re: Re: Server tries to read a different config file than it is supposed to