Re: How to insert a bulk of data with unique-violations very fast - Mailing list pgsql-performance

From Cédric Villemain
Subject Re: How to insert a bulk of data with unique-violations very fast
Date
Msg-id AANLkTim0X5AyDz6K7uPdwCwoqNNSOjAUEsVJ9maS8aO0@mail.gmail.com
Whole thread Raw
In response to How to insert a bulk of data with unique-violations very fast  (Torsten Zühlsdorff <foo@meisterderspiele.de>)
Responses Re: How to insert a bulk of data with unique-violations very fast
List pgsql-performance
2010/6/1 Torsten Zühlsdorff <foo@meisterderspiele.de>:
> Hello,
>
> i have a set of unique data which about 150.000.000 rows. Regullary i get a
> list of data, which contains multiple times of rows than the already stored
> one. Often around 2.000.000.000 rows. Within this rows are many duplicates
> and often the set of already stored data.
> I want to store just every entry, which is not within the already stored
> one. Also i do not want to store duplicates. Example:
>
> Already stored set:
> a,b,c
>
> Given set:
> a,b,a,c,d,a,c,d,b
>
> Expected set after import:
> a,b,c,d
>
> I now looking for a faster way for the import. At the moment i import the
> new data with copy into an table 'import'. then i remove the duplicates and
> insert every row which is not already known. after that import is truncated.
>
> Is there a faster way? Should i just insert every row and ignore it, if the
> unique constrain fails?
>
> Here the simplified table-schema. in real life it's with partitions:
> test=# \d urls
>                         Tabelle »public.urls«
>  Spalte |   Typ   |                       Attribute
> --------+---------+-------------------------------------------------------
>  url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
>  url    | text    | not null
> Indexe:
>    »urls_url« UNIQUE, btree (url)
>    »urls_url_id« btree (url_id)
>
> Thanks for every hint or advice! :)

I think you need to have a look at pgloader. It does COPY with error
handling. very effective.

http://pgloader.projects.postgresql.org/

>
> Greetings from Germany,
> Torsten
> --
> http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8
> verschiedenen Datenbanksystemen abstrahiert,
> Queries von Applikationen trennt und automatisch die Query-Ergebnisse
> auswerten kann.
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

pgsql-performance by date:

Previous
From: Anj Adu
Date:
Subject: Re: slow query performance
Next
From: Anj Adu
Date:
Subject: slow query