Re: How to insert a bulk of data with unique-violations very fast - Mailing list pgsql-performance

From Torsten Zühlsdorff
Subject Re: How to insert a bulk of data with unique-violations very fast
Date
Msg-id huiroa$sn6$1@news.eternal-september.org
In response to Re: How to insert a bulk of data with unique-violations very fast  ("Pierre C" <lists@peufeu.com>)
Responses Re: How to insert a bulk of data with unique-violations very fast  ("Pierre C" <lists@peufeu.com>)
List pgsql-performance
Pierre C wrote:
> Since you have lots of data you can use parallel loading.
>
> Split your data in several files and then do:
>
> CREATE TEMPORARY TABLE loader1 ( ... )
> COPY loader1 FROM ...
>
> Use a TEMPORARY TABLE for this: you don't need crash-recovery since if
> something blows up, you can COPY it again... and it will be much faster
> because no WAL will be written.

That's good advice, thank you :)
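
Just so I have it in one place, here is a minimal sketch of how I read
the suggestion (the column name and the file path are made up; for
client-side files, psql's \copy works the same way):

-- one temporary loader table per core / input file;
-- temporary tables skip WAL, which makes the COPY much faster
CREATE TEMPORARY TABLE loader1 (
    hostname text NOT NULL
);

COPY loader1 FROM '/tmp/domains.part1';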

> If your disk is fast, COPY is CPU-bound, so if you can do 1 COPY process
> per core, and avoid writing WAL, it will scale.
>
> This doesn't solve the other half of your problem (removing the
> duplicates) which isn't easy to parallelize, but it will make the COPY
> part a lot faster.
>
> Note that you can have 1 core process the INSERT / removing duplicates
> while the others are handling COPY and filling temp tables, so if you
> pipeline it, you could save some time.
>
> Does your data contain a lot of duplicates, or are they rare? What
> percentage?

Within the data to import, most rows have 20 to 50 duplicates; sometimes
many more, sometimes fewer.

But over 99.1% of the rows to import are already known. This percentage
keeps growing, because the set of rows I want to collect is finite.
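
So the merge step mostly has to throw rows away. Here is a sketch of how
I would insert only the unknown rows from a loader table (table and
column names are again made up):

-- DISTINCT removes duplicates within the batch itself;
-- NOT EXISTS skips rows that are already known.
-- This assumes a single process does the merge, as suggested
-- above, so no concurrent insert can sneak in a duplicate.
INSERT INTO domains (hostname)
SELECT DISTINCT l.hostname
FROM loader1 l
WHERE NOT EXISTS (
    SELECT 1 FROM domains d WHERE d.hostname = l.hostname
);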

In my particular case I'm collecting domain names. So far it's completely
a private interest, running on ordinary PC hardware. I collect the names
by crawling known sites and checking them for new hosts. Maybe later I
will build an expired-domain service or a reverse-IP database or
something like that. For now I'm just interested in how the sites are
connected and in the structure people choose for their domain names.

(Before someone asks: so far I have more rows than there are (nearly)
domains in existence, because I also collect subdomains of all levels
and never delete entries.)

Thanks everyone for your advice. It will help me a lot!

Greetings from Germany,
Torsten
--
http://www.dddbl.de - a database layer that abstracts the work with 8
different database systems, separates queries from the application code
and can automatically evaluate query results.
