Thread: How to insert a bulk of data with unique-violations very fast
Hello,

I have a set of unique data of about 150,000,000 rows. Regularly I get a
list of data which contains many times more rows than the already stored
set - often around 2,000,000,000 rows. Within these rows are many
duplicates, and often the complete set of already stored data.
I want to store only the entries which are not already stored, and I do
not want to store duplicates either. Example:

Already stored set:
a,b,c

Given set:
a,b,a,c,d,a,c,d,b

Expected set after import:
a,b,c,d

I am now looking for a faster way to do the import. At the moment I import
the new data with COPY into a table 'import'. Then I remove the duplicates
and insert every row which is not already known. After that, 'import' is
truncated.

Is there a faster way? Should I just insert every row and ignore it if the
unique constraint fails?

Here is the simplified table schema; in real life it is partitioned:

test=# \d urls
                             Tabelle »public.urls«
 Spalte |   Typ   |                       Attribute
--------+---------+-------------------------------------------------------
 url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
 url    | text    | not null
Indexe:
    »urls_url« UNIQUE, btree (url)
    »urls_url_id« btree (url_id)

Thanks for every hint or advice! :)

Greetings from Germany,
Torsten

--
http://www.dddbl.de - a database layer that abstracts the work with 8
different database systems, separates queries from the application and can
automatically evaluate query results.
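[For illustration, the workflow described above might look roughly like
the following sketch. The staging table name 'import' is taken from the
post; the file path and the single url column are assumptions.]

-- hypothetical sketch of the current import workflow described above
CREATE TABLE import (url text);

COPY import (url) FROM '/path/to/new_urls.txt';

-- drop duplicates within the batch and skip already-known URLs
INSERT INTO urls (url)
SELECT DISTINCT i.url
FROM import i
WHERE i.url NOT IN (SELECT url FROM urls);

TRUNCATE import;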
On Tue, Jun 1, 2010 at 9:03 AM, Torsten Zühlsdorff
<foo@meisterderspiele.de> wrote:
> Hello,
>
> I have a set of unique data of about 150,000,000 rows. Regularly I get a
> list of data which contains many times more rows than the already stored
> set - often around 2,000,000,000 rows. Within these rows are many
> duplicates, and often the complete set of already stored data.
> I want to store only the entries which are not already stored, and I do
> not want to store duplicates either. Example:

The standard method in pgsql is to load the data into a temp table, then
insert where not exists in the old table.
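[Applied to the urls table from the original post, that suggestion might
look like this minimal sketch; the temp table name and file path are made
up.]

CREATE TEMP TABLE url_load (url text);

COPY url_load FROM '/path/to/new_urls.txt';

-- insert only values not already present in urls
INSERT INTO urls (url)
SELECT DISTINCT l.url
FROM url_load l
WHERE NOT EXISTS (SELECT 1 FROM urls u WHERE u.url = l.url);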
On Thu, Jun 3, 2010 at 11:19 AM, Torsten Zühlsdorff
<foo@meisterderspiele.de> wrote:
> Scott Marlowe schrieb:
>>
>> The standard method in pgsql is to load the data into a temp table,
>> then insert where not exists in the old table.
>
> Sorry, I didn't get it. I've googled some examples, but none matches my
> case. Every example I found was a single insert which should be done or
> ignored if the row is already stored.
>
> But in my case I have a bulk of rows with duplicates. Either your tip
> doesn't match my case or I didn't understand it correctly. Can you
> provide a simple example?

create table main (id int primary key, info text);
create table loader (id int, info text);
insert into main values (1,'abc'),(2,'def'),(3,'ghi');
insert into loader values (1,'abc'),(4,'xyz');

select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
(3 rows)

select * from loader;
 id | info
----+------
  1 | abc
  4 | xyz
(2 rows)

insert into main select * from loader except select * from main;

select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
  4 | xyz
(4 rows)

Note that for the except approach to work, the rows need to match in all
fields, or you'd need a more complex query. If the info field here was
different you'd get an error and no insert / update. For that case you
might want to use "where not in":

insert into main select * from loader where id not in (select id from main);

If you wanted the new rows to update pre-existing rows, you could run an
update first where the ids match, then the insert where no id matches.
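[A minimal sketch of that update-then-insert variant, using the same
main/loader tables and assuming loader holds at most one row per id.]

-- update rows whose id already exists in main
UPDATE main m
SET    info = l.info
FROM   loader l
WHERE  m.id = l.id;

-- then insert the rows whose id is not present yet
INSERT INTO main
SELECT * FROM loader l
WHERE  l.id NOT IN (SELECT id FROM main);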
2010/6/1 Torsten Zühlsdorff <foo@meisterderspiele.de>:
> Hello,
>
> I have a set of unique data of about 150,000,000 rows. Regularly I get a
> list of data which contains many times more rows than the already stored
> set - often around 2,000,000,000 rows. Within these rows are many
> duplicates, and often the complete set of already stored data.
> I want to store only the entries which are not already stored, and I do
> not want to store duplicates either.
>
> Is there a faster way? Should I just insert every row and ignore it if
> the unique constraint fails?
>
> Thanks for every hint or advice! :)

I think you need to have a look at pgloader. It does COPY with error
handling. Very effective.

http://pgloader.projects.postgresql.org/

--
Cédric Villemain
2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On 06/01/2010 10:03 AM, Torsten Zühlsdorff wrote:
> Hello,
>
> I have a set of unique data of about 150,000,000 rows. Regularly I get a
> list of data which contains many times more rows than the already stored
> set - often around 2,000,000,000 rows. Within these rows are many
> duplicates, and often the complete set of already stored data.
> I want to store only the entries which are not already stored, and I do
> not want to store duplicates either.
>
> Is there a faster way? Should I just insert every row and ignore it if
> the unique constraint fails?

I do this with a stored procedure. I do not care about speed because my db
is really small and I only insert a few records a month, so I don't know
how fast this is, but here is my function:

CREATE FUNCTION addentry(idate timestamp without time zone, ilevel integer)
RETURNS character varying AS $$
declare
    tmp integer;
begin
    insert into blood(adate, alevel) values(idate, ilevel);
    return 'ok';
exception when unique_violation then
    select into tmp alevel from blood where adate = idate;
    if tmp <> ilevel then
        return idate || ' levels differ!';
    else
        return 'ok, already in table';
    end if;
end;
$$ LANGUAGE plpgsql;

Use it like:

select * from addentry('2010-06-06 08:00:00', 130);

I do an extra check that if the dates match, the levels match too, but you
wouldn't have to. There is a unique index on adate.

-Andy
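[Adapted to the urls table from the original post, the same
trap-the-unique-violation idea might look like the sketch below; the
function name is made up. Note that calling a function once per row will
be far slower than the set-based approaches elsewhere in this thread when
loading billions of rows.]

-- hypothetical per-row variant for the urls table
CREATE FUNCTION add_url(iurl text) RETURNS boolean AS $$
BEGIN
    INSERT INTO urls (url) VALUES (iurl);
    RETURN true;                       -- newly stored
EXCEPTION WHEN unique_violation THEN
    RETURN false;                      -- already known, ignore
END;
$$ LANGUAGE plpgsql;

-- usage: SELECT add_url('www.example.org');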
Cédric Villemain schrieb:

> I think you need to have a look at pgloader. It does COPY with error
> handling. Very effective.

Thanks for this advice. I will have a look at it.

Greetings from Germany,
Torsten
Scott Marlowe schrieb:

> insert into main select * from loader except select * from main;
>
> Note that for the except approach to work, the rows need to match in all
> fields, or you'd need a more complex query. If the info field here was
> different you'd get an error and no insert / update. For that case you
> might want to use "where not in":
>
> insert into main select * from loader where id not in (select id from main);

Thank you very much for your example. Now I've got it :)

I've tested your example on a small set of my rows. While testing I
stumbled over a difference in the SQL formulation: using EXCEPT seems to
be a little slower than the more complex WHERE NOT IN (subquery) with
GROUP BY.
Here is my example:

CREATE TABLE tseq (value text);
INSERT INTO tseq VALUES ('a'), ('b'), ('c');
CREATE UNIQUE INDEX tseq_unique ON tseq (value);
CREATE TEMP TABLE tmpseq(value text);
INSERT INTO tmpseq VALUES ('a'), ('b'), ('c');
INSERT INTO tmpseq VALUES ('a'), ('b'), ('c');
INSERT INTO tmpseq VALUES ('a'), ('b'), ('d');
INSERT INTO tmpseq VALUES ('d'), ('b'), ('d');

SELECT * FROM tseq;
 value
-------
 a
 b
 c
(3 rows)

SELECT * FROM tmpseq;
 value
-------
 a
 b
 c
 a
 b
 c
 a
 b
 d
 d
 b
 d
(12 rows)

VACUUM VERBOSE ANALYSE;

EXPLAIN ANALYZE SELECT value FROM tmpseq EXCEPT SELECT value FROM tseq;
                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..2.34 rows=4 width=2) (actual time=0.157..0.158 rows=1 loops=1)
   ->  Append  (cost=0.00..2.30 rows=15 width=2) (actual time=0.012..0.126 rows=15 loops=1)
         ->  Subquery Scan "*SELECT* 1"  (cost=0.00..1.24 rows=12 width=2) (actual time=0.009..0.060 rows=12 loops=1)
               ->  Seq Scan on tmpseq  (cost=0.00..1.12 rows=12 width=2) (actual time=0.004..0.022 rows=12 loops=1)
         ->  Subquery Scan "*SELECT* 2"  (cost=0.00..1.06 rows=3 width=2) (actual time=0.006..0.018 rows=3 loops=1)
               ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2) (actual time=0.003..0.009 rows=3 loops=1)
 Total runtime: 0.216 ms
(7 rows)

EXPLAIN ANALYZE SELECT value FROM tmpseq WHERE value NOT IN (SELECT value FROM tseq) GROUP BY value;
                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2.20..2.22 rows=2 width=2) (actual time=0.053..0.055 rows=1 loops=1)
   ->  Seq Scan on tmpseq  (cost=1.04..2.19 rows=6 width=2) (actual time=0.038..0.043 rows=3 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         SubPlan 1
           ->  Seq Scan on tseq  (cost=0.00..1.03 rows=3 width=2) (actual time=0.004..0.009 rows=3 loops=1)
 Total runtime: 0.105 ms
(6 rows)

My question: is this general behaviour, or just an effect of the small
test case?

Greetings from Germany,
Torsten
On Sun, Jun 6, 2010 at 6:02 AM, Torsten Zühlsdorff
<foo@meisterderspiele.de> wrote:
> Scott Marlowe schrieb:
> Thank you very much for your example. Now I've got it :)
>
> I've tested your example on a small set of my rows. While testing I
> stumbled over a difference in the SQL formulation: using EXCEPT seems to
> be a little slower than the more complex WHERE NOT IN (subquery) with
> GROUP BY. Here is my example:

Yeah, to get a good idea you need a more realistic example. Build some
tables with millions of rows using generate_series() and then test
against those.
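[A possible test setup along those lines, scaling up the tseq/tmpseq
example; table names and row counts are arbitrary assumptions.]

-- roughly 10 million stored values, 20 million incoming ones with overlap
CREATE TABLE tseq_big (value text);
INSERT INTO tseq_big
SELECT 'url-' || g FROM generate_series(1, 10000000) AS g;
CREATE UNIQUE INDEX tseq_big_unique ON tseq_big (value);
ANALYZE tseq_big;

CREATE TEMP TABLE tmpseq_big (value text);
INSERT INTO tmpseq_big
SELECT 'url-' || (random() * 12000000)::int
FROM generate_series(1, 20000000);
ANALYZE tmpseq_big;

EXPLAIN ANALYZE
SELECT value FROM tmpseq_big EXCEPT SELECT value FROM tseq_big;

EXPLAIN ANALYZE
SELECT value FROM tmpseq_big
WHERE value NOT IN (SELECT value FROM tseq_big)
GROUP BY value;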
Since you have lots of data you can use parallel loading.

Split your data in several files and then do:

CREATE TEMPORARY TABLE loader1 ( ... )
COPY loader1 FROM ...

Use a TEMPORARY TABLE for this: you don't need crash-recovery since if
something blows up, you can COPY it again... and it will be much faster
because no WAL will be written.

If your disk is fast, COPY is cpu-bound, so if you can do 1 COPY process
per core, and avoid writing WAL, it will scale.

This doesn't solve the other half of your problem (removing the
duplicates), which isn't easy to parallelize, but it will make the COPY
part a lot faster.

Note that you can have 1 core process the INSERT / removing duplicates
while the others are handling COPY and filling temp tables, so if you
pipeline it, you could save some time.

Does your data contain a lot of duplicates, or are they rare? What
percentage?
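[Filled in for the url case, one loading connection might do something
like the sketch below; the chunk file names are placeholders. Temporary
tables are local to their session, so each parallel connection works on
its own chunk.]

-- connection 1 (repeat with loader2 / chunk2.txt in connection 2, etc.)
CREATE TEMPORARY TABLE loader1 (url text);
COPY loader1 FROM '/path/to/chunk1.txt';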
Pierre C schrieb:

> Since you have lots of data you can use parallel loading.
>
> Split your data in several files and then do:
>
> CREATE TEMPORARY TABLE loader1 ( ... )
> COPY loader1 FROM ...
>
> Use a TEMPORARY TABLE for this: you don't need crash-recovery since if
> something blows up, you can COPY it again... and it will be much faster
> because no WAL will be written.

That's good advice, thank you :)

> If your disk is fast, COPY is cpu-bound, so if you can do 1 COPY process
> per core, and avoid writing WAL, it will scale.
>
> This doesn't solve the other half of your problem (removing the
> duplicates), which isn't easy to parallelize, but it will make the COPY
> part a lot faster.
>
> Note that you can have 1 core process the INSERT / removing duplicates
> while the others are handling COPY and filling temp tables, so if you
> pipeline it, you could save some time.
>
> Does your data contain a lot of duplicates, or are they rare? What
> percentage?

Within the data to import, most rows have 20 to 50 duplicates - sometimes
much more, sometimes less. But over 99.1% of the rows to import are
already known. This percentage is growing, because there is a finite
number of rows I want to collect.

In my special case I'm collecting domain names. So far this is completely
for private interest and runs on normal PC hardware. I collect them by
crawling known sites and checking them for new hosts. Maybe I will later
build an expired-domain service or a reverse-IP database or something like
that. But for now I'm just interested in how the sites are connected and
in the way people structure their domain names.

(Before someone asks: so far I have nearly more rows than domains exist,
because I also collect subdomains of all levels and do not delete entries.)

Thanks everyone for your advice. This will help me a lot!

Greetings from Germany,
Torsten

--
http://www.dddbl.de - a database layer that abstracts the work with 8
different database systems, separates queries from the application and can
automatically evaluate query results.
> Within the data to import, most rows have 20 to 50 duplicates - sometimes
> much more, sometimes less.

In that case (source data has lots of redundancy), after importing the
data chunks in parallel, you can run a first pass of de-duplication on the
chunks, also in parallel, something like:

CREATE TEMP TABLE foo_1_dedup AS SELECT DISTINCT * FROM foo_1;

or you could compute some aggregates, counts, etc. Same as before, no WAL
needed, and you can use all your cores in parallel.

From what you say this should reduce the size of your imported data by a
lot (and hence the time spent in the non-parallel operation).

With a different distribution, i.e. duplicates only between existing and
imported data and not within the imported data, this strategy would be
useless.
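[Filled in for the url-loading case, the per-chunk pass might look like
this; the table names follow the hypothetical loader1 chunk from earlier,
and the count query only illustrates how much the chunk shrinks.]

-- de-duplicate one chunk inside its own session, still without WAL
CREATE TEMP TABLE loader1_dedup AS
SELECT DISTINCT url FROM loader1;

-- optional sanity check: raw vs. distinct rows in this chunk
SELECT (SELECT count(*) FROM loader1)       AS raw_rows,
       (SELECT count(*) FROM loader1_dedup) AS distinct_rows;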
Pierre C schrieb:

>> Within the data to import, most rows have 20 to 50 duplicates -
>> sometimes much more, sometimes less.
>
> In that case (source data has lots of redundancy), after importing the
> data chunks in parallel, you can run a first pass of de-duplication on
> the chunks, also in parallel, something like:
>
> CREATE TEMP TABLE foo_1_dedup AS SELECT DISTINCT * FROM foo_1;
>
> or you could compute some aggregates, counts, etc. Same as before, no
> WAL needed, and you can use all your cores in parallel.
>
> From what you say this should reduce the size of your imported data by
> a lot (and hence the time spent in the non-parallel operation).

Thank you very much for this advice. I've tried it in another project with
similar import problems. It really sped up the import.

Thanks everyone for your time and help!

Greetings,
Torsten

--
http://www.dddbl.de - a database layer that abstracts the work with 8
different database systems, separates queries from the application and can
automatically evaluate query results.
>> In that case (source data has lots of redundancy), after importing the
>> data chunks in parallel, you can run a first pass of de-duplication on
>> the chunks, also in parallel [...]
>
> Thank you very much for this advice. I've tried it in another project
> with similar import problems. It really sped up the import.

Glad it was useful ;)