Re: How to insert a bulk of data with unique-violations very fast - Mailing list pgsql-performance

From Scott Marlowe
Subject Re: How to insert a bulk of data with unique-violations very fast
Date
Msg-id AANLkTilj2xHAZMnQ-To3lod3mMPw1DBU_8EkRwTC29qb@mail.gmail.com
Whole thread Raw
In response to How to insert a bulk of data with unique-violations very fast  (Torsten Zühlsdorff <foo@meisterderspiele.de>)
Responses Re: How to insert a bulk of data with unique-violations very fast
List pgsql-performance
On Thu, Jun 3, 2010 at 11:19 AM, Torsten Zühlsdorff
<foo@meisterderspiele.de> wrote:
> Scott Marlowe schrieb:
>>
>> On Tue, Jun 1, 2010 at 9:03 AM, Torsten Zühlsdorff
>> <foo@meisterderspiele.de> wrote:
>>>
>>> Hello,
>>>
>>> i have a set of unique data which about 150.000.000 rows. Regullary i get
>>> a
>>> list of data, which contains multiple times of rows than the already
>>> stored
>>> one. Often around 2.000.000.000 rows. Within this rows are many
>>> duplicates
>>> and often the set of already stored data.
>>> I want to store just every entry, which is not within the already stored
>>> one. Also i do not want to store duplicates. Example:
>>
>> The standard method in pgsql is to load the data into a temp table
>> then insert where not exists in old table.
>
> Sorry, i didn't get it. I've googled some examples, but no one match at my
> case. Every example i found was a single insert which should be done or
> ignored, if the row is already stored.
>
> But in my case i have a bulk of rows with duplicates. Either your tipp
> doesn't match my case or i didn't unterstand it correctly. Can you provide a
> simple example?

create table main (id int primary key, info text);
create table loader (id int, info text);
insert into main values (1,'abc'),(2,'def'),(3,'ghi');
insert into loader values (1,'abc'),(4,'xyz');
select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
(3 rows)

select * from loader;
 id | info
----+------
  1 | abc
  4 | xyz
(2 rows)

insert into main select * from loader except select * from main;
select * from main;
 id | info
----+------
  1 | abc
  2 | def
  3 | ghi
  4 | xyz
(4 rows)

Note that for the where not exists to work the fields would need to be
all the same, or you'd need a more complex query.  If the info field
here was different you'd get an error an no insert / update.  For that
case you might want to use "where not in":

insert into main select * from loader where id not in (select id from main);

If you wanted the new rows to update pre-existing rows, then you could
run an update first where the ids matched, then the insert where no id
matches.

pgsql-performance by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: Weird XFS WAL problem
Next
From: Anj Adu
Date:
Subject: Re: slow query performance