Re: How to insert a bulk of data with unique-violations very fast - Mailing list pgsql-performance

From Andy Colson
Subject Re: How to insert a bulk of data with unique-violations very fast
Date
Msg-id 4C0B986E.2060403@squeakycode.net
In response to How to insert a bulk of data with unique-violations very fast  (Torsten Zühlsdorff <foo@meisterderspiele.de>)
List pgsql-performance
On 06/01/2010 10:03 AM, Torsten Zühlsdorff wrote:
> Hello,
>
> i have a set of unique data of about 150,000,000 rows. Regularly I
> receive a list of data containing many times more rows than are
> already stored, often around 2,000,000,000 rows. Among these rows
> are many duplicates, and often the already stored set as well.
> I want to store only the entries that are not already stored, and I
> do not want to store duplicates. Example:
>
> Already stored set:
> a,b,c
>
> Given set:
> a,b,a,c,d,a,c,d,b
>
> Expected set after import:
> a,b,c,d
>
> I am now looking for a faster way to do the import. At the moment I
> import the new data with COPY into a table 'import', then I remove the
> duplicates and insert every row that is not already known. After that,
> 'import' is truncated.
>
> Is there a faster way? Should I just insert every row and ignore it if
> the unique constraint fails?
>
> Here is the simplified table schema; in real life it is partitioned:
> test=# \d urls
>                              Table "public.urls"
>  Column |  Type   |                       Modifiers
> --------+---------+-------------------------------------------------------
>  url_id | integer | not null default nextval('urls_url_id_seq'::regclass)
>  url    | text    | not null
> Indexes:
>     "urls_url" UNIQUE, btree (url)
>     "urls_url_id" btree (url_id)
>
> Thanks for any hints or advice! :)
>
> Greetings from Germany,
> Torsten

I do this with a stored procedure.  I do not care about speed because my db is really small and I only insert a few
records a month.  So I don't know how fast this is, but here is my function:

CREATE FUNCTION addentry(idate timestamp without time zone, ilevel integer) RETURNS character varying
AS $$
declare
     tmp integer;
begin
     -- try the insert; the unique index on adate raises unique_violation on a duplicate
     insert into blood(adate, alevel) values(idate, ilevel);
     return 'ok';
exception
     when unique_violation then
         -- the date is already stored; check whether the stored level matches
         select into tmp alevel from blood where adate = idate;
         if tmp <> ilevel then
             return idate || ' levels differ!';
         else
             return 'ok, already in table';
         end if;
end; $$
LANGUAGE plpgsql;


Use it like: select * from addentry('2010-06-06 8:00:00', 130);

I do an extra check that if the dates match, the levels match too, but you wouldn't have to.  There is a unique
index on adate.
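
If you wanted to apply the same exception-trapping pattern to your urls table, a sketch could look like the
following; add_url is just a name I made up here, and it simply swallows the duplicate instead of reporting it:

CREATE FUNCTION add_url(iurl text) RETURNS void
AS $$
begin
     -- the unique index urls_url raises unique_violation on a duplicate
     insert into urls(url) values(iurl);
exception
     when unique_violation then
         -- url is already stored; ignore it
         null;
end; $$
LANGUAGE plpgsql;

-- drive it from the staging table, one row at a time:
select add_url(url) from import;

Note that the exception block fires one subtransaction per row, so for two billion rows this is probably not the
fast path you are after.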

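For the bulk workflow Torsten describes above (COPY into 'import', remove the duplicates, insert the rows that are
not already known, truncate), the set-based version boils down to something like the sketch below.  This assumes
the staging table has a single url column and is only a sketch; I have not measured it against his current
approach:

-- load the new batch into the staging table first, e.g.
-- COPY import (url) FROM '/path/to/batch.csv';

-- insert only urls that are neither duplicated within the batch
-- nor already present in urls
INSERT INTO urls (url)
SELECT DISTINCT i.url
FROM import i
WHERE NOT EXISTS (SELECT 1 FROM urls u WHERE u.url = i.url);

-- empty the staging table for the next batch
TRUNCATE import;
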
-Andy

