Home > mailing lists

Most efficient way to insert without duplicates - Mailing list pgsql-general

From	François Beausoleil
Subject	Most efficient way to insert without duplicates
Date	April 17, 2013 05:51:31
Msg-id	897C3D25-1352-49C7-A9E6-2E9270029FAC@teksol.info Whole thread Raw
Responses	Re: Most efficient way to insert without duplicates Re: Most efficient way to insert without duplicates Re: Most efficient way to insert without duplicates
List	pgsql-general

Tree view

Hi all!

I track Twitter followers in my database. I have the following table:

# \d persona_followers
           Table "public.persona_followers"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 service_id  | bigint                      | not null
 follower_id | bigint                      | not null
 valid_at    | timestamp without time zone |
Indexes:
    "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)

The table IS NOT partitioned.

I have a list of Twitter people I follow more - brands, actors, those kinds of Twitter accounts. They often have
thousands,if not hundreds of thousands, of followers. I fetch the followers of these accounts about once a day. When
it'stime to insert into the database, I use the following algorithm: 

CREATE TEMP TABLE import( service_id bigint, follower_id bigint );
COPY INTO import FROM STDIN;
...
\N

INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, NOW()
  FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import
  WHERE NOT EXISTS(SELECT * FROM persona_followers WHERE import.service_id = persona_followers.service_id AND
import.follower_id= persona_followers.follower_id); 

I currently have 660 million rows in persona_followers (47 GB). A test import is 13.5 million rows (571 MB). The real
dailyimport will be at least 10x more. In a 24 hour period, I will have at most a few thousand *new* rows - the rest
willalready exist in persona_followers. How do I most efficiently eliminate the duplicates? Should I delete the
duplicatesin import? Or should I bite the bullet and EXCEPT the final table? Should I insert much smaller batches? Or
isthe above already the most efficient way? What other completely different data structure could I use to achieve my
goal?I truly need the exhaustive list of followers because we do reach calculations (number of unique accounts which
receiveda particular tweet). 

The true answer is probably "benchmark on your own servers", but I'm looking for guidelines, people with the same kind
ofexperience. 

Thanks!
François

Attachment

smime.p7s

pgsql-general by date:

From: "Joshua D. Drake"
Date: 17 April 2013, 05:34:46
Subject: Re: How to build my own 9.2.4 installer package for IBM Power System ppc64

From: John R Pierce
Date: 17 April 2013, 06:30:27
Subject: Re: How to build my own 9.2.4 installer package for IBM Power System ppc64

Most efficient way to insert without duplicates - Mailing list pgsql-general

Attachment

Previous

Next