Most efficient way to insert without duplicates - Mailing list pgsql-general

From François Beausoleil
Subject Most efficient way to insert without duplicates
Date
Msg-id 897C3D25-1352-49C7-A9E6-2E9270029FAC@teksol.info
Whole thread Raw
Responses Re: Most efficient way to insert without duplicates
Re: Most efficient way to insert without duplicates
Re: Most efficient way to insert without duplicates
List pgsql-general
Hi all!

I track Twitter followers in my database. I have the following table:

# \d persona_followers
           Table "public.persona_followers"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 service_id  | bigint                      | not null
 follower_id | bigint                      | not null
 valid_at    | timestamp without time zone |
Indexes:
    "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)

The table IS NOT partitioned.

I have a list of Twitter people I follow more - brands, actors, those kinds of Twitter accounts. They often have
thousands,if not hundreds of thousands, of followers. I fetch the followers of these accounts about once a day. When
it'stime to insert into the database, I use the following algorithm: 

CREATE TEMP TABLE import( service_id bigint, follower_id bigint );
COPY INTO import FROM STDIN;
...
\N

INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, NOW()
  FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import
  WHERE NOT EXISTS(SELECT * FROM persona_followers WHERE import.service_id = persona_followers.service_id AND
import.follower_id= persona_followers.follower_id); 

I currently have 660 million rows in persona_followers (47 GB). A test import is 13.5 million rows (571 MB). The real
dailyimport will be at least 10x more. In a 24 hour period, I will have at most a few thousand *new* rows - the rest
willalready exist in persona_followers. How do I most efficiently eliminate the duplicates? Should I delete the
duplicatesin import? Or should I bite the bullet and EXCEPT the final table? Should I insert much smaller batches? Or
isthe above already the most efficient way? What other completely different data structure could I use to achieve my
goal?I truly need the exhaustive list of followers because we do reach calculations (number of unique accounts which
receiveda particular tweet). 

The true answer is probably "benchmark on your own servers", but I'm looking for guidelines, people with the same kind
ofexperience. 

Thanks!
François
Attachment

pgsql-general by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: How to build my own 9.2.4 installer package for IBM Power System ppc64
Next
From: John R Pierce
Date:
Subject: Re: How to build my own 9.2.4 installer package for IBM Power System ppc64