Home > mailing lists

Re: Most efficient way to insert without duplicates - Mailing list pgsql-general

From	Amador Alvarez
Subject	Re: Most efficient way to insert without duplicates
Date	April 24, 2013 19:39:58
Msg-id	CA+vGRtijYDy5K9zZzJkaL3ahm5PDvdrtYRhYqfECFTmc8ZXhnw@mail.gmail.com Whole thread Raw
In response to	Re: Most efficient way to insert without duplicates (François Beausoleil <francois@teksol.info>)
List	pgsql-general

Tree view

I would also give it a try on turning on statistics on service_id and follower_id fields and tune collecting of distinct values for the optimizer.

Cheers,

Amador A.

On Wed, Apr 24, 2013 at 9:04 AM, François Beausoleil <francois@teksol.info> wrote:

Hi!

Le 2013-04-17 à 14:15, Jeff Janes a écrit :

On Wed, Apr 17, 2013 at 4:26 AM, François Beausoleil <francois@teksol.info> wrote:

Insert on public.persona_followers (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
Buffers: shared hit=33135295 read=4776921
-> Subquery Scan on t1 (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)

It looks like 12% of the time is being spent figuring out what rows to insert, and 88% actually doing the insertions.

So I think that index maintenance is killing you. You could try adding a sort to your select so that rows are inserted in index order, or inserting in batches in which the batches are partitioned by service_id (which is almost the same thing as sorting, since service_id is the lead column)

To close out the thread, the final results are in http://explain.depesz.com/s/xOe :

Insert on public.persona_followers (cost=149905.33..149906.58 rows=100 width=24) (actual time=19.837..19.837 rows=0 loops=1)
Buffers: shared hit=206, local hit=1 read=105
-> Sort (cost=149905.33..149905.58 rows=100 width=24) (actual time=19.534..19.536 rows=6 loops=1)
Output: persona_followers_import.service_id, persona_followers_import.follower_id, (min(persona_followers_import.valid_at))
Sort Key: persona_followers_import.follower_id
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=176, local hit=1 read=105
-> HashAggregate (cost=149901.01..149902.01 rows=100 width=24) (actual time=19.514..19.526 rows=6 loops=1)
Output: persona_followers_import.service_id, persona_followers_import.follower_id, min(persona_followers_import.valid_at)
Buffers: shared hit=176, local hit=1 read=105
-> Bitmap Heap Scan on pg_temp_35.persona_followers_import (cost=93051.86..149734.25 rows=22234 width=24) (actual time=14.350..19.505 rows=6 loops=1)
Output: persona_followers_import.service_id, persona_followers_import.follower_id, persona_followers_import.valid_at
Recheck Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
Filter: (NOT (hashed SubPlan 1))
Buffers: shared hit=176, local hit=1 read=105
-> Bitmap Index Scan on persona_followers_import_service_id (cost=0.00..1134.32 rows=44469 width=0) (actual time=1.752..1.752 rows=10000 loops=1)
Index Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
Buffers: local hit=1 read=40
SubPlan 1
-> Bitmap Heap Scan on public.persona_followers (cost=661.54..91851.35 rows=24252 width=8) (actual time=2.309..6.400 rows=14317 loops=1)
Output: public.persona_followers.follower_id
Recheck Cond: (public.persona_followers.service_id = 362513855)
Buffers: shared hit=176
-> Bitmap Index Scan on persona_followers_pkey (cost=0.00..655.48 rows=24252 width=0) (actual time=2.284..2.284 rows=14317 loops=1)
Index Cond: (public.persona_followers.service_id = 362513855)
Buffers: shared hit=88
Total runtime: 19.917 ms

Runtime is under 20 milliseconds, per imported service_id. I have a few thousand such items per day, and that's fine. The final script looks like this:

CREATE TEMPORARY TABLE persona_followers_import( service_id bigint, follower_id bigint );
COPY TO persona_followers_import FROM stdin;
...
\.

CREATE INDEX index_persona_followers_import_on_service_id ON persona_followers_import(service_id, follower_id);

service_ids := SELECT DISTINCT service_id FROM persona_followers_import;
for each service_id in service_ids:
EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS )
INSERT INTO persona_followers(service_id, follower_id, valid_at)
SELECT service_id, follower_id, MIN(valid_at)
FROM persona_followers_import
WHERE follower_id IS NOT NULL
AND follower_id NOT IN (SELECT follower_id FROM persona_followers WHERE service_id = :service_id)
AND service_id = :service_id
GROUP BY service_id, follower_id
ORDER BY follower_id

This seems to give me the best possible throughput. I was able to import days of data in an hour, compared to hours of work for one day of data.

Thanks for all suggestions, and PostgreSQL rocks!
François Beausoleil

pgsql-general by date:

From: François Beausoleil
Date: 24 April 2013, 19:04:51
Subject: Re: Most efficient way to insert without duplicates

From: jonesd@xmission.com
Date: 24 April 2013, 21:12:46
Subject: Re: Log messages regarding automatic vacuum and exclusive locks

Re: Most efficient way to insert without duplicates - Mailing list pgsql-general

Previous

Next