Re: Most efficient way to insert without duplicates - Mailing list pgsql-general

From: François Beausoleil
Subject: Re: Most efficient way to insert without duplicates
Date:
Msg-id: A35148D6-07F4-4E11-84FD-7FDC0F19C011@teksol.info
In response to: Re: Most efficient way to insert without duplicates (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-general
Hi!

On 2013-04-17, at 14:15, Jeff Janes wrote:

On Wed, Apr 17, 2013 at 4:26 AM, François Beausoleil <francois@teksol.info> wrote:


 Insert on public.persona_followers  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
   Buffers: shared hit=33135295 read=4776921
   ->  Subquery Scan on t1  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)


It looks like 12% of the time is being spent figuring out what rows to insert (the subquery scan accounts for roughly 579 of the 4,729 seconds of total runtime), and 88% actually doing the insertions.

So I think that index maintenance is killing you. You could try adding a sort to your SELECT so that rows are inserted in index order, or inserting in batches partitioned by service_id (which is almost the same thing as sorting, since service_id is the lead column).
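
In query form, that suggestion amounts to ordering the SELECT by the target table's key before inserting, roughly as follows (a sketch only; it assumes the primary key on persona_followers is (service_id, follower_id) and omits the duplicate filter that the final script below applies):

-- Sketch: sort the incoming rows to match the primary key order, so the
-- b-tree is filled left to right instead of being touched at random.
INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, MIN(valid_at)
  FROM persona_followers_import
  GROUP BY service_id, follower_id
  ORDER BY service_id, follower_id;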


To close out the thread, the final results are in http://explain.depesz.com/s/xOe :

Insert on public.persona_followers  (cost=149905.33..149906.58 rows=100 width=24) (actual time=19.837..19.837 rows=0 loops=1)
  Buffers: shared hit=206, local hit=1 read=105
  ->  Sort  (cost=149905.33..149905.58 rows=100 width=24) (actual time=19.534..19.536 rows=6 loops=1)
        Output: persona_followers_import.service_id, persona_followers_import.follower_id, (min(persona_followers_import.valid_at))
        Sort Key: persona_followers_import.follower_id
        Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=176, local hit=1 read=105
        ->  HashAggregate  (cost=149901.01..149902.01 rows=100 width=24) (actual time=19.514..19.526 rows=6 loops=1)
              Output: persona_followers_import.service_id, persona_followers_import.follower_id, min(persona_followers_import.valid_at)
              Buffers: shared hit=176, local hit=1 read=105
              ->  Bitmap Heap Scan on pg_temp_35.persona_followers_import  (cost=93051.86..149734.25 rows=22234 width=24) (actual time=14.350..19.505 rows=6 loops=1)
                    Output: persona_followers_import.service_id, persona_followers_import.follower_id, persona_followers_import.valid_at
                    Recheck Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
                    Filter: (NOT (hashed SubPlan 1))
                    Buffers: shared hit=176, local hit=1 read=105
                    ->  Bitmap Index Scan on persona_followers_import_service_id  (cost=0.00..1134.32 rows=44469 width=0) (actual time=1.752..1.752 rows=10000 loops=1)
                          Index Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
                          Buffers: local hit=1 read=40
                    SubPlan 1
                      ->  Bitmap Heap Scan on public.persona_followers  (cost=661.54..91851.35 rows=24252 width=8) (actual time=2.309..6.400 rows=14317 loops=1)
                            Output: public.persona_followers.follower_id
                            Recheck Cond: (public.persona_followers.service_id = 362513855)
                            Buffers: shared hit=176
                            ->  Bitmap Index Scan on persona_followers_pkey  (cost=0.00..655.48 rows=24252 width=0) (actual time=2.284..2.284 rows=14317 loops=1)
                                  Index Cond: (public.persona_followers.service_id = 362513855)
                                  Buffers: shared hit=88
Total runtime: 19.917 ms

Runtime is under 20 milliseconds per imported service_id. I have a few thousand such service_ids per day, so at roughly 20 ms each that is only a minute or two of insert work per day, which is fine. The final script looks like this:

CREATE TEMPORARY TABLE persona_followers_import( service_id bigint, follower_id bigint, valid_at timestamptz );  -- valid_at is used by the INSERT below; its exact type is assumed here
COPY persona_followers_import FROM stdin;
...
\.

CREATE INDEX index_persona_followers_import_on_service_id ON persona_followers_import(service_id, follower_id);

-- pseudocode: run the INSERT once per imported service_id (a runnable PL/pgSQL sketch follows below)
service_ids := SELECT DISTINCT service_id FROM persona_followers_import;
for each service_id in service_ids:
  EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS )
  INSERT INTO persona_followers(service_id, follower_id, valid_at)
    SELECT service_id, follower_id, MIN(valid_at)
    FROM persona_followers_import
    WHERE follower_id IS NOT NULL
      AND follower_id NOT IN (SELECT follower_id FROM persona_followers WHERE service_id = :service_id)
      AND service_id = :service_id
    GROUP BY service_id, follower_id
    ORDER BY follower_id

This seems to give me the best possible throughput. I was able to import days of data in an hour, compared to hours of work for one day of data.
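
The loop itself is only pseudocode; one way to run the whole thing inside the database would be a DO block along these lines (a sketch, assuming the same tables as above; the EXPLAIN wrapper is dropped for brevity):

-- Sketch: the per-service_id insert loop expressed in PL/pgSQL instead of
-- the pseudocode loop. Each iteration inserts only the followers not
-- already present for that service_id, in index order.
DO $$
DECLARE
  sid bigint;
BEGIN
  FOR sid IN SELECT DISTINCT service_id FROM persona_followers_import LOOP
    INSERT INTO persona_followers(service_id, follower_id, valid_at)
      SELECT service_id, follower_id, MIN(valid_at)
      FROM persona_followers_import
      WHERE follower_id IS NOT NULL
        AND follower_id NOT IN (SELECT follower_id
                                  FROM persona_followers
                                  WHERE service_id = sid)
        AND service_id = sid
      GROUP BY service_id, follower_id
      ORDER BY follower_id;
  END LOOP;
END;
$$;

Since the import table is temporary, such a block would have to run in the same session as the COPY.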

Thanks for all suggestions, and PostgreSQL rocks!
François Beausoleil
