Re: Most efficient way to insert without duplicates - Mailing list pgsql-general

From: François Beausoleil
Subject: Re: Most efficient way to insert without duplicates
Date:
Msg-id: A35148D6-07F4-4E11-84FD-7FDC0F19C011@teksol.info
In response to: Re: Most efficient way to insert without duplicates (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-general
Hi!

On 2013-04-17, at 14:15, Jeff Janes wrote:

On Wed, Apr 17, 2013 at 4:26 AM, François Beausoleil <francois@teksol.info> wrote:


 Insert on public.persona_followers  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
   Buffers: shared hit=33135295 read=4776921
   ->  Subquery Scan on t1  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)


It looks like 12% of the time is being spent figuring out what rows to insert (the subquery scan accounts for roughly 579 of the 4,729 seconds of total runtime), and 88% actually doing the insertions.

So I think that index maintenance is killing you. You could try adding a sort to your SELECT so that rows are inserted in index order, or inserting in batches partitioned by service_id (which is almost the same thing as sorting, since service_id is the lead column).
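
In query form, that suggestion amounts to ordering the SELECT by the target table's key before inserting, roughly as follows (a sketch only; it assumes the primary key on persona_followers is (service_id, follower_id) and omits the duplicate filter that the final script below applies):

-- Sketch: sort the incoming rows to match the primary key order, so the
-- b-tree is filled left to right instead of being touched at random.
INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, MIN(valid_at)
  FROM persona_followers_import
  GROUP BY service_id, follower_id
  ORDER BY service_id, follower_id;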


To close out the thread, the final results are in http://explain.depesz.com/s/xOe :

Insert on public.persona_followers  (cost=149905.33..149906.58 rows=100 width=24) (actual time=19.837..19.837 rows=0 loops=1)
  Buffers: shared hit=206, local hit=1 read=105
  ->  Sort  (cost=149905.33..149905.58 rows=100 width=24) (actual time=19.534..19.536 rows=6 loops=1)
        Output: persona_followers_import.service_id, persona_followers_import.follower_id, (min(persona_followers_import.valid_at))
        Sort Key: persona_followers_import.follower_id
        Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=176, local hit=1 read=105
        ->  HashAggregate  (cost=149901.01..149902.01 rows=100 width=24) (actual time=19.514..19.526 rows=6 loops=1)
              Output: persona_followers_import.service_id, persona_followers_import.follower_id, min(persona_followers_import.valid_at)
              Buffers: shared hit=176, local hit=1 read=105
              ->  Bitmap Heap Scan on pg_temp_35.persona_followers_import  (cost=93051.86..149734.25 rows=22234 width=24) (actual time=14.350..19.505 rows=6 loops=1)
                    Output: persona_followers_import.service_id, persona_followers_import.follower_id, persona_followers_import.valid_at
                    Recheck Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
                    Filter: (NOT (hashed SubPlan 1))
                    Buffers: shared hit=176, local hit=1 read=105
                    ->  Bitmap Index Scan on persona_followers_import_service_id  (cost=0.00..1134.32 rows=44469 width=0) (actual time=1.752..1.752 rows=10000 loops=1)
                          Index Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
                          Buffers: local hit=1 read=40
                    SubPlan 1
                      ->  Bitmap Heap Scan on public.persona_followers  (cost=661.54..91851.35 rows=24252 width=8) (actual time=2.309..6.400 rows=14317 loops=1)
                            Output: public.persona_followers.follower_id
                            Recheck Cond: (public.persona_followers.service_id = 362513855)
                            Buffers: shared hit=176
                            ->  Bitmap Index Scan on persona_followers_pkey  (cost=0.00..655.48 rows=24252 width=0) (actual time=2.284..2.284 rows=14317 loops=1)
                                  Index Cond: (public.persona_followers.service_id = 362513855)
                                  Buffers: shared hit=88
Total runtime: 19.917 ms

Runtime is under 20 milliseconds per imported service_id. I have a few thousand such service_ids per day, so at roughly 20 ms each that is only a minute or two of insert work per day, which is fine. The final script looks like this:

CREATE TEMPORARY TABLE persona_followers_import( service_id bigint, follower_id bigint, valid_at timestamptz );  -- valid_at is used by the INSERT below; its exact type is assumed here
COPY persona_followers_import FROM stdin;
...
\.

CREATE INDEX index_persona_followers_import_on_service_id ON persona_followers_import(service_id, follower_id);

-- pseudocode: run the INSERT once per imported service_id (a runnable PL/pgSQL sketch follows below)
service_ids := SELECT DISTINCT service_id FROM persona_followers_import;
for each service_id in service_ids:
  EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS )
  INSERT INTO persona_followers(service_id, follower_id, valid_at)
    SELECT service_id, follower_id, MIN(valid_at)
    FROM persona_followers_import
    WHERE follower_id IS NOT NULL
      AND follower_id NOT IN (SELECT follower_id FROM persona_followers WHERE service_id = :service_id)
      AND service_id = :service_id
    GROUP BY service_id, follower_id
    ORDER BY follower_id

This seems to give me the best possible throughput. I was able to import days of data in an hour, compared to hours of work for one day of data.
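
The loop itself is only pseudocode; one way to run the whole thing inside the database would be a DO block along these lines (a sketch, assuming the same tables as above; the EXPLAIN wrapper is dropped for brevity):

-- Sketch: the per-service_id insert loop expressed in PL/pgSQL instead of
-- the pseudocode loop. Each iteration inserts only the followers not
-- already present for that service_id, in index order.
DO $$
DECLARE
  sid bigint;
BEGIN
  FOR sid IN SELECT DISTINCT service_id FROM persona_followers_import LOOP
    INSERT INTO persona_followers(service_id, follower_id, valid_at)
      SELECT service_id, follower_id, MIN(valid_at)
      FROM persona_followers_import
      WHERE follower_id IS NOT NULL
        AND follower_id NOT IN (SELECT follower_id
                                  FROM persona_followers
                                  WHERE service_id = sid)
        AND service_id = sid
      GROUP BY service_id, follower_id
      ORDER BY follower_id;
  END LOOP;
END;
$$;

Since the import table is temporary, such a block would have to run in the same session as the COPY.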

Thanks for all suggestions, and PostgreSQL rocks!
François Beausoleil
