Thread: Most efficient way to insert without duplicates

Most efficient way to insert without duplicates

From
François Beausoleil
Date:
Hi all!

I track Twitter followers in my database. I have the following table:

# \d persona_followers
           Table "public.persona_followers"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 service_id  | bigint                      | not null
 follower_id | bigint                      | not null
 valid_at    | timestamp without time zone |
Indexes:
    "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)

The table IS NOT partitioned.

I have a list of Twitter people I follow more closely - brands, actors, those kinds of Twitter accounts. They often
have thousands, if not hundreds of thousands, of followers. I fetch the followers of these accounts about once a day.
When it's time to insert into the database, I use the following algorithm:

CREATE TEMP TABLE import( service_id bigint, follower_id bigint );
COPY import FROM STDIN;
...
\.

INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, NOW()
  FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import
  WHERE NOT EXISTS(SELECT * FROM persona_followers WHERE import.service_id = persona_followers.service_id AND
    import.follower_id = persona_followers.follower_id);

I currently have 660 million rows in persona_followers (47 GB). A test import is 13.5 million rows (571 MB). The real
daily import will be at least 10x more. In a 24 hour period, I will have at most a few thousand *new* rows - the rest
will already exist in persona_followers. How do I most efficiently eliminate the duplicates? Should I delete the
duplicates in import? Or should I bite the bullet and EXCEPT the final table? Should I insert much smaller batches? Or
is the above already the most efficient way? What other completely different data structure could I use to achieve my
goal? I truly need the exhaustive list of followers because we do reach calculations (number of unique accounts which
received a particular tweet).

The true answer is probably "benchmark on your own servers", but I'm looking for guidelines, from people with the same
kind of experience.

Thanks!
François

Re: Most efficient way to insert without duplicates

From
François Beausoleil
Date:
On 2013-04-16 at 22:51, François Beausoleil wrote:

> Hi all!
>
> I track Twitter followers in my database. I have the following table:
>
> # \d persona_followers
>           Table "public.persona_followers"
>   Column    |            Type             | Modifiers
> -------------+-----------------------------+-----------
> service_id  | bigint                      | not null
> follower_id | bigint                      | not null
> valid_at    | timestamp without time zone |
> Indexes:
>    "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)
>
> The table IS NOT partitioned.

A run with the following query:

INSERT INTO "persona_followers"
  SELECT "service_id", "follower_id", now()
  FROM (
    SELECT *
    FROM (
        SELECT DISTINCT "service_id", "follower_id"
        FROM "persona_followers_import"
        WHERE "service_id" IS NOT NULL OR "follower_id" IS NOT NULL
      EXCEPT
         SELECT "service_id", "follower_id" FROM "persona_followers") AS "t1") AS "t1"

results in http://explain.depesz.com/s/Y1c

 Insert on public.persona_followers  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
   Buffers: shared hit=33135295 read=4776921
   ->  Subquery Scan on t1  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)
         Output: t1.service_id, t1.follower_id, now()
         Buffers: shared hit=36891 read=3572263
         ->  HashSetOp Except  (cost=139261.12..20389650.18 rows=6256498 width=16) (actual time=562265.127..566513.759 rows=6819520 loops=1)
               Output: "*SELECT* 1".service_id, "*SELECT* 1".follower_id, (0)
               Buffers: shared hit=36891 read=3572263
               ->  Append  (cost=139261.12..17054024.97 rows=667125042 width=16) (actual time=4090.462..320879.545 rows=667689321 loops=1)
                     Buffers: shared hit=36891 read=3572263
                     ->  Subquery Scan on "*SELECT* 1"  (cost=139261.12..264391.09 rows=6256498 width=16) (actual time=4090.461..7798.334 rows=6820784 loops=1)
                           Output: "*SELECT* 1".service_id, "*SELECT* 1".follower_id, 0
                           Buffers: shared hit=36891
                           ->  HashAggregate  (cost=139261.12..201826.11 rows=6256498 width=16) (actual time=4090.459..6795.009 rows=6820784 loops=1)
                                 Output: persona_followers_import.service_id, persona_followers_import.follower_id
                                 Buffers: shared hit=36891
                                 ->  Seq Scan on francois.persona_followers_import  (cost=0.00..105137.75 rows=6824675 width=16) (actual time=0.017..1344.916 rows=6824700 loops=1)
                                       Output: persona_followers_import.service_id, persona_followers_import.follower_id
                                       Filter: ((persona_followers_import.service_id IS NOT NULL) OR (persona_followers_import.follower_id IS NOT NULL))
                                       Buffers: shared hit=36891
                     ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..16789633.88 rows=660868544 width=16) (actual time=6.694..238761.499 rows=660868537 loops=1)
                           Output: "*SELECT* 2".service_id, "*SELECT* 2".follower_id, 1
                           Buffers: shared read=3572263
                           ->  Seq Scan on public.persona_followers  (cost=0.00..10180948.44 rows=660868544 width=16) (actual time=6.693..137929.808 rows=660868537 loops=1)
                                 Output: public.persona_followers.service_id, public.persona_followers.follower_id
                                 Buffers: shared read=3572263
 Total runtime: 4729338.157 ms

1h20m for 6.8 million rows inserted. The estimates are spot on, since I had just run VACUUM ANALYZE on both tables
prior to doing this test. Running the original query now, SELECT * FROM ... WHERE NOT EXISTS(...).

Bye,
François

Re: Most efficient way to insert without duplicates

From
Chris Curvey
Date:

INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, NOW()
  FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import
  WHERE NOT EXISTS(SELECT * FROM persona_followers WHERE import.service_id = persona_followers.service_id AND import.follower_id = persona_followers.follower_id);


I'm wondering if you have an unneeded level of nesting. (I don't know if it would make any difference, but it might).

INSERT INTO persona_followers(service_id, follower_id, valid_at)
SELECT DISTINCT service_id, follower_id, now()
FROM import
WHERE NOT EXISTS(SELECT * FROM persona_followers WHERE import.service_id = persona_followers.service_id AND import.follower_id = persona_followers.follower_id);

Re: Most efficient way to insert without duplicates

From
Jeff Janes
Date:
On Wed, Apr 17, 2013 at 4:26 AM, François Beausoleil <francois@teksol.info> wrote:


 Insert on public.persona_followers  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
   Buffers: shared hit=33135295 read=4776921
   ->  Subquery Scan on t1  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)


It looks like 12% of the time is being spent figuring out what rows to insert, and 88% actually doing the insertions.

So I think that index maintenance is killing you.  You could try adding a sort to your select so that rows are inserted in index order, or inserting in batches in which the batches are partitioned by service_id (which is almost the same thing as sorting, since service_id is the lead column)
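For illustration, the sorted variant described above might look something like this (a sketch against the schema from the first message, not a query from the original mail):

INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, NOW()
  FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import
  WHERE NOT EXISTS (SELECT 1 FROM persona_followers pf
                    WHERE import.service_id = pf.service_id
                      AND import.follower_id = pf.follower_id)
  ORDER BY service_id, follower_id;  -- matches the primary key order

The idea is that inserting in primary-key order keeps the btree leaf pages being written localized, instead of jumping randomly across the whole 47 GB index.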

Cheers,

Jeff

Re: Most efficient way to insert without duplicates

From
Moshe Jacobson
Date:
On Tue, Apr 16, 2013 at 10:51 PM, François Beausoleil <francois@teksol.info> wrote:
INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, NOW()
  FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import
  WHERE NOT EXISTS(SELECT * FROM persona_followers WHERE import.service_id = persona_followers.service_id AND import.follower_id = persona_followers.follower_id);

Try this for your insert query instead:

insert into persona_followers( service_id, follower_id, valid_at )
select i.service_id, i.follower_id, now()
from import i
left join persona_followers pf on i.service_id = pf.service_id and i.follower_id = pf.follower_id
where pf.service_id is null
order by 1,2;

This will insert only those rows that are not already present, and involves no subqueries and only one join.

-- 
Moshe Jacobson
Nead Werx, Inc. | Manager of Systems Engineering
2323 Cumberland Parkway, Suite 201 | Atlanta, GA 30339
moshe@neadwerx.com | 
www.neadwerx.com

"Quality is not an act, it is a habit." -- Aristotle

Re: Most efficient way to insert without duplicates

From
François Beausoleil
Date:

On 2013-04-17 at 14:15, Jeff Janes wrote:

On Wed, Apr 17, 2013 at 4:26 AM, François Beausoleil <francois@teksol.info> wrote:


 Insert on public.persona_followers  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
   Buffers: shared hit=33135295 read=4776921
   ->  Subquery Scan on t1  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)


It looks like 12% of the time is being spent figuring out what rows to insert, and 88% actually doing the insertions.

So I think that index maintenance is killing you.  You could try adding a sort to your select so that rows are inserted in index order, or inserting in batches in which the batches are partitioned by service_id (which is almost the same thing as sorting, since service_id is the lead column)

In that case, partitioning the original table by service_id % N would help, since the index would be much smaller, right?

N would have to be reasonable - 10, 100, 256, or something similar.

Thanks,
François


Re: Most efficient way to insert without duplicates

From
Jeff Janes
Date:
On Wed, Apr 17, 2013 at 1:19 PM, François Beausoleil <francois@teksol.info> wrote:

On 2013-04-17 at 14:15, Jeff Janes wrote:

It looks like 12% of the time is being spent figuring out what rows to insert, and 88% actually doing the insertions.

So I think that index maintenance is killing you.  You could try adding a sort to your select so that rows are inserted in index order, or inserting in batches in which the batches are partitioned by service_id (which is almost the same thing as sorting, since service_id is the lead column)

This analysis is based on your example, which inserted 7 million rows.  But I just noticed you also said you only have a few thousand rows to insert per day.  So if you make your example better match your use case, perhaps that analysis would no longer hold.
 

In that case, partitioning the original table by service_id % N would help, since the index would be much smaller, right?

Probably not.  If you partition the table but do not change your loading method, then the relevant thing would be the sum of the index sizes over all partitions, which would be about the same as now.

On the other hand, if you change the method to load the data in batches, you don't need to partition the table, you just need to align the batches with the index order.  You could use partitioning as a way to do that, but it is just as easy (or easier) to do so without partitioning.

Once you solve the index maintenance problem, partitioning might help solve the select part of the deduplication problem, though.  You would only need to check against existing data for the partition into which you already know the current batch is going to be loaded.

Also, the constraint_exclusion code is usually not smart enough to deal with constraints that use the modulus (unless the modulus itself appears in the where clause).  You would have to use range partitioning instead.
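For reference, range partitioning in this era of PostgreSQL means table inheritance plus CHECK constraints. A minimal sketch (child name and boundary values are purely illustrative):

-- One child per service_id range; repeat for the other ranges.
CREATE TABLE persona_followers_p00 (
    CHECK ( service_id >= 0 AND service_id < 100000000 )
) INHERITS (persona_followers);

ALTER TABLE persona_followers_p00
    ADD PRIMARY KEY (service_id, follower_id);

-- Lets the planner skip children whose CHECK constraint rules them out
-- ('partition' is the default setting).
SET constraint_exclusion = partition;

Rows would also have to be routed to the correct child, typically with an insert trigger on the parent table.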
 
Cheers,

Jeff

Re: Most efficient way to insert without duplicates

From
François Beausoleil
Date:
Hi!

On 2013-04-17 at 14:15, Jeff Janes wrote:

On Wed, Apr 17, 2013 at 4:26 AM, François Beausoleil <francois@teksol.info> wrote:


 Insert on public.persona_followers  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
   Buffers: shared hit=33135295 read=4776921
   ->  Subquery Scan on t1  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)


It looks like 12% of the time is being spent figuring out what rows to insert, and 88% actually doing the insertions.

So I think that index maintenance is killing you.  You could try adding a sort to your select so that rows are inserted in index order, or inserting in batches in which the batches are partitioned by service_id (which is almost the same thing as sorting, since service_id is the lead column)


To close out the thread, the final results are in http://explain.depesz.com/s/xOe :

Insert on public.persona_followers  (cost=149905.33..149906.58 rows=100 width=24) (actual time=19.837..19.837 rows=0 loops=1)
  Buffers: shared hit=206, local hit=1 read=105
  ->  Sort  (cost=149905.33..149905.58 rows=100 width=24) (actual time=19.534..19.536 rows=6 loops=1)
        Output: persona_followers_import.service_id, persona_followers_import.follower_id, (min(persona_followers_import.valid_at))
        Sort Key: persona_followers_import.follower_id
        Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=176, local hit=1 read=105
        ->  HashAggregate  (cost=149901.01..149902.01 rows=100 width=24) (actual time=19.514..19.526 rows=6 loops=1)
              Output: persona_followers_import.service_id, persona_followers_import.follower_id, min(persona_followers_import.valid_at)
              Buffers: shared hit=176, local hit=1 read=105
              ->  Bitmap Heap Scan on pg_temp_35.persona_followers_import  (cost=93051.86..149734.25 rows=22234 width=24) (actual time=14.350..19.505 rows=6 loops=1)
                    Output: persona_followers_import.service_id, persona_followers_import.follower_id, persona_followers_import.valid_at
                    Recheck Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
                    Filter: (NOT (hashed SubPlan 1))
                    Buffers: shared hit=176, local hit=1 read=105
                    ->  Bitmap Index Scan on persona_followers_import_service_id  (cost=0.00..1134.32 rows=44469 width=0) (actual time=1.752..1.752 rows=10000 loops=1)
                          Index Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL))
                          Buffers: local hit=1 read=40
                    SubPlan 1
                      ->  Bitmap Heap Scan on public.persona_followers  (cost=661.54..91851.35 rows=24252 width=8) (actual time=2.309..6.400 rows=14317 loops=1)
                            Output: public.persona_followers.follower_id
                            Recheck Cond: (public.persona_followers.service_id = 362513855)
                            Buffers: shared hit=176
                            ->  Bitmap Index Scan on persona_followers_pkey  (cost=0.00..655.48 rows=24252 width=0) (actual time=2.284..2.284 rows=14317 loops=1)
                                  Index Cond: (public.persona_followers.service_id = 362513855)
                                  Buffers: shared hit=88
Total runtime: 19.917 ms

Runtime is under 20 milliseconds, per imported service_id. I have a few thousand such items per day, and that's fine. The final script looks like this:

CREATE TEMPORARY TABLE persona_followers_import( service_id bigint, follower_id bigint, valid_at timestamp without time zone );
COPY persona_followers_import FROM stdin;
...
\.

CREATE INDEX index_persona_followers_import_on_service_id ON persona_followers_import(service_id, follower_id);

service_ids := SELECT DISTINCT service_id FROM persona_followers_import;
for each service_id in service_ids:
  EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS )
  INSERT INTO persona_followers(service_id, follower_id, valid_at)
    SELECT service_id, follower_id, MIN(valid_at)
    FROM persona_followers_import
    WHERE follower_id IS NOT NULL
      AND follower_id NOT IN (SELECT follower_id FROM persona_followers WHERE service_id = :service_id)
      AND service_id = :service_id
    GROUP BY service_id, follower_id
    ORDER BY follower_id

This seems to give me the best possible throughput. I was able to import days of data in an hour, compared to hours of work for one day of data.
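A runnable form of that per-service_id loop, as a single PL/pgSQL block, might look like this (a sketch only; it assumes the import table carries the valid_at column shown in the plan above):

DO $$
DECLARE
  sid bigint;
BEGIN
  -- One small, index-ordered insert per service_id.
  FOR sid IN SELECT DISTINCT service_id
             FROM persona_followers_import
             WHERE service_id IS NOT NULL LOOP
    INSERT INTO persona_followers(service_id, follower_id, valid_at)
      SELECT service_id, follower_id, MIN(valid_at)
      FROM persona_followers_import
      WHERE service_id = sid
        AND follower_id IS NOT NULL
        AND follower_id NOT IN (SELECT follower_id
                                FROM persona_followers
                                WHERE service_id = sid)
      GROUP BY service_id, follower_id
      ORDER BY follower_id;
  END LOOP;
END
$$;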

Thanks for all suggestions, and PostgreSQL rocks!
François Beausoleil

Re: Most efficient way to insert without duplicates

From
Amador Alvarez
Date:
I would also give it a try on turning on statistics on service_id and follower_id fields and tune collecting of distinct values for the optimizer.
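For example, something along these lines (the statistics target value is illustrative only):

-- Raise the per-column statistics target so ANALYZE samples more values
-- and the planner gets better n_distinct estimates for these columns.
ALTER TABLE persona_followers ALTER COLUMN service_id  SET STATISTICS 1000;
ALTER TABLE persona_followers ALTER COLUMN follower_id SET STATISTICS 1000;
ANALYZE persona_followers;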

Cheers,

Amador A.

