Re: Duplicate deletion optimizations - Mailing list pgsql-performance

From Misa Simic
Subject Re: Duplicate deletion optimizations
In response to Duplicate deletion optimizations  (antoine@inaps.org)
List pgsql-performance
Hi,

Maybe these thoughts could help....

1) An ORDER BY on those three columns in your SELECT min(id) query
could push the planner towards using the index (see the first sketch
below)...

2) or

DELETE FROM table
WHERE EXISTS(SELECT id FROM table t
             WHERE t.id > table.id
               AND t.col1 = table.col1
               AND t.col2 = table.col2
               AND t.col3 = table.col3)
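
For reference, here are both suggestions spelled out against the table
and columns from the quoted message below (sketches only, not tested
against the real data):

    -- 1) the min(id) query with an ORDER BY on the grouped columns,
    --    which may or may not change the chosen plan:
    SELECT min(id) FROM stats_5mn
    GROUP BY t_value, t_record, output_id
    HAVING count(*) > 1
    ORDER BY t_value, t_record, output_id;

    -- 2) the EXISTS-based delete, keeping only the newest row (highest
    --    id) of each (t_value, t_record, output_id) triplet:
    DELETE FROM stats_5mn s
    WHERE EXISTS (SELECT 1 FROM stats_5mn t
                  WHERE t.id > s.id
                    AND t.t_value   = s.t_value
                    AND t.t_record  = s.t_record
                    AND t.output_id = s.output_id);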



Sent from my Windows Phone
From: antoine@inaps.org
Sent: 06/01/2012 15:36
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Duplicate deletion optimizations
Hello,

I have a table with approximately 50 million rows and a schema like
this:

     id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
     t_value integer NOT NULL DEFAULT 0,
     t_record integer NOT NULL DEFAULT 0,
     output_id integer NOT NULL DEFAULT 0,
     count bigint NOT NULL DEFAULT 0,
     CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process has to insert a few thousand rows into this
table, but sometimes the process has to insert an already existing row
(based on the values of the triplet (t_value, t_record, output_id)).
In this case, the row must be updated with the new count value. I've
tried some solutions given in these Stack Overflow questions [1], but
the insertion rate is always too low for my needs.
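
(The approaches discussed in [1] are per-row "update, then insert if
missing" upserts, roughly of the following shape; this is only a sketch
with placeholder parameters $1..$4, not the exact code that was tried:)

     UPDATE stats_5mn
        SET count = $4
      WHERE t_value = $1 AND t_record = $2 AND output_id = $3;

     INSERT INTO stats_5mn (t_value, t_record, output_id, count)
     SELECT $1, $2, $3, $4
     WHERE NOT EXISTS (SELECT 1 FROM stats_5mn
                       WHERE t_value = $1 AND t_record = $2
                         AND output_id = $3);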

So, I've decided to do it in two steps:

  - I insert all my new data with a COPY command (see the sketch below)
  - When it's done, I run a delete query to remove the oldest duplicates
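
A minimal sketch of the COPY step, assuming a CSV input file (the file
path and format below are placeholders, not taken from the original
mail):

     COPY stats_5mn (t_value, t_record, output_id, count)
     FROM '/path/to/new_data.csv' WITH (FORMAT csv);  -- hypothetical file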

Right now, my delete query is built around this subquery:

     SELECT min(id) FROM stats_5mn
     GROUP BY t_value, t_record, output_id
     HAVING count(*) > 1;
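
(The full DELETE statement isn't quoted here; one way the subquery might
be wrapped, as a sketch rather than the exact statement used, is:)

     DELETE FROM stats_5mn
     WHERE id IN (SELECT min(id) FROM stats_5mn
                  GROUP BY t_value, t_record, output_id
                  HAVING count(*) > 1);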

The duration of the query on my test machine with approx. 16 million
rows is ~18s.

To reduce this duration, I've tried to add an index on my triplet:

     CREATE INDEX test
       ON stats_5mn
       USING btree
       (t_value, t_record, output_id);

By default, the PostgreSQL planner doesn't want to use my index and does
a sequential scan [2], but if I force it with "SET enable_seqscan = off",
the index is used [3] and the query duration drops to ~5s.
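
In other words, the faster run in [3] corresponds roughly to the
following (a sketch; the setting only applies to the current session):

     SET enable_seqscan = off;

     SELECT min(id) FROM stats_5mn
     GROUP BY t_value, t_record, output_id
     HAVING count(*) > 1;

     RESET enable_seqscan;  -- restore the planner default afterwards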


My questions:

  - Why does the planner refuse to use my index?
  - Is there a better method for my problem?


Thanks in advance for your help,
Antoine Millet.


[1]
http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql

http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different

[2] http://explain.depesz.com/s/UzW :
     GroupAggregate  (cost=1167282.380..1294947.770 rows=762182 width=20) (actual time=20067.661..20067.661 rows=0 loops=1)
       Filter: (five(*) > 1)
       ->  Sort  (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1)
             Sort Key: delta, kilo, four
             Sort Method:  external merge  Disk: 223512kB
             ->  Seq Scan on three  (cost=0.000..139734.140 rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 loops=1)

[3] http://explain.depesz.com/s/o9P :
     GroupAggregate  (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1)
       Filter: (five(*) > 1)
       ->  Index Scan using charlie on three  (cost=0.000..11422738.330 rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1)

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
