Re: Duplicate deletion optimizations - Mailing list pgsql-performance

From Samuel Gendler
Subject Re: Duplicate deletion optimizations
Date
Msg-id CAEV0TzCqavqQTtsORx3thrKM3FjhahK4y5Zx4HF4W+-Bosdrmg@mail.gmail.com
In response to Duplicate deletion optimizations  (antoine@inaps.org)
Responses Re: Duplicate deletion optimizations  (Marc Eberhard <eberhardma@googlemail.com>)
List pgsql-performance


On Fri, Jan 6, 2012 at 6:35 AM, <antoine@inaps.org> wrote:
Hello,

I have a table with approximately 50 million rows and a schema like this:

   id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass),
   t_value integer NOT NULL DEFAULT 0,
   t_record integer NOT NULL DEFAULT 0,
   output_id integer NOT NULL DEFAULT 0,
   count bigint NOT NULL DEFAULT 0,
   CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id)

Every 5 minutes, a process has to insert a few thousand rows into this table,
but sometimes the process has to insert an already existing row (based on the
values in the triplet (t_value, t_record, output_id)). In this case, the row
must be updated with the new count value. I've tried some solutions given in this
stackoverflow question [1], but the insertion rate is always too low for my needs.

So, I've decided to do it in two steps:

 - I insert all my new data with a COPY command (sketched below)
 - When it's done, I run a delete query to remove the oldest duplicates
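
The COPY step itself is the easy part and looks something like this (the file
name and format here are just an example):

   COPY stats_5mn (t_value, t_record, output_id, count)
       FROM '/tmp/stats_new.csv' WITH CSV;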

Right now, my delete query looks like this:

   DELETE FROM stats_5mn WHERE id IN (
       SELECT min(id) FROM stats_5mn
       GROUP BY t_value, t_record, output_id
       HAVING count(*) > 1
   );

The duration of the query on my test machine with approx. 16 million rows is ~18s.

Have you considered doing the insert as a bulk insert into a temp table, then pulling the rows that don't exist across to the final table in one query and updating the rows that do exist in another?  I did a very brief scan of the SO thread and didn't see it suggested.  Something like this:

update stats_5mn set count = count + t.count
from temp_table t
where stats_5mn.t_value = t.t_value and stats_5mn.t_record = t.t_record and stats_5mn.output_id = t.output_id;

insert into stats_5mn (t_value, t_record, output_id, count)
select t.t_value, t.t_record, t.output_id, t.count from temp_table t
where not exists (
select 1 from stats_5mn s
where s.t_value = t.t_value and s.t_record = t.t_record and s.output_id = t.output_id
);
 
drop table temp_table;
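
For completeness, the temp table itself would get created and loaded first, something along these lines (the file path is just a placeholder):

create temp table temp_table (
    t_value integer,
    t_record integer,
    output_id integer,
    count bigint
);

copy temp_table from '/tmp/stats_new.csv' with csv;

Being a temp table, it vanishes at the end of the session anyway; the explicit drop just lets the same connection run the whole cycle again five minutes later.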

Note - you must do the update before the insert because doing it the other way around will cause every row you just inserted to also be updated.

I'm not sure it'd be markedly faster, but you'd at least be able to retain a unique constraint on the triplet, if desired.  And, to my eye, the logic is easier to comprehend.

The different query structure may make better use of your index, but I imagine it is not using it currently because your db isn't configured to accurately reflect the real cost of index use vs sequential scan, so it is incorrectly determining the cost of looking up 7.5 million rows.  Its estimate of the row count is correct, so the estimate of the cost must be the problem.

We'd need to know more about your current config and hardware specs to be able to even start making suggestions about config changes to correct the problem.
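
If you want to check that theory, run the duplicate-finding query under EXPLAIN ANALYZE and compare estimated vs actual costs, and then look at the planner settings that most often skew the index-vs-seqscan decision (the statements below are just a diagnostic starting point, not recommended values):

explain analyze
select min(id) from stats_5mn
group by t_value, t_record, output_id
having count(*) > 1;

show random_page_cost;
show effective_cache_size;
show work_mem;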
