Re: Duplicate deletion optimizations - Mailing list pgsql-performance
From | Misa Simic |
---|---|
Subject | Re: Duplicate deletion optimizations |
Date | |
Msg-id | -8455307354303926221@iso-8859-1msgid Whole thread Raw |
In response to | Duplicate deletion optimizations (antoine@inaps.org) |
List | pgsql-performance |
hi, Maybe these thoughts could help.... 1) order by those three columns in your select min query could force index usage... 2) or DELETE FROM table WHERE EXISTS(SELECT id FROM table t WHERE t.id > table.id AND t.col1 = table.col1 AND t.col2 = table.col2 AND col3 = table.col3) Sent from my Windows Phone From: antoine@inaps.org Sent: 06/01/2012 15:36 To: pgsql-performance@postgresql.org Subject: [PERFORM] Duplicate deletion optimizations Hello, I've a table with approximately 50 million rows with a schema like this: id bigint NOT NULL DEFAULT nextval('stats_5mn'::regclass), t_value integer NOT NULL DEFAULT 0, t_record integer NOT NULL DEFAULT 0, output_id integer NOT NULL DEFAULT 0, count bigint NOT NULL DEFAULT 0, CONSTRAINT stats_mcs_5min_pkey PRIMARY KEY (id) Every 5 minutes, a process have to insert a few thousand of rows in this table, but sometime, the process have to insert an already existing row (based on values in the triplet (t_value, t_record, output_id). In this case, the row must be updated with the new count value. I've tried some solution given on this stackoverflow question [1] but the insertion rate is always too low for my needs. So, I've decided to do it in two times: - I insert all my new data with a COPY command - When it's done, I run a delete query to remove oldest duplicates Right now, my delete query look like this: SELECT min(id) FROM stats_5mn GROUP BY t_value, t_record, output_id HAVING count(*) > 1; The duration of the query on my test machine with approx. 16 million rows is ~18s. To reduce this duration, I've tried to add an index on my triplet: CREATE INDEX test ON stats_5mn USING btree (t_value , t_record , output_id ); By default, the PostgreSQL planner doesn't want to use my index and do a sequential scan [2], but if I force it with "SET enable_seqscan = off", the index is used [3] and query duration is lowered to ~5s. My questions: - Why the planner refuse to use my index? - Is there a better method for my problem? Thanks by advance for your help, Antoine Millet. [1] http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql http://stackoverflow.com/questions/3464750/postgres-upsert-insert-or-update-only-if-value-is-different [2] http://explain.depesz.com/s/UzW : GroupAggregate (cost=1167282.380..1294947.770 rows=762182 width=20) (actual time=20067.661..20067.661 rows=0 loops=1) Filter: (five(*) > 1) -> Sort (cost=1167282.380..1186336.910 rows=7621814 width=20) (actual time=15663.549..17463.458 rows=7621805 loops=1) Sort Key: delta, kilo, four Sort Method: external merge Disk: 223512kB -> Seq Scan on three (cost=0.000..139734.140 rows=7621814 width=20) (actual time=0.041..2093.434 rows=7621805 loops=1) [3] http://explain.depesz.com/s/o9P : GroupAggregate (cost=0.000..11531349.190 rows=762182 width=20) (actual time=5307.734..5307.734 rows=0 loops=1) Filter: (five(*) > 1) -> Index Scan using charlie on three (cost=0.000..11422738.330 rows=7621814 width=20) (actual time=0.046..2062.952 rows=7621805 loops=1) -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
pgsql-performance by date: