Thread: delete duplicates takes too long
Hi, I have a table:
CREATE TABLE public.cdr_ama_stat (
id int4 NOT NULL DEFAULT nextval('cdr_ama_stat_id_seq'::regclass),
abonado_a varchar(30) NULL,
abonado_b varchar(30) NULL,
fecha_llamada timestamp NULL,
duracion int4 NULL,
puerto_a varchar(4) NULL,
puerto_b varchar(4) NULL,
tipo_llamada char(1) NULL,
processed int4 NULL,
PRIMARY KEY(id)
);

CREATE INDEX kpi_fecha_llamada
ON public.cdr_ama_stat (fecha_llamada);
The combination of abonado_a, abonado_b, fecha_llamada, and duracion should be unique in every row. Googling around, I found how to delete duplicates on the Postgres OnLine site,
so I ran the following query (let's say I want to know how many duplicates exist for 2009-04-18 before deleting them):
SELECT * FROM cdr_ama_stat
WHERE id NOT IN
    (SELECT MAX(dt.id)
     FROM cdr_ama_stat AS dt
     WHERE dt.fecha_llamada BETWEEN '2009-04-18'
         AND '2009-04-18'::timestamp + INTERVAL '1 day'
     GROUP BY dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion)
AND fecha_llamada BETWEEN '2009-04-18'
    AND '2009-04-18'::timestamp + INTERVAL '1 day';
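[Editor's note: the keep-the-highest-id-per-group logic of the query above can be checked on a toy dataset. This is a minimal sketch using Python's sqlite3 with made-up rows; only the four key columns from the post are kept.]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE cdr_ama_stat (
        id INTEGER PRIMARY KEY,
        abonado_a TEXT, abonado_b TEXT,
        fecha_llamada TEXT, duracion INTEGER
    )
""")
rows = [
    (1, "100", "200", "2009-04-18 10:00:00", 60),
    (2, "100", "200", "2009-04-18 10:00:00", 60),  # duplicate of id 1
    (3, "101", "201", "2009-04-18 11:00:00", 30),
]
con.executemany("INSERT INTO cdr_ama_stat VALUES (?, ?, ?, ?, ?)", rows)

# Rows that are NOT the highest id of their (abonado_a, abonado_b,
# fecha_llamada, duracion) group are the duplicates the query selects.
dupes = con.execute("""
    SELECT id FROM cdr_ama_stat
    WHERE id NOT IN (
        SELECT MAX(id) FROM cdr_ama_stat
        GROUP BY abonado_a, abonado_b, fecha_llamada, duracion
    )
""").fetchall()
print(dupes)  # [(1,)] -- id 1 is flagged; ids 2 and 3 are the keepers
```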
My problem is that the query takes forever. Number of rows:
kpi=# select count(*) from cdr_ama_stat;
count
---------
5908065
(1 row)
This is the EXPLAIN output for the above query:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using kpi_fecha_llamada on cdr_ama_stat (cost=115713.94..79528582.40 rows=140809 width=50)
Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
Filter: (NOT (subplan))
SubPlan
-> Materialize (cost=115713.94..116202.56 rows=28162 width=34)
-> GroupAggregate (cost=110902.49..115478.78 rows=28162 width=34)
-> Sort (cost=110902.49..111606.53 rows=281618 width=34)
Sort Key: dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion
-> Bitmap Heap Scan on cdr_ama_stat dt (cost=8580.03..70970.30 rows=281618 width=34)
Recheck Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
-> Bitmap Index Scan on kpi_fecha_llamada (cost=0.00..8509.62 rows=281618 width=0)
Index Cond: ((fecha_llamada >= '2009-04-18 00:00:00'::timestamp without time zone) AND (fecha_llamada <= '2009-04-19 00:00:00'::timestamp without time zone))
Am I doing something wrong?
I think several minutes should be OK, but not several hours as happens now. I do a bulk load (about 350k rows) into that table every day, and yes, I did VACUUM FULL ANALYZE the database before posting these results.
The server is a dual Xeon 3.0 GHz with 2 GB RAM.
Best regards
On Fri, Apr 24, 2009 at 5:37 PM, Miguel Miranda
<miguel.mirandag@gmail.com> wrote:
> hi , i hava a table:
> CREATE TABLE public.cdr_ama_stat (
> id int4 NOT NULL DEFAULT nextval('cdr_ama_stat_id_seq'::regclass),
> abonado_a varchar(30) NULL,
> abonado_b varchar(30) NULL,
> fecha_llamada timestamp NULL,
> duracion int4 NULL,
> puerto_a varchar(4) NULL,
> puerto_b varchar(4) NULL,
> tipo_llamada char(1) NULL,
> processed int4 NULL,
> PRIMARY KEY(id)
> )
> GO
> CREATE INDEX kpi_fecha_llamada
> ON public.cdr_ama_stat(fecha_llamada)
>
> there should be unique values for abonado_a, abonado_b, fecha_llamada,
> duracion in every row, googling around i found how to delete duplicates in
> postgresonline site ,

Then why not have a unique index on those rows together?

> so i run the following query (lets say i want to know how many duplicates
> exists for 2004-04-18, before delete them):
>
> SELECT * FROM cdr_ama_stat
> WHERE id NOT IN
> (SELECT MAX(dt.id)
> FROM cdr_ama_stat As dt
> WHERE dt.fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp +
> INTERVAL '1 day'
> GROUP BY dt.abonado_a, dt.abonado_b, dt.fecha_llamada, dt.duracion)
> AND fecha_llamada BETWEEN '2009-04-18' AND '2009-04-18'::timestamp +
> INTERVAL '1 day'
>
> my problem is that the query take forever, number of rows:

Have you tried throwing more work_mem at the problem?

The other method to do this uses no GROUP BY but a join clause.
Depending on the number of dupes it can be faster or slower.

delete from table x where x.id in
(select a.id from table a join table b on (a.somefield = b.somefield
and a.id < b.id))

Or something like that.
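[Editor's note: the self-join delete sketched above can be demonstrated on a toy table. A minimal sketch using Python's sqlite3 with hypothetical data; "somefield" stands in for the real duplicate key.]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, somefield TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, "a"), (2, "a"), (3, "b"), (4, "b"), (5, "c")])

# Every row with a smaller id than some other row sharing its value joins
# successfully, so it lands in the delete list; the largest id per value
# (and any value with no duplicate) survives.
con.execute("""
    DELETE FROM t WHERE id IN (
        SELECT a.id FROM t AS a JOIN t AS b
          ON a.somefield = b.somefield AND a.id < b.id
    )
""")
survivors = con.execute("SELECT id FROM t ORDER BY id").fetchall()
print(survivors)  # [(2,), (4,), (5,)]
```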
I can't use a unique index because I only want to check for duplicates where processed = 2; for simplicity, I did not include that condition in the example.
Miguel Miranda wrote:
> I cant use a unique index because i only want to check for duplicates where
> processed = 2, for simplicity i did not include that condition in the
> example.

In that case you can create a partial unique index:

create index foo on cdr_ama_stat (abonado_a, abonado_b, fecha_llamada,
duracion) where processed = 2;

--
Alvaro Herrera                         http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Sat, Apr 25, 2009 at 2:51 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> In that case you can create a partial unique index:
>
> create index foo on cdr_ama_stat (abonado_a, abonado_b, fecha_llamada,
> duracion) where processed = 2;

Of course, the UNIQUE is missing in this one:

CREATE UNIQUE INDEX foo ON cdr_ama_stat (abonado_a, abonado_b,
fecha_llamada, duracion) WHERE processed = 2;
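[Editor's note: the corrected partial unique index can be exercised end to end. A minimal sketch using Python's sqlite3 — recent SQLite versions (3.8.0+) also support partial indexes — with made-up rows and only the columns needed to show the behavior.]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE cdr (id INTEGER PRIMARY KEY,
                                 abonado_a TEXT, processed INTEGER)""")
# Uniqueness is enforced only for rows matching the WHERE clause.
con.execute("""CREATE UNIQUE INDEX cdr_uniq ON cdr (abonado_a)
               WHERE processed = 2""")

con.execute("INSERT INTO cdr VALUES (1, '100', 2)")
con.execute("INSERT INTO cdr VALUES (2, '100', 1)")  # allowed: processed <> 2

rejected = False
try:
    con.execute("INSERT INTO cdr VALUES (3, '100', 2)")  # duplicate under the index
except sqlite3.IntegrityError:
    rejected = True
print("duplicate rejected:", rejected)
```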