Thread: [GENERAL] Delete Duplicates with Using
Hello,
I run the SQL query below to delete duplicates from a table. The subquery is used to identify the duplicated rows (row_num is a BIGSERIAL column).
/** delete older copies of duplicates */
DELETE FROM table_with_duplicates AS T
WHERE row_num IN (
    SELECT T1.row_num
    FROM table_with_duplicates AS T1
    JOIN table_with_duplicates AS T2
      ON T1.column_1 = T2.column_1
     AND T1.column_2 = T2.column_2
     AND T1.column_3 = T2.column_3
     AND T1.row_num < T2.row_num
);
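The self-join pattern above can be exercised end to end in a quick SQLite sketch, with SQLite's implicit rowid standing in for the BIGSERIAL row_num column (the table contents here are made up for illustration):

```python
import sqlite3

# In-memory table with three duplicated (1, 1, 1) rows and one unique row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_with_duplicates (column_1 INT, column_2 INT, column_3 INT)")
conn.executemany(
    "INSERT INTO table_with_duplicates VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 3), (1, 1, 1)],
)

# Delete every copy of a duplicate except the one with the highest rowid,
# mirroring the Postgres query's "T1.row_num < T2.row_num" condition.
conn.execute("""
    DELETE FROM table_with_duplicates
    WHERE rowid IN (
        SELECT T1.rowid
        FROM table_with_duplicates AS T1
        JOIN table_with_duplicates AS T2
          ON T1.column_1 = T2.column_1
         AND T1.column_2 = T2.column_2
         AND T1.column_3 = T2.column_3
         AND T1.rowid < T2.rowid
    )
""")
rows = conn.execute(
    "SELECT * FROM table_with_duplicates ORDER BY column_1, column_2, column_3"
).fetchall()
print(rows)  # one (1, 1, 1) and one (1, 2, 3) remain
```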
Can anyone tell me how to rewrite that query to use the USING clause and hopefully remove the subquery?
The documentation mentions USING but there is no example and the only examples I found online are very trivial.
Thanks,
Igal Sapir
Lucee Core Developer
Lucee.org
DELETE FROM table_with_duplicates AS T1
USING table_with_duplicates AS T2
WHERE T1.column_1 = T2.column_1
  AND T1.column_2 = T2.column_2
  AND T1.column_3 = T2.column_3
  AND T1.row_num < T2.row_num;
On 10/14/2017 12:32 AM, legrand legrand wrote:
> DELETE FROM table_with_duplicates AS T1
> USING table_with_duplicates AS T2
> WHERE T1.column_1 = T2.column_1
>   AND T1.column_2 = T2.column_2
>   AND T1.column_3 = T2.column_3
>   AND T1.row_num < T2.row_num

Thank you, I actually thought about that at first, but it seemed "too easy" and I was looking for some solution with JOIN. Anyway, this seems to work great.

Thanks,
Igal
On 14.10.2017 at 08:20, Igal @ Lucee.org wrote:
> Hello,
>
> I run the SQL query below to delete duplicates from a table. The
> subquery is used to identify the duplicated rows (row_num is a
> BIGSERIAL column).
Another solution uses the CTID column (the rows (1,1,1) and (5,5,5) each appear twice):
test=*# select * from dubletten ;
 c1 | c2 | c3
----+----+----
  1 |  1 |  1
  1 |  1 |  1
  1 |  2 |  3
  2 |  3 |  4
  3 |  4 |  5
  4 |  5 |  5
  5 |  5 |  5
  5 |  5 |  5
(8 rows)
test=*# with keep as (select max(ctid) as ctid from dubletten group by c1,c2,c3) delete from dubletten where ctid not in (select ctid from keep);
DELETE 2
test=*# select * from dubletten ;
 c1 | c2 | c3
----+----+----
  1 |  1 |  1
  1 |  2 |  3
  2 |  3 |  4
  3 |  4 |  5
  4 |  5 |  5
  5 |  5 |  5
(6 rows)
test=*#
Regards, Andreas

2ndQuadrant - The PostgreSQL Support Company. www.2ndQuadrant.com
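The "keep max(ctid) per group" trick above has a direct analogue that can be run self-contained in SQLite, with rowid playing the role of Postgres's ctid and the CTE folded into a plain subquery (an analogue only; SQLite has no ctid):

```python
import sqlite3

# Reproduce the dubletten table from the transcript above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dubletten (c1 INT, c2 INT, c3 INT)")
conn.executemany(
    "INSERT INTO dubletten VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 3), (2, 3, 4),
     (3, 4, 5), (4, 5, 5), (5, 5, 5), (5, 5, 5)],
)

# Keep only the row with the highest rowid in each (c1, c2, c3) group,
# deleting the older copies of the duplicates.
conn.execute("""
    DELETE FROM dubletten
    WHERE rowid NOT IN (
        SELECT max(rowid) FROM dubletten GROUP BY c1, c2, c3
    )
""")
rows = conn.execute("SELECT * FROM dubletten ORDER BY c1, c2, c3").fetchall()
print(len(rows))  # 6 rows remain, matching the "DELETE 2" above
```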
I like this solution, but would using a subquery be much slower than the implicit join of `USING`? My tables are not big by Postgres standards, so it's probably not an issue, but I'm trying to learn as much as I can about Postgres now that I'm getting ready to move it to production.
Thanks,
Igal Sapir
Lucee Core Developer
Lucee.org
On 10/16/2017 8:58 AM, Igal @ Lucee.org wrote:
> test=*# with keep as (select max(ctid) as ctid from dubletten group by c1,c2,c3) delete from dubletten where ctid not in (select ctid from keep);
>
> I like this solution, but would using a subquery be much slower than the implicit join of `USING`? My tables are not big by Postgres standards, so it's probably not an issue, but I'm trying to learn as much as I can about Postgres now that I'm getting ready to move it to production.
I was able to compare the performance on a table with about 350k rows, with duplicates across 7 columns and no indexes.
The GROUP BY solution with the subquery (though I used a simpler version of it, without a CTE) was still running after 30 seconds, so I killed the query.
I then ran the USING version, which completed in 16 seconds and deleted 39 rows.
Best,
Igal Sapir
Lucee Core Developer
Lucee.org