Re: Need efficient way to remove (almost) duplicate rows from a table - Mailing list pgsql-general

From Paul Jungwirth
Subject Re: Need efficient way to remove (almost) duplicate rows from a table
Date
Msg-id c6e4f48d-3af4-4d65-4031-b1e447e09214@illuminatedcomputing.com
Whole thread Raw
In response to Need efficient way to remove (almost) duplicate rows from a table  (David Gauthier <dfgpostgres@gmail.com>)
List pgsql-general
On 10/17/23 15:48, David Gauthier wrote:
> I have a table with millions of records and there are a lot of "almost" 
> duplicates that I want to get rid of in an efficient way.  Best to 
> illustrate with a simple example...

This sounds like a classic use-case for a "correlated subquery". 
Assuming you want to keep the lowest twig_id, you could do this:

```
paul=# select * from twigs;
  limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
        1 |         1 |       1 |      100
        1 |         1 |       2 |    103.7
        1 |         1 |       3 |    103.7
        1 |         1 |       4 |    110.4
        1 |         2 |       1 |      120
        1 |         2 |       2 |    123.6
        1 |         2 |       3 |    128.1
        1 |         2 |       4 |    128.1
        1 |         2 |       5 |    128.1
        2 |         1 |       1 |      100
        2 |         1 |       3 |    104.4
        2 |         1 |       4 |    131.9
(12 rows)

Time: 0.218 ms
paul=# delete from twigs as t1
where exists (select 1 from twigs as t2 where (t1.limb_id, t1.branch_id, 
t1.surfarea) = (t2.limb_id, t2.branch_id, t2.surfarea) and t1.twig_id > 
t2.twig_id);
DELETE 3
Time: 8.555 ms
paul=# select * from twigs;
  limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
        1 |         1 |       1 |      100
        1 |         1 |       2 |    103.7
        1 |         1 |       4 |    110.4
        1 |         2 |       1 |      120
        1 |         2 |       2 |    123.6
        1 |         2 |       3 |    128.1
        2 |         1 |       1 |      100
        2 |         1 |       3 |    104.4
        2 |         1 |       4 |    131.9
(9 rows)

Time: 0.231 ms
```

That says "delete from t1 where there is a matching t2 with a lower 
twig_id."

If you want to keep the highest-numbered twig_id, the sql is easy to adjust.

Regards,

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com



pgsql-general by date:

Previous
From: Ron
Date:
Subject: Re: Need efficient way to remove (almost) duplicate rows from a table
Next
From: Merlin Moncure
Date:
Subject: Re: Inheritance in PostgreSQL