Thread: Need efficient way to remove (almost) duplicate rows from a table
v11.5 on linux (I know, I know, IT is working on the upgrade to 15.2)
I have a table with millions of records and there are a lot of "almost" duplicates that I want to get rid of in an efficient way. Best to illustrate with a simple example...
We'll talk about deleting leaves on a tree where each twig can have many leaves, but never 2 or more leaves of the exact same surface area. What I have now are a lot of twigs with leaves having the same surface area (and some different ones too) and I want to get rid of the duplicates for each twig.
create table twigs (limb_id int, branch_id int, twig_id int, surfarea float);
insert into twigs (limb_id,branch_id,twig_id,surfarea) values
(1,1,1,100.0),
(1,1,2,103.7),
(1,1,3,103.7),
(1,1,4,110.4),
(1,2,1,120.0),
(1,2,2,123.6),
(1,2,3,128.1),
(1,2,4,128.1),
(1,2,5,128.1),
(2,1,1,100.0),
(2,1,3,104.4),
(2,1,4,131.9);
The duplicates are twig_ids 2 and 3 under (limb 1, branch 1), which share surfarea 103.7, and twig_ids 3, 4 and 5 under (limb 1, branch 2), which share 128.1. I want to get rid of all but one of the dups. Which "twig_id" that's left behind doesn't matter.
This would do it...
delete from twigs where limb_id=1 and branch_id=1 and twig_id=3;
delete from twigs where limb_id=1 and branch_id=2 and twig_id in (4,5);
But there are millions of these duplicates and it'll take forever like this.
I was going to approach this with a perl/DBI script: shove the duplicate record identifiers (limb_id, branch_id, twig_id) into perl arrays, then submit the delete command as a prepared statement that accepts the arrays as values, a-la...

my $cnt = $sth->execute_array({ ArrayTupleStatus => \my @tuple_status }, \@limb_id_arr, \@branch_id_arr, \@twig_id_arr) or die "-F- Failed to execute '$sqf'\n";

But I'm not sure that'll be faster. Depends on how perl/DBI handles it, I suppose.
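(For reference, if the array-of-keys route wins, the whole batch can also be expressed server-side in one statement by unnesting parallel arrays into rows; in DBI the three arrays would be bound as parameters. A sketch with literal arrays standing in for the bind values:

```sql
-- Sketch: delete a batch of (limb_id, branch_id, twig_id) keys in one
-- round trip.  Multi-argument unnest() in FROM/USING zips the arrays
-- into rows, one key per row.
delete from twigs as t
using  unnest(array[1,1]::int[],      -- limb_ids
              array[1,2]::int[],      -- branch_ids
              array[3,4]::int[])      -- twig_ids
       as k(limb_id, branch_id, twig_id)
where  (t.limb_id, t.branch_id, t.twig_id)
     = (k.limb_id, k.branch_id, k.twig_id);
```

That trades many single-row deletes for one set-oriented one, which is usually the bigger win than how the client batches.)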
Then I was thinking it might just be better to create a parallel table and insert records into it by copying from the table with the dups, taking care to avoid inserting duplicates. Once done, drop the original table and rename the copy to the original's name. If that's a good approach, then how exactly do I articulate the insert statement to avoid the duplicates? This approach might also be good in that I can do it outside a transaction. If the target table gets goofed up, the original data is still OK, no harm done, can try again.
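(One way to phrase that copy-without-dups insert is with DISTINCT ON, a Postgres extension; this is a sketch assuming the lowest twig_id in each (limb_id, branch_id, surfarea) group should be the survivor, and that the table names are free to use:

```sql
-- Sketch: copy one row per (limb_id, branch_id, surfarea) group into a
-- parallel table, then swap it in.  DISTINCT ON keeps the first row of
-- each group according to the ORDER BY -- here, the lowest twig_id.
create table twigs_dedup (like twigs including all);

insert into twigs_dedup (limb_id, branch_id, twig_id, surfarea)
select distinct on (limb_id, branch_id, surfarea)
       limb_id, branch_id, twig_id, surfarea
from   twigs
order  by limb_id, branch_id, surfarea, twig_id;

drop table twigs;
alter table twigs_dedup rename to twigs;
```

The new table is written sequentially with no dead tuples, which is often faster than a huge delete when most rows survive.)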
Any other ideas?
Again, millions of duplicates and I don't want to overload any PG system stuff in the process.
Thanks in Advance!
https://dba.stackexchange.com/a/138327/63913
On 10/17/23 17:48, David Gauthier wrote:
> I have a table with millions of records and there are a lot of "almost"
> duplicates that I want to get rid of in an efficient way. Best to
> illustrate with a simple example...
> [snip]
--
Born in Arizona, moved to Babylonia.
On 10/17/23 15:48, David Gauthier wrote:
> I have a table with millions of records and there are a lot of "almost"
> duplicates that I want to get rid of in an efficient way. Best to
> illustrate with a simple example...

This sounds like a classic use-case for a "correlated subquery". Assuming you want to keep the lowest twig_id, you could do this:

```
paul=# select * from twigs;
 limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
       1 |         1 |       1 |      100
       1 |         1 |       2 |    103.7
       1 |         1 |       3 |    103.7
       1 |         1 |       4 |    110.4
       1 |         2 |       1 |      120
       1 |         2 |       2 |    123.6
       1 |         2 |       3 |    128.1
       1 |         2 |       4 |    128.1
       1 |         2 |       5 |    128.1
       2 |         1 |       1 |      100
       2 |         1 |       3 |    104.4
       2 |         1 |       4 |    131.9
(12 rows)

Time: 0.218 ms
paul=# delete from twigs as t1
       where exists (select 1 from twigs as t2
                     where (t1.limb_id, t1.branch_id, t1.surfarea)
                         = (t2.limb_id, t2.branch_id, t2.surfarea)
                       and t1.twig_id > t2.twig_id);
DELETE 3
Time: 8.555 ms
paul=# select * from twigs;
 limb_id | branch_id | twig_id | surfarea
---------+-----------+---------+----------
       1 |         1 |       1 |      100
       1 |         1 |       2 |    103.7
       1 |         1 |       4 |    110.4
       1 |         2 |       1 |      120
       1 |         2 |       2 |    123.6
       1 |         2 |       3 |    128.1
       2 |         1 |       1 |      100
       2 |         1 |       3 |    104.4
       2 |         1 |       4 |    131.9
(9 rows)

Time: 0.231 ms
```

That says "delete from t1 where there is a matching t2 with a lower twig_id." If you want to keep the highest-numbered twig_id, the SQL is easy to adjust.

Regards,

--
Paul ~{:-)
pj@illuminatedcomputing.com
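(A variant worth noting, in case a table like this has no unique column to break ties with: the same correlated delete can lean on Postgres's system column ctid instead, which is the approach the dba.stackexchange link earlier in the thread describes. A sketch:

```sql
-- Sketch: when no user column distinguishes the duplicates, the system
-- column ctid (a row's physical location) serves as the tie-breaker.
-- DELETE ... USING is the Postgres spelling of a delete-with-join.
delete from twigs as t1
using  twigs as t2
where  (t1.limb_id, t1.branch_id, t1.surfarea)
     = (t2.limb_id, t2.branch_id, t2.surfarea)
and    t1.ctid > t2.ctid;
```

ctid is only stable within a single statement (updates and VACUUM FULL move rows), so it should never be stored, only used as a tie-breaker like this.)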
I'm not clear which rows are duplicates. It looks like limb_id, branch_id and twig_id make up a unique key. If there was a row (2,2,7,103.7), would it be a duplicate of rows (1,1,2,103.7) and (1,1,3,103.7)?

--
Mike Nolan