Thread: limited field duplicates

limited field duplicates

From
Don Isgitt
Date:
Hello,

I am trying to find rows in a table that are duplicates only in the
sense of certain columns; i.e, not all columns are the same, and then
for each duplicate entry, I want to retrieve one of the columns that is
not part of the duplication criteria. Ok, enough English description,
here is the real information (simplified).
                Table "master" --~925000 records
      Column       |     Type     | Modifiers
-------------------+--------------+-----------
 apinum            | text         | not null
 state             | character(2) | not null
 county            | text         | not null
 township_n        | integer      |
 township_d        | character(1) |
 range_n           | integer      |
 range_d           | character(1) |
 section           | integer      |
 location          | text         | not null

(Indexes on state,county,township,range,apinum)

Find duplicates based on
state,county,township_n||township_d,range_n||range_d,section

select state,county,township_n||township_d as township,range_n||range_d
as range,section,location,count(*) from master where state='OK' and
county='MAJOR' group by state,county,township,range,section,location
having count(*)>1;

Works great, is efficient: now, how do I get the apinum associated with
each of the limited duplicate rows (i.e., those returned in the above
query). I need the apinum because another system involved only knows
about apinum.

It must be simple, but I have failed miserably in finding the answer.
The following horrible query works, but it takes forever for even the
small sample above.

 explain select m.apinum,m.township_n||m.township_d as
township,m.range_n||m.range_d as range,m.section,m.location from master
m where (select count(*) from master m1 where
m.township_n||m.township_d=m1.township_n||m1.township_d and
m.range_n||m.range_d=m1.range_n||m1.range_d and m.section=m1.section and
m.location=m1.location and m1.state='OK' and m1.county='MAJOR') > 1 and
m.state='OK' and m.county='MAJOR' order by township,range,section;
NOTICE:  QUERY PLAN:

Sort  (cost=7040332.16..7040332.16 rows=450 width=55)
  ->  Index Scan using mstcty on master m  (cost=0.00..7040312.32
rows=450 width=55)
        SubPlan
          ->  Aggregate  (cost=5208.86..5208.86 rows=1 width=0)
                ->  Index Scan using mstcty on master m1
(cost=0.00..5208.85 rows=1 width=0)

EXPLAIN


So, somebody please enlighten me. All help is appreciated.

Don



Re: limited field duplicates

From
Tom Lane
Date:
Don Isgitt <djisgitt@soundenergy.com> writes:
> I am trying to find rows in a table that are duplicates only in the
> sense of certain columns; i.e, not all columns are the same, and then
> for each duplicate entry, I want to retrieve one of the columns that is
> not part of the duplication criteria.

Try SELECT DISTINCT ON.  There's a simple example in the SELECT
reference page (something about weather reports I think).

            regards, tom lane

Re: limited field duplicates

From
Don Isgitt
Date:
Yes! Thank you very much!!

Tom Lane wrote:

>Don Isgitt <djisgitt@soundenergy.com> writes:
>
>>I am trying to find rows in a table that are duplicates only in the
>>sense of certain columns; i.e, not all columns are the same, and then
>>for each duplicate entry, I want to retrieve one of the columns that is
>>not part of the duplication criteria.
>>
>
>Try SELECT DISTINCT ON.  There's a simple example in the SELECT
>reference page (something about weather reports I think).
>
>            regards, tom lane
>
>



Re: limited field duplicates

From
Jean-Luc Lachance
Date:
Here something I tried:

create index master_all on master( country, state, township_n, range_n,
section);

select distinct m.*
from master m, master m1
where m.country = m1.country and
  m.state = m1.state and
  m.township_n = m1.township_n and
  m.range_n = m1.range_n and
  m.section = m1.section and
  m.apinum != m1.apinum;

Distinct is needed if there are more than 2 duplicates (ie triplicates
etc...) to reduce de number of combinations.

Make sure you vacuum analyze "master" after creating the index.

It is not clear if there is a one to one relation between the _n and the
_d fields.

JLL


Don Isgitt wrote:
>
> Hello,
>
> I am trying to find rows in a table that are duplicates only in the
> sense of certain columns; i.e, not all columns are the same, and then
> for each duplicate entry, I want to retrieve one of the columns that is
> not part of the duplication criteria. Ok, enough English description,
> here is the real information (simplified).
>                 Table "master" --~925000 records
>       Column       |     Type     | Modifiers
> -------------------+--------------+-----------
>  apinum            | text         | not null
>  state             | character(2) | not null
>  county            | text         | not null
>  township_n        | integer      |
>  township_d        | character(1) |
>  range_n           | integer      |
>  range_d           | character(1) |
>  section           | integer      |
>  location          | text         | not null
>
> (Indexes on state,county,township,range,apinum)
>
> Find duplicates based on
> state,county,township_n||township_d,range_n||range_d,section
>
> select state,county,township_n||township_d as township,range_n||range_d
> as range,section,location,count(*) from master where state='OK' and
> county='MAJOR' group by state,county,township,range,section,location
> having count(*)>1;
>
> Works great, is efficient: now, how do I get the apinum associated with
> each of the limited duplicate rows (i.e., those returned in the above
> query). I need the apinum because another system involved only knows
> about apinum.
>
> It must be simple, but I have failed miserably in finding the answer.
> The following horrible query works, but it takes forever for even the
> small sample above.
>
>  explain select m.apinum,m.township_n||m.township_d as
> township,m.range_n||m.range_d as range,m.section,m.location from master
> m where (select count(*) from master m1 where
> m.township_n||m.township_d=m1.township_n||m1.township_d and
> m.range_n||m.range_d=m1.range_n||m1.range_d and m.section=m1.section and
> m.location=m1.location and m1.state='OK' and m1.county='MAJOR') > 1 and
> m.state='OK' and m.county='MAJOR' order by township,range,section;
> NOTICE:  QUERY PLAN:
>
> Sort  (cost=7040332.16..7040332.16 rows=450 width=55)
>   ->  Index Scan using mstcty on master m  (cost=0.00..7040312.32
> rows=450 width=55)
>         SubPlan
>           ->  Aggregate  (cost=5208.86..5208.86 rows=1 width=0)
>                 ->  Index Scan using mstcty on master m1
> (cost=0.00..5208.85 rows=1 width=0)
>
> EXPLAIN
>
> So, somebody please enlighten me. All help is appreciated.
>
> Don
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)