Thread: limited field duplicates
Hello, I am trying to find rows in a table that are duplicates only in the sense of certain columns; i.e, not all columns are the same, and then for each duplicate entry, I want to retrieve one of the columns that is not part of the duplication criteria. Ok, enough English description, here is the real information (simplified). Table "master" --~925000 records Column | Type | Modifiers -------------------+--------------+----------- apinum | text | not null state | character(2) | not null county | text | not null township_n | integer | township_d | character(1) | range_n | integer | range_d | character(1) | section | integer | location | text | not null (Indexes on state,county,township,range,apinum) Find duplicates based on state,county,township_n||township_d,range_n||range_d,section select state,county,township_n||township_d as township,range_n||range_d as range,section,location,count(*) from master where state='OK' and county='MAJOR' group by state,county,township,range,section,location having count(*)>1; Works great, is efficient: now, how do I get the apinum associated with each of the limited duplicate rows (i.e., those returned in the above query). I need the apinum because another system involved only knows about apinum. It must be simple, but I have failed miserably in finding the answer. The following horrible query works, but it takes forever for even the small sample above. explain select m.apinum,m.township_n||m.township_d as township,m.range_n||m.range_d as range,m.section,m.location from master m where (select count(*) from master m1 where m.township_n||m.township_d=m1.township_n||m1.township_d and m.range_n||m.range_d=m1.range_n||m1.range_d and m.section=m1.section and m.location=m1.location and m1.state='OK' and m1.county='MAJOR') > 1 and m.state='OK' and m.county='MAJOR' order by township,range,section; NOTICE: QUERY PLAN: Sort (cost=7040332.16..7040332.16 rows=450 width=55) -> Index Scan using mstcty on master m (cost=0.00..7040312.32 rows=450 width=55) SubPlan -> Aggregate (cost=5208.86..5208.86 rows=1 width=0) -> Index Scan using mstcty on master m1 (cost=0.00..5208.85 rows=1 width=0) EXPLAIN So, somebody please enlighten me. All help is appreciated. Don
Don Isgitt <djisgitt@soundenergy.com> writes: > I am trying to find rows in a table that are duplicates only in the > sense of certain columns; i.e, not all columns are the same, and then > for each duplicate entry, I want to retrieve one of the columns that is > not part of the duplication criteria. Try SELECT DISTINCT ON. There's a simple example in the SELECT reference page (something about weather reports I think). regards, tom lane
Yes! Thank you very much!! Tom Lane wrote: >Don Isgitt <djisgitt@soundenergy.com> writes: > >>I am trying to find rows in a table that are duplicates only in the >>sense of certain columns; i.e, not all columns are the same, and then >>for each duplicate entry, I want to retrieve one of the columns that is >>not part of the duplication criteria. >> > >Try SELECT DISTINCT ON. There's a simple example in the SELECT >reference page (something about weather reports I think). > > regards, tom lane > >
Here something I tried: create index master_all on master( country, state, township_n, range_n, section); select distinct m.* from master m, master m1 where m.country = m1.country and m.state = m1.state and m.township_n = m1.township_n and m.range_n = m1.range_n and m.section = m1.section and m.apinum != m1.apinum; Distinct is needed if there are more than 2 duplicates (ie triplicates etc...) to reduce de number of combinations. Make sure you vacuum analyze "master" after creating the index. It is not clear if there is a one to one relation between the _n and the _d fields. JLL Don Isgitt wrote: > > Hello, > > I am trying to find rows in a table that are duplicates only in the > sense of certain columns; i.e, not all columns are the same, and then > for each duplicate entry, I want to retrieve one of the columns that is > not part of the duplication criteria. Ok, enough English description, > here is the real information (simplified). > Table "master" --~925000 records > Column | Type | Modifiers > -------------------+--------------+----------- > apinum | text | not null > state | character(2) | not null > county | text | not null > township_n | integer | > township_d | character(1) | > range_n | integer | > range_d | character(1) | > section | integer | > location | text | not null > > (Indexes on state,county,township,range,apinum) > > Find duplicates based on > state,county,township_n||township_d,range_n||range_d,section > > select state,county,township_n||township_d as township,range_n||range_d > as range,section,location,count(*) from master where state='OK' and > county='MAJOR' group by state,county,township,range,section,location > having count(*)>1; > > Works great, is efficient: now, how do I get the apinum associated with > each of the limited duplicate rows (i.e., those returned in the above > query). I need the apinum because another system involved only knows > about apinum. > > It must be simple, but I have failed miserably in finding the answer. > The following horrible query works, but it takes forever for even the > small sample above. > > explain select m.apinum,m.township_n||m.township_d as > township,m.range_n||m.range_d as range,m.section,m.location from master > m where (select count(*) from master m1 where > m.township_n||m.township_d=m1.township_n||m1.township_d and > m.range_n||m.range_d=m1.range_n||m1.range_d and m.section=m1.section and > m.location=m1.location and m1.state='OK' and m1.county='MAJOR') > 1 and > m.state='OK' and m.county='MAJOR' order by township,range,section; > NOTICE: QUERY PLAN: > > Sort (cost=7040332.16..7040332.16 rows=450 width=55) > -> Index Scan using mstcty on master m (cost=0.00..7040312.32 > rows=450 width=55) > SubPlan > -> Aggregate (cost=5208.86..5208.86 rows=1 width=0) > -> Index Scan using mstcty on master m1 > (cost=0.00..5208.85 rows=1 width=0) > > EXPLAIN > > So, somebody please enlighten me. All help is appreciated. > > Don > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)