Re: Finding uniques across a big join - Mailing list pgsql-general
From | John D. Burger |
---|---|
Subject | Re: Finding uniques across a big join |
Date | |
Msg-id | fa649e52f7c458b6eca83789d7648ad2@mitre.org Whole thread Raw |
In response to | Re: Finding uniques across a big join (Martijn van Oosterhout <kleptog@svana.org>) |
Responses |
Re: Finding uniques across a big join
|
List | pgsql-general |
On Nov 30, 2005, at 01:55, Martijn van Oosterhout wrote: > On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote: >> I could use some help with the following: >> >> I have a database of geographic entities with attributes spread across >> several tables. I need to determine which entities are unique with >> respect to some of those attributes. I'm using the following query: > > <snip> > > If you put the gazPlaceID as a result of the uniqs subquery, that would > avoid the second lookup, right? Whether it's much faster is the > question. So something like: > > select p1.gazPlaceID > from gazPlaces as p1 > join gazNamings as n1 using (gazPlaceID) > join gazContainers as c1 using (gazPlaceID) > group by p1.gazPlaceID, p1.featureType, n1.placeNameID, > c1.containerID > having count(*) = 1 The problem is that then every row is unique, because gazPlaceID is a primary key. As far as I can see, I need to group on just the other three columns - they're the ones for which I'm interested in uniqueness. > Secondly, what does the plan look like? Is it materialising or sorting > at any stage? > Finally, what version of postgres? Version is 7.4.8 for Solaris. Below is (a version of) the query again, as well as the plan. No materialization, I think, but it appears to be sorting the first three-way join to do the counts, then sorting the second one to merge. Cost estimates are way off, as the final result has almost 10M rows, but everything is analyzed, with indexes on every column, although none of them get used. Again, any suggestions on tweaking this query, or a completely different approach to finding the entities with unique combinations, would be much appreciated. - John Burger MITRE select p2.gazPlaceID, u.* into table tempCanonical_nameMatchEquiver_3435_1 from (select p1.featureType, n1.placeNameID, c1.containerID from gazPlaces as p1 join gazNamings as n1 using (gazPlaceID) join gazContainers as c1 using (gazPlaceID) group by p1.featureType, n1.placeNameID, c1.containerID having count(*) = 1) as u, gazPlaces as p2 join gazNamings as n2 using (gazPlaceID) join gazContainers as c2 using (gazPlaceID) where u.featureType = p2.featureType and u.placeNameID = n2.placeNameID and u.containerID = c2.containerID; Hash Join (cost=4009535.96..4316595.92 rows=306 width=16) Hash Cond: (("outer".gazplaceid = "inner".gazplaceid) AND ("outer".containerid = "inner".containerid)) -> Seq Scan on gazcontainers c2 (cost=0.00..141636.45 rows=9193945 width=8) -> Hash (cost=4006472.81..4006472.81 rows=282029 width=20) -> Merge Join (cost=3777226.54..4006472.81 rows=282029 width=20) Merge Cond: (("outer".featuretype = "inner".featuretype) AND ("outer".placenameid = "inner".placenameid)) -> Subquery Scan u (cost=2107001.20..2259698.67 rows=5552635 width=12) -> GroupAggregate (cost=2107001.20..2204172.32 rows=5552635 width=12) Filter: (count(*) = 1) -> Sort (cost=2107001.20..2120882.79 rows=5552635 width=12) Sort Key: p1.featuretype, n1.placenameid, c1.containerid -> Hash Join (cost=688064.17..1217844.46 rows=5552635 width=12) Hash Cond: ("outer".gazplaceid = "inner".gazplaceid) -> Seq Scan on gaznamings n1 (cost=0.00..156331.05 rows=10147805 width=8) -> Hash (cost=642816.39..642816.39 rows=6128713 width=16) -> Hash Join (cost=160244.91..642816.39 rows=6128713 width=16) Hash Cond: ("outer".gazplaceid = "inner".gazplaceid) -> Seq Scan on gazcontainers c1 (cost=0.00..141636.45 rows=9193945 width=8) -> Hash (cost=120982.13..120982.13 rows=6128713 width=8) -> Seq Scan on gazplaces p1 (cost=0.00..120982.13 rows=6128713 width=8) -> Sort (cost=1670225.33..1685547.11 rows=6128713 width=16) Sort Key: p2.featuretype, n2.placenameid -> Hash Join (cost=160244.91..684040.19 rows=6128713 width=16) Hash Cond: ("outer".gazplaceid = "inner".gazplaceid) -> Seq Scan on gaznamings n2 (cost=0.00..156331.05 rows=10147805 width=8) -> Hash (cost=120982.13..120982.13 rows=6128713 width=8) -> Seq Scan on gazplaces p2 (cost=0.00..120982.13 rows=6128713 width=8)
pgsql-general by date: