Re: Finding uniques across a big join - Mailing list pgsql-general

From John D. Burger
Subject Re: Finding uniques across a big join
Date
Msg-id fa649e52f7c458b6eca83789d7648ad2@mitre.org
Whole thread Raw
In response to Re: Finding uniques across a big join  (Martijn van Oosterhout <kleptog@svana.org>)
Responses Re: Finding uniques across a big join
List pgsql-general
On Nov 30, 2005, at 01:55, Martijn van Oosterhout wrote:

> On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote:
>> I could use some help with the following:
>>
>> I have a database of geographic entities with attributes spread across
>> several tables.  I need to determine which entities are unique with
>> respect to some of those attributes.  I'm using the following query:
>
> <snip>
>
> If you put the gazPlaceID as a result of the uniqs subquery, that would
> avoid the second lookup, right? Whether it's much faster is the
> question. So something like:
>
> select p1.gazPlaceID
>      from gazPlaces as p1
>        join gazNamings as n1 using (gazPlaceID)
>        join gazContainers as c1 using (gazPlaceID)
>      group by p1.gazPlaceID, p1.featureType, n1.placeNameID,
> c1.containerID
>      having count(*) = 1

The problem is that then every row is unique, because gazPlaceID is a
primary key.  As far as I can see, I need to group on just the other
three columns - they're the ones for which I'm interested in
uniqueness.

> Secondly, what does the plan look like? Is it materialising or sorting
> at any stage?
> Finally, what version of postgres?

Version is 7.4.8 for Solaris.  Below is (a version of) the query again,
as well as the plan.  No materialization, I think, but it appears to be
sorting the first three-way join to do the counts, then sorting the
second one to merge.  Cost estimates are way off, as the final result
has almost 10M rows, but everything is analyzed, with indexes on every
column, although none of them get used.

Again, any suggestions on tweaking this query, or a completely
different approach to finding the entities with unique combinations,
would be much appreciated.

- John Burger
   MITRE

select p2.gazPlaceID, u.*
   into table tempCanonical_nameMatchEquiver_3435_1
   from
   (select p1.featureType, n1.placeNameID, c1.containerID
     from gazPlaces as p1
       join gazNamings as n1 using (gazPlaceID)
         join gazContainers as c1 using (gazPlaceID)
     group by p1.featureType, n1.placeNameID, c1.containerID
     having count(*) = 1) as u,
   gazPlaces as p2
     join gazNamings as n2 using (gazPlaceID)
       join gazContainers as c2 using (gazPlaceID)
   where u.featureType = p2.featureType
   and u.placeNameID = n2.placeNameID
   and u.containerID = c2.containerID;

  Hash Join  (cost=4009535.96..4316595.92 rows=306 width=16)
    Hash Cond: (("outer".gazplaceid = "inner".gazplaceid) AND
("outer".containerid = "inner".containerid))
    ->  Seq Scan on gazcontainers c2  (cost=0.00..141636.45 rows=9193945
width=8)
    ->  Hash  (cost=4006472.81..4006472.81 rows=282029 width=20)
          ->  Merge Join  (cost=3777226.54..4006472.81 rows=282029
width=20)
                Merge Cond: (("outer".featuretype = "inner".featuretype)
AND ("outer".placenameid = "inner".placenameid))
                ->  Subquery Scan u  (cost=2107001.20..2259698.67
rows=5552635 width=12)
                      ->  GroupAggregate  (cost=2107001.20..2204172.32
rows=5552635 width=12)
                            Filter: (count(*) = 1)
                            ->  Sort  (cost=2107001.20..2120882.79
rows=5552635 width=12)
                                  Sort Key: p1.featuretype,
n1.placenameid, c1.containerid
                                  ->  Hash Join
(cost=688064.17..1217844.46 rows=5552635 width=12)
                                        Hash Cond: ("outer".gazplaceid =
"inner".gazplaceid)
                                        ->  Seq Scan on gaznamings n1
(cost=0.00..156331.05 rows=10147805 width=8)
                                        ->  Hash
(cost=642816.39..642816.39 rows=6128713 width=16)
                                              ->  Hash Join
(cost=160244.91..642816.39 rows=6128713 width=16)
                                                    Hash Cond:
("outer".gazplaceid = "inner".gazplaceid)
                                                    ->  Seq Scan on
gazcontainers c1  (cost=0.00..141636.45 rows=9193945 width=8)
                                                    ->  Hash
(cost=120982.13..120982.13 rows=6128713 width=8)
                                                          ->  Seq Scan
on gazplaces p1  (cost=0.00..120982.13 rows=6128713 width=8)
                ->  Sort  (cost=1670225.33..1685547.11 rows=6128713
width=16)
                      Sort Key: p2.featuretype, n2.placenameid
                      ->  Hash Join  (cost=160244.91..684040.19
rows=6128713 width=16)
                            Hash Cond: ("outer".gazplaceid =
"inner".gazplaceid)
                            ->  Seq Scan on gaznamings n2
(cost=0.00..156331.05 rows=10147805 width=8)
                            ->  Hash  (cost=120982.13..120982.13
rows=6128713 width=8)
                                  ->  Seq Scan on gazplaces p2
(cost=0.00..120982.13 rows=6128713 width=8)


pgsql-general by date:

Previous
From: frank church
Date:
Subject: How to check options PostgreSQL was started with
Next
From: "Jim C. Nasby"
Date:
Subject: Re: What is the deal with mailing lists?