Thread: Finding uniques across a big join

Finding uniques across a big join

From
"John D. Burger"
Date:
I could use some help with the following:

I have a database of geographic entities with attributes spread across
several tables.  I need to determine which entities are unique with
respect to some of those attributes.  I'm using the following query:

select p2.gazPlaceID from
   (select p1.featureType, n1.placeNameID, c1.containerID
     from gazPlaces as p1
       join gazNamings as n1 using (gazPlaceID)
       join gazContainers as c1 using (gazPlaceID)
     group by p1.featureType, n1.placeNameID, c1.containerID
     having count(*) = 1) as uniqs,
   gazPlaces as p2
     join gazNamings as n2 using (gazPlaceID)
     join gazContainers as c2 using (gazPlaceID)
   where uniqs.featureType = p2.featureType
   and uniqs.placeNameID = n2.placeNameID
   and uniqs.containerID = c2.containerID;

The basic idea is to compute featureType-placeNameID-containerID
combinations with a three-way join, determine which of those have a
count of 1, and then join that back to the same three-way join to get
the gazPlaceIDs corresponding to the unique combos (whew!).

gazPlaces has about 6M entries, gazNamings and gazContainers each about
10M.  All of the fields above are integers, and I have indexes on
everything relevant, but the query still takes about eight hours.  My
question is not (necessarily) how to improve the efficiency of this
query, but whether anyone can think of a faster way to compute the
uniques.  Again, the goal is to find entries in gazPlaces that are the
only ones with their particular combination of feature type, name and
container.

Any help is appreciated!

- John Burger
   MITRE


Re: Finding uniques across a big join

From
Martijn van Oosterhout
Date:
On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote:
> I could use some help with the following:
>
> I have a database of geographic entities with attributes spread across
> several tables.  I need to determine which entities are unique with
> respect to some of those attributes.  I'm using the following query:

<snip>

If you put the gazPlaceID as a result of the uniqs subquery, that would
avoid the second lookup, right? Whether it's much faster is the
question. So something like:

select p1.gazPlaceID
     from gazPlaces as p1
       join gazNamings as n1 using (gazPlaceID)
       join gazContainers as c1 using (gazPlaceID)
     group by p1.gazPlaceID, p1.featureType, n1.placeNameID, c1.containerID
     having count(*) = 1

Secondly, what does the plan look like? Is it materialising or sorting
at any stage?

Finally, what version of postgres?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Finding uniques across a big join

From
"Jim C. Nasby"
Date:
It will probably be a win to come up with a list of potential records
from each table first, instead of after doing the 3-way join. So
something like:

(SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING count(*)=1)
JOIN
(SELECT ...)

If you post the output of explain (or explain analyze is even better)
then people could probably make better suggestions.

On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote:
> I could use some help with the following:
>
> I have a database of geographic entities with attributes spread across
> several tables.  I need to determine which entities are unique with
> respect to some of those attributes.  I'm using the following query:
>
> select p2.gazPlaceID from
>   (select p1.featureType, n1.placeNameID, c1.containerID
>     from gazPlaces as p1
>       join gazNamings as n1 using (gazPlaceID)
>       join gazContainers as c1 using (gazPlaceID)
>     group by p1.featureType, n1.placeNameID, c1.containerID
>     having count(*) = 1) as uniqs,
>   gazPlaces as p2
>     join gazNamings as n2 using (gazPlaceID)
>     join gazContainers as c2 using (gazPlaceID)
>   where uniqs.featureType = p2.featureType
>   and uniqs.placeNameID = n2.placeNameID
>   and uniqs.containerID = c2.containerID;
>
> The basic idea is to compute featureType-placeNameID-containerID
> combinations with a three-way join, determine which of those have a
> count of 1, and then join that back to the same three-way join to get
> the gazPlaceIDs corresponding to the unique combos (whew!).
>
> gazPlaces has about 6M entries, gazNamings and gazContainers each about
> 10M.  All of the fields above are integers, and I have indexes on
> everything relevant, but the query still takes about eight hours.  My
> question is not (necessarily) how to improve the efficiency of this
> query, but whether anyone can think of a faster way to compute the
> uniques.  Again, the goal is to find entries in gazPlaces that are the
> only ones with their particular combination of feature type, name and
> container.
>
> Any help is appreciated!
>
> - John Burger
>   MITRE
>
>

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Finding uniques across a big join

From
"John D. Burger"
Date:
On Nov 30, 2005, at 01:55, Martijn van Oosterhout wrote:

> On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote:
>> I could use some help with the following:
>>
>> I have a database of geographic entities with attributes spread across
>> several tables.  I need to determine which entities are unique with
>> respect to some of those attributes.  I'm using the following query:
>
> <snip>
>
> If you put the gazPlaceID as a result of the uniqs subquery, that would
> avoid the second lookup, right? Whether it's much faster is the
> question. So something like:
>
> select p1.gazPlaceID
>      from gazPlaces as p1
>        join gazNamings as n1 using (gazPlaceID)
>        join gazContainers as c1 using (gazPlaceID)
>      group by p1.gazPlaceID, p1.featureType, n1.placeNameID,
> c1.containerID
>      having count(*) = 1

The problem is that then every row is unique, because gazPlaceID is a
primary key.  As far as I can see, I need to group on just the other
three columns - they're the ones for which I'm interested in
uniqueness.

> Secondly, what does the plan look like? Is it materialising or sorting
> at any stage?
> Finally, what version of postgres?

Version is 7.4.8 for Solaris.  Below is (a version of) the query again,
as well as the plan.  No materialization, I think, but it appears to be
sorting the first three-way join to do the counts, then sorting the
second one to merge.  Cost estimates are way off, as the final result
has almost 10M rows, but everything is analyzed, with indexes on every
column, although none of them get used.

Again, any suggestions on tweaking this query, or a completely
different approach to finding the entities with unique combinations,
would be much appreciated.

- John Burger
   MITRE

select p2.gazPlaceID, u.*
   into table tempCanonical_nameMatchEquiver_3435_1
   from
   (select p1.featureType, n1.placeNameID, c1.containerID
     from gazPlaces as p1
       join gazNamings as n1 using (gazPlaceID)
         join gazContainers as c1 using (gazPlaceID)
     group by p1.featureType, n1.placeNameID, c1.containerID
     having count(*) = 1) as u,
   gazPlaces as p2
     join gazNamings as n2 using (gazPlaceID)
       join gazContainers as c2 using (gazPlaceID)
   where u.featureType = p2.featureType
   and u.placeNameID = n2.placeNameID
   and u.containerID = c2.containerID;

  Hash Join  (cost=4009535.96..4316595.92 rows=306 width=16)
    Hash Cond: (("outer".gazplaceid = "inner".gazplaceid) AND ("outer".containerid = "inner".containerid))
    ->  Seq Scan on gazcontainers c2  (cost=0.00..141636.45 rows=9193945 width=8)
    ->  Hash  (cost=4006472.81..4006472.81 rows=282029 width=20)
          ->  Merge Join  (cost=3777226.54..4006472.81 rows=282029 width=20)
                Merge Cond: (("outer".featuretype = "inner".featuretype) AND ("outer".placenameid = "inner".placenameid))
                ->  Subquery Scan u  (cost=2107001.20..2259698.67 rows=5552635 width=12)
                      ->  GroupAggregate  (cost=2107001.20..2204172.32 rows=5552635 width=12)
                            Filter: (count(*) = 1)
                            ->  Sort  (cost=2107001.20..2120882.79 rows=5552635 width=12)
                                  Sort Key: p1.featuretype, n1.placenameid, c1.containerid
                                  ->  Hash Join  (cost=688064.17..1217844.46 rows=5552635 width=12)
                                        Hash Cond: ("outer".gazplaceid = "inner".gazplaceid)
                                        ->  Seq Scan on gaznamings n1  (cost=0.00..156331.05 rows=10147805 width=8)
                                        ->  Hash  (cost=642816.39..642816.39 rows=6128713 width=16)
                                              ->  Hash Join  (cost=160244.91..642816.39 rows=6128713 width=16)
                                                    Hash Cond: ("outer".gazplaceid = "inner".gazplaceid)
                                                    ->  Seq Scan on gazcontainers c1  (cost=0.00..141636.45 rows=9193945 width=8)
                                                    ->  Hash  (cost=120982.13..120982.13 rows=6128713 width=8)
                                                          ->  Seq Scan on gazplaces p1  (cost=0.00..120982.13 rows=6128713 width=8)
                ->  Sort  (cost=1670225.33..1685547.11 rows=6128713 width=16)
                      Sort Key: p2.featuretype, n2.placenameid
                      ->  Hash Join  (cost=160244.91..684040.19 rows=6128713 width=16)
                            Hash Cond: ("outer".gazplaceid = "inner".gazplaceid)
                            ->  Seq Scan on gaznamings n2  (cost=0.00..156331.05 rows=10147805 width=8)
                            ->  Hash  (cost=120982.13..120982.13 rows=6128713 width=8)
                                  ->  Seq Scan on gazplaces p2  (cost=0.00..120982.13 rows=6128713 width=8)


Re: Finding uniques across a big join

From
"John D. Burger"
Date:
Jim C. Nasby wrote:

> It will probably be a win to come up with a list of potential records
> from each table, instead of after doing the 3-way join. so something
> like:
>
> (SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING
> count(*)=1)
> JOIN
> (SELECT ...)

Hmm, not sure I understand.  Joining the uniques wrt each of the three
columns does not result in the unique triples, or even a superset of
them, at least in this case.

> If you post the output of explain (or explain analyze is even better)
> then people could probably make better suggestions.

Okay, I just posted the query plan. I will try to run an EXPLAIN
ANALYZE tonight.

Again, I'm also interested in completely different approaches to
discovering the entities with unique attribute combinations.
Intuitively, the query I posted is doing "too much work", because it's
computing the total count for each combo, when all I really need is to
know if the count is 1 or greater than 1.
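
[Editor's note: the "1 vs. more than 1" intuition can be sketched procedurally. Below is a hypothetical one-pass version in Python (function and variable names invented); it remembers at most one ID per combination, with a tombstone once a second row appears, and assumes the joined rows are distinct per place and combo.]

```python
# One pass over the joined rows: never count past 2 for any combination.
def unique_combos(rows):
    """rows: iterable of (place_id, feature_type, name_id, container_id)."""
    seen = {}  # combo -> its only place_id, or None once a second one appears
    for pid, ftype, nameid, contid in rows:
        combo = (ftype, nameid, contid)
        seen[combo] = None if combo in seen else pid
    return sorted(pid for pid in seen.values() if pid is not None)

rows = [(1, 10, 100, 7), (2, 10, 100, 7), (3, 20, 300, 8)]
print(unique_combos(rows))  # → [3]
```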

Thanks!

- John Burger
   MITRE


Re: Finding uniques across a big join

From
Martijn van Oosterhout
Date:
On Wed, Nov 30, 2005 at 01:20:19PM -0500, John D. Burger wrote:
> >select p1.gazPlaceID
> >     from gazPlaces as p1
> >       join gazNamings as n1 using (gazPlaceID)
> >       join gazContainers as c1 using (gazPlaceID)
> >     group by p1.gazPlaceID, p1.featureType, n1.placeNameID,
> >c1.containerID
> >     having count(*) = 1
>
> The problem is that then every row is unique, because gazPlaceID is a
> primary key.  As far as I can see, I need to group on just the other
> three columns - they're the ones for which I'm interested in
> uniqueness.

AIUI, according to the JOIN conditions in the query you have
n1.gazPlaceID = c1.gazPlaceID = p1.gazPlaceID so grouping by one of
those shouldn't affect the query result.

Are the tables wide? Maybe you're losing a lot of time transferring
data you don't need. Other than that I can't think of any neat
tricks...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


Re: Finding uniques across a big join

From
Scott Marlowe
Date:
On Tue, 2005-11-29 at 20:58, John D. Burger wrote:
> I could use some help with the following:
>
> I have a database of geographic entities with attributes spread across
> several tables.  I need to determine which entities are unique with
> respect to some of those attributes.  I'm using the following query:
>
> select p2.gazPlaceID from
>    (select p1.featureType, n1.placeNameID, c1.containerID
>      from gazPlaces as p1
>        join gazNamings as n1 using (gazPlaceID)
>        join gazContainers as c1 using (gazPlaceID)
>      group by p1.featureType, n1.placeNameID, c1.containerID
>      having count(*) = 1) as uniqs,
>    gazPlaces as p2
>      join gazNamings as n2 using (gazPlaceID)
>      join gazContainers as c2 using (gazPlaceID)
>    where uniqs.featureType = p2.featureType
>    and uniqs.placeNameID = n2.placeNameID
>    and uniqs.containerID = c2.containerID;
>
> The basic idea is to compute featureType-placeNameID-containerID
> combinations with a three-way join, determine which of those have a
> count of 1, and then join that back to the same three-way join to get
> the gazPlaceIDs corresponding to the unique combos (whew!).
>
> gazPlaces has about 6M entries, gazNamings and gazContainers each about
> 10M.  All of the fields above are integers, and I have indexes on
> everything relevant, but the query still takes about eight hours.  My
> question is not (necessarily) how to improve the efficiency of this
> query, but whether anyone can think of a faster way to compute the
> uniques.  Again, the goal is to find entries in gazPlaces that are the
> only ones with their particular combination of feature type, name and
> container.

OK, let's assume that the basic part of it, before the group by, has
been put into a view, so we can then do:

select pkey1, field2, field3, field4 from view;

And we know that pkey1 is unique, but we want the records where pkey1 is
the only thing different between them, right?

select
    v1.pkey1,
    v1.field2,
    v1.field3,
    v1.field4,
    v2.pkey1,
    v2.field2,
    v2.field3,
    v2.field4
from
    view v1
join
    view v2
on (
    v1.field2=v2.field2 and
    v1.field3=v2.field3 and
    v1.field4=v2.field4 and
    v1.pkey1<>v2.pkey1
)

How does that work?


Re: Finding uniques across a big join

From
"John D. Burger"
Date:
Scott Marlowe wrote:

> OK, let's assume that the basic part of it, before the group by, has
> been put into a view, so we can then do:
>
> select pkey1, field2, field3, field4 from view;
>
> And we know that pkey1 is unique, but we want the records where pkey1
> is
> the only thing different between them, right?

Hmm, I'm explaining this really badly :).  I should have defined a view
like you suggest to help simplify it.  What I want is the pkeys (and
the field values) where no other pkey has that triple of field values.
That's why my earlier query does a group by the fields and then having
count(*) = 1.  Also, FWIW, pkey1 is unique in its original table, but
not in the view, since some of the other tables are one-to-many.

> select
>     v1.pkey1,
>     v1.field2,
>     v1.field3,
>     v1.field4,
>     v2.pkey1,
>     v2.field2,
>     v2.field3,
>     v2.field4
> from
>     view v1
> join
>     view v2
> on (
>     v1.field2=v2.field2 and
>     v1.field3=v2.field3 and
>     v1.field4=v2.field4 and
>     v1.pkey1<>v2.pkey1
> )
>
> How does that work?

Won't this be a massive cross product of all pkey pairs that have the
same field values?

Here's what I'm currently using, in terms of your very helpful view:

select v1.pkey1, v1.field2, v1.field3, v1.field4
   from view as v1
   join
   (select v2.field2, v2.field3, v2.field4
    from view as v2
    group by v2.field2, v2.field3, v2.field4
    having count(*) = 1) as u
   using (field2, field3, field4);

This is the one that takes eight hours. :(  Another way to express what
I want is this:

select v1.pkey1, v1.field2, v1.field3, v1.field4
   from view as v1
   where not exists
     (select true from view as v2
      where v1.field2 = v2.field2
        and v1.field3 = v2.field3
        and v1.field4 = v2.field4
        and v1.pkey1 <> v2.pkey1);

That looks like a horrible nested loop, but I suppose I should try it
to make sure it is indeed slower than the previous query.
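
[Editor's note: the NOT EXISTS formulation is easy to sanity-check on small data. A sketch with SQLite, modeling the view as a plain table with invented rows:]

```python
import sqlite3

# Sanity check of the NOT EXISTS formulation on invented data:
# rows 1 and 2 share (field2, field3, field4); row 3 does not.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE v (pkey1 INTEGER PRIMARY KEY,
                field2 INTEGER, field3 INTEGER, field4 INTEGER);
INSERT INTO v VALUES (1, 10, 20, 30), (2, 10, 20, 30), (3, 11, 21, 31);
""")
rows = con.execute("""
SELECT v1.pkey1, v1.field2, v1.field3, v1.field4
  FROM v AS v1
 WHERE NOT EXISTS
       (SELECT 1 FROM v AS v2
         WHERE v1.field2 = v2.field2
           AND v1.field3 = v2.field3
           AND v1.field4 = v2.field4
           AND v1.pkey1 <> v2.pkey1)
""").fetchall()
print(rows)  # → [(3, 11, 21, 31)]
```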

- John Burger
   MITRE


Re: Finding uniques across a big join

From
Scott Marlowe
Date:
On Wed, 2005-11-30 at 16:27, John D. Burger wrote:
> Scott Marlowe wrote:
>
> > select
> >     v1.pkey1,
> >     v1.field2,
> >     v1.field3,
> >     v1.field4,
> >     v2.pkey1,
> >     v2.field2,
> >     v2.field3,
> >     v2.field4
> > from
> >     view v1
> > join
> >     view v2
> > on (
> >     v1.field2=v2.field2 and
> >     v1.field3=v2.field3 and
> >     v1.field4=v2.field4 and
> >     v1.pkey1<>v2.pkey1
> > )
> >
> > How does that work?
>
> Won't this be a massive cross product of all pkey pairs that have the
> same field values?
>

Yes, assuming there are a lot of them.  OTOH, if there are only a few
duplicates you're looking for...

How many are you expecting, percentage wise, to get back?

> Here's what I'm currently using, in terms of your very helpful view:
>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
>    from view as v1
>    join
>    (select v2.field2, v2.field3, v2.field4
>     from view as v2
>     group by v2.field2, v2.field3, v2.field4
>     having count(*) = 1) as u
>    using (field2, field3, field4);
>
> This is the one that takes eight hours. :(  Another way to express what
> I want is this:
>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
>    from view as v1
>    where not exists
>      (select true from view as v2
>       where v1.field2 = v2.field2
>         and v1.field3 = v2.field3
>         and v1.field4 = v2.field4
>         and v1.pkey1 <> v2.pkey1);
>
> That looks like a horrible nested loop, but I suppose I should try it
> to make sure it is indeed slower than the previous query.

If you can allocate enough shared memory for the set to fit in memory,
you might be able to get a hash agg method, which is much faster than
most other methods for this kind of thing, since it requires no sort.

As a side point, I'm currently mushing 888,000,000 six-character codes up
against each other to check for duplicates.  I have 6 machines doing
this, at 1 million codes compared to 1 million codes every 0.5 seconds
aggregate.  That gets me down to about 1 week.  So, 8 hours seems
quite fast.  :)

Re: Finding uniques across a big join

From
"Jim C. Nasby"
Date:
On Wed, Nov 30, 2005 at 01:29:17PM -0500, John D. Burger wrote:
> Jim C. Nasby wrote:
>
> >It will probably be a win to come up with a list of potential records
> >from each table, instead of after doing the 3-way join. so something
> >like:
> >
> >(SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING
> >count(*)=1)
> >JOIN
> >(SELECT ...)
>
> Hmm, not sure I understand.  Joining the uniques wrt each of the three
> columns does not result in the unique triples, or even a superset of
> them, at least in this case.

So featureType is *unique* within gazPlaces?

If you look at the query plan, it's estimating 120k to scan one of the
tables, but if you look at the hash join that joins the 3rd table to the
first two it's estimating 1.2M. That's because it's dealing with a
monsterous dataset; it's joining every single row with every other
possible row.

What I was trying to do was eliminate rows from that join that would
mean you couldn't have a count of only 1. But my example actually
changes the output, so that won't work.

> >If you post the output of explain (or explain analyze is even better)
> >then people could probably make better suggestions.
>
> Okay, I just posted the query plan. I will try to run an EXPLAIN
> ANALYZE tonight.
>
> Again, I'm also interested in completely different approaches to
> discovering the entities with unique attribute combinations.
> Intuitively, the query I posted is doing "too much work", because it's
> computing the total count for each combo, when all I really need is to
> know if the count is 1 or greater than 1.

The problem I see is that as you add more criteria (which is what each
JOIN does), you reduce the count that you might get for every
combination. Of course adding more rows increases the potential count.
Combine the two and I don't think there's any other way to come up with
the list of what's truly unique.

What I'm not sure of is if knowing certain things about gazPlaceID would
allow you to make some assumptions that could simplify the query. I
can't think of anything that would help there.

Basically, I think you're stuck with the ugly 3-way join that's the core
of the slowness. But, there is a gain to be had: doing the second set of
joins bumps you from 1.2M to over 4M. Someone else suggested adding
gazPlaceID to the GROUP BY; I definitely think you should do that.
Because of the equality conditions I can't conceive of any way that
it would change anything. What *might* make a difference is which table
you pull gazPlaceID from; again, because of the equality condition it
doesn't matter which table you get them from, but I don't know if the
optimizer understands that. So changing which table you pull that from
could drastically affect how the 3-way join is done, which could also
drastically affect the cost involved. Ideally you'd like the first 2-way
join to be the one that results in the smallest amount of overlap.

Also, if you're on 8.1, you might try experimenting with different
indexes on the 3 tables. I think this could be an ideal candidate for a
bitmap scan; I believe it's possible to build the entire set you want
with a scan of 6 bitmaps, but I don't know if there's that much brains
in the bitmap code yet. Actually, I think what would be ideal would be
to build 3 bitmaps for the 3 fields you really want to group by and
compare those; the trick would be building those bitmaps using indexes
of gazPlaceID and the field you care about from each table.

BTW, it might be easier for someone to come up with a solution if you
post a plain-English description of what you're trying to find out,
along with any facts about the data (such as what's unique, etc).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Finding uniques across a big join

From
"John D. Burger"
Date:
Scott Marlowe wrote:

>> Won't this be a massive cross product of all pkey pairs that have the
>> same field values?
>
> Yes, assuming there are a lot of them.  OTOH, if there are only a few
> duplicates you're looking for...

I'm not looking for duplicates, I'm looking for uniques - note the
Subject line :).  Here's an analogous problem:  everybody in the room
has eye color, hair color, and skin color.  I want to find the people
whose particular three-way combination is unique - no one else has that
eye-hair-skin combo.  The analogue to my current query is then like
this:

   select p1.personID, p1.eyeColor, p1.hairColor, p1.skinColor
     from persons as p1
     join
     (select p2.eyeColor, p2.hairColor, p2.skinColor
       from persons as p2
       group by p2.eyeColor, p2.hairColor, p2.skinColor
       having count(*) = 1) as u
     using (eyeColor, hairColor, skinColor);

The inner select finds the unique combinations, the outer one goes back
and finds the peopleID corresponding to each unique combo.  And the
persons table is actually a view on a big three-way join.
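
[Editor's note: the analogy is small enough to run end-to-end. A sketch with SQLite and invented data; the subquery is given an alias, as PostgreSQL requires.]

```python
import sqlite3

# The eye/hair/skin analogy on invented data: persons 1 and 2 share a
# combination; person 3 is the only one with theirs.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE persons (personID INTEGER PRIMARY KEY,
                      eyeColor TEXT, hairColor TEXT, skinColor TEXT);
INSERT INTO persons VALUES
  (1, 'blue',  'brown', 'fair'),
  (2, 'blue',  'brown', 'fair'),
  (3, 'green', 'black', 'olive');
""")
rows = con.execute("""
SELECT p1.personID, p1.eyeColor, p1.hairColor, p1.skinColor
  FROM persons AS p1
  JOIN (SELECT eyeColor, hairColor, skinColor
          FROM persons
         GROUP BY eyeColor, hairColor, skinColor
        HAVING count(*) = 1) AS u
 USING (eyeColor, hairColor, skinColor)
""").fetchall()
print(rows)  # → [(3, 'green', 'black', 'olive')]
```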

Jim Nasby wrote:

> Someone else suggested adding gazPlaceID to the GROUP BY; I definately
> think you should do that.

That changes the semantics of what I want.  If I group by personID
above, then every FOUR-way combo is of course unique.  What I'd like to
do is group by the three attributes, and select for personID as well.
But of course you can't select for columns you haven't grouped by.

Sorry, I can't think of any other ways to explain what I'm doing.  But
thank you for your replies.

- John Burger
   MITRE


Re: Finding uniques across a big join

From
Bruno Wolff III
Date:
On Wed, Nov 30, 2005 at 20:44:30 -0500,
  "John D. Burger" <john@mitre.org> wrote:
>
> That changes the semantics of what I want.  If I group by personID
> above, then every FOUR-way combo is of course unique.  What I'd like to
> do is group by the three attributes, and select for personID as well.
> But of course you can't select for columns you haven't grouped by.

Assuming that personID is an ordered type, you can select max(personID)
in the GROUP BY and save the join at the end.

Re: Finding uniques across a big join

From
"John D. Burger"
Date:
Bruno Wolff III wrote:

>> That changes the semantics of what I want.  If I group by personID
>> above, then every FOUR-way combo is of course unique.  What I'd like
>> to
>> do is group by the three attributes, and select for personID as well.
>> But of course you can't select for columns you haven't grouped by.
>
> Assuming that personID is an ordered type, you can select max(personID)
> in the GROUP BY and save the join at the end.

I'm not sure what this means - do you mean:

select p2.eyeColor, p2.hairColor, p2.skinColor
   from persons as p2
   group by max(p2.personID), p2.eyeColor, p2.hairColor, p2.skinColor
   having count(*) = 1;

I don't know what that does.  If you mean:

select max(p2.personID), p2.eyeColor, p2.hairColor, p2.skinColor
   from persons as p2
   group by p2.personID, p2.eyeColor, p2.hairColor, p2.skinColor
   having count(*) = 1;

then I don't think that works either - if I include personID in the
GROUP BY, then the COUNT doesn't do what I want, right?  I just want
uniques wrt the three attribute fields.  If I group by personID, then
personID counts towards uniqueness.

Thanks for all the suggestions, folks.

- John Burger
   MITRE


Re: Finding uniques across a big join

From
Marcin Inkielman
Date:
John D. Burger wrote:

>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
>   from view as v1
>   join
>   (select v2.field2, v2.field3, v2.field4
>     from view as v2
>     group by v2.field2, v2.field3, v2.field4
>     having count(*) = 1) as u
>   using (field2, field3, field4);
>
> This is the one that takes eight hours. :(  Another way to express
> what I want is this:
>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
>   from view as v1
>   where not exists
>     (select true from view as v2
>       where v1.field2 = v2.field2
>        and v1.field3 = v2.field3
>        and v1.field4 = v2.field4
>        and v1.pkey1 <> v2.pkey1);
>
> That looks like a horrible nested loop, but I suppose I should try it
> to make sure it is indeed slower than the previous query.
>
Hi!

Did you try the second query? I guess it should take considerably less
time than the first one. Usually I do "these things" like this...
This is the only way for the planner to use indexes. The query
plan you sent us shows that mostly seq scans are used.

Regards,

Marcin Inkielman

Re: Finding uniques across a big join

From
Greg Stark
Date:
"John D. Burger" <john@mitre.org> writes:

> I don't know what that does.  If you mean:
>
> select max(p2.personID), p2.eyeColor, p2.hairColor, p2.skinColor
>    from persons as p2
>    group by p2.personID, p2.eyeColor, p2.hairColor, p2.skinColor
>    having count(*) = 1;
>
> then I don't think that works either - if I include personID in the GROUP BY,

select max(personid) as personid, eyecolor, haircolor, skincolor
  from persons
 group by eyecolor, haircolor, skincolor
having count(*) = 1

--
greg

Re: Finding uniques across a big join

From
"John D. Burger"
Date:
Greg Stark wrote:

> select max(personid) as personid, eyecolor, haircolor, skincolor
>   from persons
>  group by eyecolor, haircolor, skincolor
> having count(*) = 1

Aha, I understand Bruno's suggestion now!  I was actually trying to
think of some way of using an aggregate on personid, but couldn't
figure it out.  Of course max does just what I want in this case, since
max on one value gives that value - min and some other aggregate
functions would work too.  Very clever!

On my actual problem, where "persons" is actually three joined tables,
my original query took eight hours.  The new query, modeled after the
above, runs in almost exactly a tenth of the time.
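
[Editor's note: the two formulations can be checked against each other on toy data. A sketch with SQLite and invented rows; max() works here because a group passing HAVING count(*) = 1 contains exactly one personID, so max() simply returns it.]

```python
import sqlite3

# Compare the original join-back formulation with the max() aggregate
# trick on invented data; both should find only person 3.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE persons (personID INTEGER PRIMARY KEY,
                      eyeColor TEXT, hairColor TEXT, skinColor TEXT);
INSERT INTO persons VALUES
  (1, 'blue',  'brown', 'fair'),
  (2, 'blue',  'brown', 'fair'),
  (3, 'green', 'black', 'olive');
""")

join_back = con.execute("""
SELECT p1.personID FROM persons AS p1
  JOIN (SELECT eyeColor, hairColor, skinColor FROM persons
         GROUP BY eyeColor, hairColor, skinColor
        HAVING count(*) = 1) AS u
 USING (eyeColor, hairColor, skinColor)
""").fetchall()

one_pass = con.execute("""
SELECT max(personID) AS personID FROM persons
 GROUP BY eyeColor, hairColor, skinColor
HAVING count(*) = 1
""").fetchall()

print(join_back, one_pass)  # both → [(3,)]
```

The one-pass version avoids the second three-way join entirely, which matches the roughly tenfold speedup reported above.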

Thanks for all the suggestions!

- John D. Burger
   MITRE