Re: Finding uniques across a big join - Mailing list pgsql-general

From Jim C. Nasby
Subject Re: Finding uniques across a big join
Date 2005-12-01 00:59:54
Msg-id 20051201005954.GT13642@nasby.net
In response to Re: Finding uniques across a big join  ("John D. Burger" <john@mitre.org>)
List pgsql-general
On Wed, Nov 30, 2005 at 01:29:17PM -0500, John D. Burger wrote:
> Jim C. Nasby wrote:
>
> >It will probably be a win to come up with a list of potential records
> >from each table, instead of after doing the 3-way join. so something
> >like:
> >
> >(SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING
> >count(*)=1)
> >JOIN
> >(SELECT ...)
>
> Hmm, not sure I understand.  Joining the uniques wrt each of the three
> columns does not result in the unique triples, or even a superset of
> them, at least in this case.

So featureType is *unique* within gazPlaces?

If you look at the query plan, it's estimating 120k to scan one of the
tables, but if you look at the hash join that joins the 3rd table to the
first two, it's estimating 1.2M. That's because it's dealing with a
monstrous dataset: it's joining every single row against every possible
matching row.

What I was trying to do was eliminate, before the join, rows that
couldn't possibly end up with a count of 1. But my example actually
changes the output: filtering each table on its own column first throws
out rows whose individual attribute repeats (say, two places that are
both lakes) even though their full combination of attributes might
still be unique. So that won't work.

> >If you post the output of explain (or explain analyze is even better)
> >then people could probably make better suggestions.
>
> Okay, I just posted the query plan. I will try to run an EXPLAIN
> ANALYZE tonight.
>
> Again, I'm also interested in completely different approaches to
> discovering the entities with unique attribute combinations.
> Intuitively, the query I posted is doing "too much work", because it's
> computing the total count for each combo, when all I really need is to
> know if the count is 1 or greater than 1.

The problem I see is that as you add more criteria (which is what each
JOIN does), you reduce the count that you might get for each
combination. Of course, adding more rows increases the potential count.
Combine the two, and I don't think there's any other way to come up
with the list of what's truly unique.
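
For reference, here's roughly the shape of query I have in mind
(gazNames and gazStates, and their columns, are guesses on my part,
since I don't have your schema in front of me):

SELECT p.featureType, n.name, s.state
  FROM gazPlaces p
  JOIN gazNames  n ON n.gazPlaceID = p.gazPlaceID
  JOIN gazStates s ON s.gazPlaceID = p.gazPlaceID
 GROUP BY p.featureType, n.name, s.state
-- keep only the attribute combinations that occur exactly once
HAVING count(*) = 1;

If that's not the general shape of it, some of what follows may not
apply.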

What I'm not sure of is if knowing certain things about gazPlaceID would
allow you to make some assumptions that could simplify the query. I
can't think of anything that would help there.

Basically, I think you're stuck with the ugly 3-way join that's the core
of the slowness. But there is a gain to be had: doing the second set of
joins bumps you from 1.2M to over 4M. Someone else suggested adding
gazPlaceID to the GROUP BY; I definitely think you should do that.
Because of the equality conditions, I can't conceive of any way that it
would change the results. What *might* make a difference is which table
you pull gazPlaceID from; again, because of the equality condition it
doesn't matter which table you get it from, but I don't know if the
optimizer understands that. So changing which table you pull it from
could drastically affect how the 3-way join is done, which could also
drastically affect the cost involved. Ideally you'd like the first 2-way
join to be the one that produces the smallest intermediate result.
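
In terms of the sketch above (same caveat about the guessed names),
that means trying something like:

SELECT p.gazPlaceID, p.featureType, n.name, s.state
  FROM gazPlaces p
  JOIN gazNames  n ON n.gazPlaceID = p.gazPlaceID
  JOIN gazStates s ON s.gazPlaceID = p.gazPlaceID
 -- gazPlaceID added to the grouping, per the suggestion above
 GROUP BY p.gazPlaceID, p.featureType, n.name, s.state
HAVING count(*) = 1;

and then swapping p.gazPlaceID for n.gazPlaceID or s.gazPlaceID to see
whether the plan (and the cost) changes.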

Also, if you're on 8.1, you might try experimenting with different
indexes on the 3 tables. I think this could be an ideal candidate for a
bitmap scan; I believe it's possible to build the entire set you want
with a scan of 6 bitmaps, but I don't know if the bitmap code is that
smart yet. Actually, I think what would be ideal would be to build 3
bitmaps for the 3 fields you really want to group by and compare those;
the trick would be building those bitmaps using indexes on gazPlaceID
and the field you care about from each table.
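
Something along these lines on each table (column names guessed, as
before):

-- pair the grouped field with the join key in a two-column index
CREATE INDEX gazplaces_feat_place  ON gazPlaces (featureType, gazPlaceID);
CREATE INDEX gaznames_name_place   ON gazNames  (name, gazPlaceID);
CREATE INDEX gazstates_state_place ON gazStates (state, gazPlaceID);

That at least gives the planner the option of using bitmap index scans
on exactly the columns involved.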

BTW, it might be easier for someone to come up with a solution if you
post a plain-English description of what you're trying to find out,
along with any relevant facts about the data (such as which columns are
unique).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
