Re: Finding uniques across a big join - Mailing list pgsql-general
From: Jim C. Nasby
Subject: Re: Finding uniques across a big join
Date:
Msg-id: 20051201005954.GT13642@nasby.net
In response to: Re: Finding uniques across a big join ("John D. Burger" <john@mitre.org>)
List: pgsql-general
On Wed, Nov 30, 2005 at 01:29:17PM -0500, John D. Burger wrote:
> Jim C. Nasby wrote:
>
> >It will probably be a win to come up with a list of potential records
> >from each table, instead of after doing the 3-way join. so something
> >like:
> >
> >(SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING
> >count(*)=1)
> >JOIN
> >(SELECT ...)
>
> Hmm, not sure I understand.  Joining the uniques wrt each of the three
> columns does not result in the unique triples, or even a superset of
> them, at least in this case.

So featureType is *unique* within gazPlaces?

If you look at the query plan, the estimate for scanning one of the
tables is 120k, but the hash join that joins the 3rd table to the first
two is estimated at 1.2M. That's because it's dealing with a monstrous
dataset; it's joining every single row with every other possible row.
What I was trying to do was eliminate, before that join, rows that could
never end up with a count of only 1. But my example actually changes the
output, so that won't work.

> >If you post the output of explain (or explain analyze is even better)
> >then people could probably make better suggestions.
>
> Okay, I just posted the query plan.  I will try to run an EXPLAIN
> ANALYZE tonight.
>
> Again, I'm also interested in completely different approaches to
> discovering the entities with unique attribute combinations.
> Intuitively, the query I posted is doing "too much work", because it's
> computing the total count for each combo, when all I really need is to
> know if the count is 1 or greater than 1.

The problem I see is that as you add more criteria (which is what each
JOIN does), you reduce the count that you might get for every
combination, while adding more rows increases the potential count.
Combine the two and I don't think there's any other way to come up with
the list of what's truly unique.

What I'm not sure of is whether knowing certain things about gazPlaceID
would let you make some assumptions that could simplify the query, but I
can't think of anything that would help there. Basically, I think you're
stuck with the ugly 3-way join that's the core of the slowness. But
there is a gain to be had: doing the second set of joins bumps you from
1.2M to over 4M.

Someone else suggested adding gazPlaceID to the GROUP BY; I definitely
think you should do that (rough sketch below). Because of the equality
conditions I can't conceive of any way that it would change the results.
What *might* make a difference is which table you pull gazPlaceID from;
again, because of the equality conditions it doesn't matter which table
you get it from, but I don't know if the optimizer understands that. So
changing which table you pull it from could drastically affect how the
3-way join is done, which could also drastically affect the cost
involved. Ideally you'd like the first 2-way join to be the one that
results in the smallest amount of overlap.

Also, if you're on 8.1, you might try experimenting with different
indexes on the 3 tables. I think this could be an ideal candidate for a
bitmap scan; I believe it's possible to build the entire set you want
with a scan of 6 bitmaps, but I don't know if there's that much smarts
in the bitmap code yet. Actually, I think what would be ideal would be
to build 3 bitmaps for the 3 fields you really want to group by and
compare those; the trick would be building those bitmaps using indexes
on gazPlaceID and the field you care about from each table.
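To make the GROUP BY suggestion concrete, here's a rough sketch of the
query shape I have in mind, with gazPlaceID added to the SELECT and
GROUP BY. I don't have your schema in front of me, so gazNames and
gazSources (and their name/source columns) are just placeholders for
your other two tables and the two other fields you're grouping by;
substitute the real ones:

  -- gazNames/gazSources and name/source are placeholder names only;
  -- gazPlaces, gazPlaceID and featureType are from your earlier posts.
  SELECT p.gazPlaceID, p.featureType, n.name, s.source
    FROM gazPlaces  p
    JOIN gazNames   n ON n.gazPlaceID = p.gazPlaceID
    JOIN gazSources s ON s.gazPlaceID = p.gazPlaceID
   GROUP BY p.gazPlaceID, p.featureType, n.name, s.source
  HAVING count(*) = 1;
  -- Pulling gazPlaceID from n or s instead of p should give the same
  -- answer because of the equality conditions, but the planner might
  -- choose a different join order depending on which one you use.

Running EXPLAIN ANALYZE on a couple of variants (varying which table
gazPlaceID comes from) would show whether the plan actually changes.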
BTW, it might be easier for someone to come up with a solution if you
post a plain-English description of what you're trying to find out,
along with any facts about the data (such as what's unique, etc).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461