Thread: Finding uniques across a big join
I could use some help with the following: I have a database of geographic entities with attributes spread across several tables. I need to determine which entities are unique with respect to some of those attributes. I'm using the following query: select p2.gazPlaceID from (select p1.featureType, n1.placeNameID, c1.containerID from gazPlaces as p1 join gazNamings as n1 using (gazPlaceID) join gazContainers as c1 using (gazPlaceID) group by p1.featureType, n1.placeNameID, c1.containerID having count(*) = 1) as uniqs, gazPlaces as p2 join gazNamings as n2 using (gazPlaceID) join gazContainers as c2 using (gazPlaceID) where uniqs.featureType = p2.featureType and uniqs.placeNameID = n2.placeNameID and uniqs.containerID = c2.containerID; The basic idea is to compute featureType-placeNameID-containerID combinations with a three-way join, determine which of those have a count of 1, and then join that back to the same three-way join to get the gazPlaceIDs corresponding to the unique combos (whew!). gazPlaces has about 6M entries, gazNamings and gazContainers each about 10M. All of the fields above are integers, and I have indexes on everything relevant, but the query still takes about eight hours. My question is not (necessarily) how to improve the efficiency of this query, but whether anyone can think of a faster way to compute the uniques. Again, the goal is to find entries in gazPlaces that are the only ones with their particular combination of feature type, name and container. Any help is appreciated! - John Burger MITRE
On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote: > I could use some help with the following: > > I have a database of geographic entities with attributes spread across > several tables. I need to determine which entities are unique with > respect to some of those attributes. I'm using the following query: <snip> If you put the gazPlaceID as a result of the uniqs subquery, that would avoid the second lookup, right? Whether it's much faster is the question. So something like: select p1.gazPlaceID from gazPlaces as p1 join gazNamings as n1 using (gazPlaceID) join gazContainers as c1 using (gazPlaceID) group by p1.gazPlaceID, p1.featureType, n1.placeNameID, c1.containerID having count(*) = 1 Secondly, what does the plan look like? Is it materialising or sorting at any stage? Finally, what version of postgres? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
It will probably be a win to come up with a list of potential records from each table, instead of after doing the 3-way join. so something like: (SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING count(*)=1) JOIN (SELECT ...) If you post the output of explain (or explain analyze is even better) then people could probably make better suggestions. On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote: > I could use some help with the following: > > I have a database of geographic entities with attributes spread across > several tables. I need to determine which entities are unique with > respect to some of those attributes. I'm using the following query: > > select p2.gazPlaceID from > (select p1.featureType, n1.placeNameID, c1.containerID > from gazPlaces as p1 > join gazNamings as n1 using (gazPlaceID) > join gazContainers as c1 using (gazPlaceID) > group by p1.featureType, n1.placeNameID, c1.containerID > having count(*) = 1) as uniqs, > gazPlaces as p2 > join gazNamings as n2 using (gazPlaceID) > join gazContainers as c2 using (gazPlaceID) > where uniqs.featureType = p2.featureType > and uniqs.placeNameID = n2.placeNameID > and uniqs.containerID = c2.containerID; > > The basic idea is to compute featureType-placeNameID-containerID > combinations with a three-way join, determine which of those have a > count of 1, and then join that back to the same three-way join to get > the gazPlaceIDs corresponding to the unique combos (whew!). > > gazPlaces has about 6M entries, gazNamings and gazContainers each about > 10M. All of the fields above are integers, and I have indexes on > everything relevant, but the query still takes about eight hours. My > question is not (necessarily) how to improve the efficiency of this > query, but whether anyone can think of a faster way to compute the > uniques. Again, the goal is to find entries in gazPlaces that are the > only ones with their particular combination of feature type, name and > container. > > Any help is appreciated! > > - John Burger > MITRE > > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Nov 30, 2005, at 01:55, Martijn van Oosterhout wrote: > On Tue, Nov 29, 2005 at 09:58:49PM -0500, John D. Burger wrote: >> I could use some help with the following: >> >> I have a database of geographic entities with attributes spread across >> several tables. I need to determine which entities are unique with >> respect to some of those attributes. I'm using the following query: > > <snip> > > If you put the gazPlaceID as a result of the uniqs subquery, that would > avoid the second lookup, right? Whether it's much faster is the > question. So something like: > > select p1.gazPlaceID > from gazPlaces as p1 > join gazNamings as n1 using (gazPlaceID) > join gazContainers as c1 using (gazPlaceID) > group by p1.gazPlaceID, p1.featureType, n1.placeNameID, > c1.containerID > having count(*) = 1 The problem is that then every row is unique, because gazPlaceID is a primary key. As far as I can see, I need to group on just the other three columns - they're the ones for which I'm interested in uniqueness. > Secondly, what does the plan look like? Is it materialising or sorting > at any stage? > Finally, what version of postgres? Version is 7.4.8 for Solaris. Below is (a version of) the query again, as well as the plan. No materialization, I think, but it appears to be sorting the first three-way join to do the counts, then sorting the second one to merge. Cost estimates are way off, as the final result has almost 10M rows, but everything is analyzed, with indexes on every column, although none of them get used. Again, any suggestions on tweaking this query, or a completely different approach to finding the entities with unique combinations, would be much appreciated. - John Burger MITRE select p2.gazPlaceID, u.* into table tempCanonical_nameMatchEquiver_3435_1 from (select p1.featureType, n1.placeNameID, c1.containerID from gazPlaces as p1 join gazNamings as n1 using (gazPlaceID) join gazContainers as c1 using (gazPlaceID) group by p1.featureType, n1.placeNameID, c1.containerID having count(*) = 1) as u, gazPlaces as p2 join gazNamings as n2 using (gazPlaceID) join gazContainers as c2 using (gazPlaceID) where u.featureType = p2.featureType and u.placeNameID = n2.placeNameID and u.containerID = c2.containerID; Hash Join (cost=4009535.96..4316595.92 rows=306 width=16) Hash Cond: (("outer".gazplaceid = "inner".gazplaceid) AND ("outer".containerid = "inner".containerid)) -> Seq Scan on gazcontainers c2 (cost=0.00..141636.45 rows=9193945 width=8) -> Hash (cost=4006472.81..4006472.81 rows=282029 width=20) -> Merge Join (cost=3777226.54..4006472.81 rows=282029 width=20) Merge Cond: (("outer".featuretype = "inner".featuretype) AND ("outer".placenameid = "inner".placenameid)) -> Subquery Scan u (cost=2107001.20..2259698.67 rows=5552635 width=12) -> GroupAggregate (cost=2107001.20..2204172.32 rows=5552635 width=12) Filter: (count(*) = 1) -> Sort (cost=2107001.20..2120882.79 rows=5552635 width=12) Sort Key: p1.featuretype, n1.placenameid, c1.containerid -> Hash Join (cost=688064.17..1217844.46 rows=5552635 width=12) Hash Cond: ("outer".gazplaceid = "inner".gazplaceid) -> Seq Scan on gaznamings n1 (cost=0.00..156331.05 rows=10147805 width=8) -> Hash (cost=642816.39..642816.39 rows=6128713 width=16) -> Hash Join (cost=160244.91..642816.39 rows=6128713 width=16) Hash Cond: ("outer".gazplaceid = "inner".gazplaceid) -> Seq Scan on gazcontainers c1 (cost=0.00..141636.45 rows=9193945 width=8) -> Hash (cost=120982.13..120982.13 rows=6128713 width=8) -> Seq Scan on gazplaces p1 (cost=0.00..120982.13 rows=6128713 width=8) -> Sort (cost=1670225.33..1685547.11 rows=6128713 width=16) Sort Key: p2.featuretype, n2.placenameid -> Hash Join (cost=160244.91..684040.19 rows=6128713 width=16) Hash Cond: ("outer".gazplaceid = "inner".gazplaceid) -> Seq Scan on gaznamings n2 (cost=0.00..156331.05 rows=10147805 width=8) -> Hash (cost=120982.13..120982.13 rows=6128713 width=8) -> Seq Scan on gazplaces p2 (cost=0.00..120982.13 rows=6128713 width=8)
Jim C. Nasby wrote: > It will probably be a win to come up with a list of potential records > from each table, instead of after doing the 3-way join. so something > like: > > (SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING > count(*)=1) > JOIN > (SELECT ...) Hmm, not sure I understand. Joining the uniques wrt each of the three columns does not result in the unique triples, or even a superset of them, at least in this case. > If you post the output of explain (or explain analyze is even better) > then people could probably make better suggestions. Okay, I just posted the query plan. I will try to run an EXPLAIN ANALYZE tonight. Again, I'm also interested in completely different approaches to discovering the entities with unique attribute combinations. Intuitively, the query I posted is doing "too much work", because it's computing the total count for each combo, when all I really need is to know if the count is 1 or greater than 1. Thanks! - John Burger MITRE
On Wed, Nov 30, 2005 at 01:20:19PM -0500, John D. Burger wrote: > >select p1.gazPlaceID > > from gazPlaces as p1 > > join gazNamings as n1 using (gazPlaceID) > > join gazContainers as c1 using (gazPlaceID) > > group by p1.gazPlaceID, p1.featureType, n1.placeNameID, > >c1.containerID > > having count(*) = 1 > > The problem is that then every row is unique, because gazPlaceID is a > primary key. As far as I can see, I need to group on just the other > three columns - they're the ones for which I'm interested in > uniqueness. AIUI, according to the JOIN conditions in the query you have n1.gazPlaceID = c1.gazPlaceID = p1.gazPlaceID so grouping by one of those shouldn't affect the query result. Are the tables wide? Maybe you're losing a lot of time transferring data you don't need. Other than that I can't think of any neat tricks... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
On Tue, 2005-11-29 at 20:58, John D. Burger wrote: > I could use some help with the following: > > I have a database of geographic entities with attributes spread across > several tables. I need to determine which entities are unique with > respect to some of those attributes. I'm using the following query: > > select p2.gazPlaceID from > (select p1.featureType, n1.placeNameID, c1.containerID > from gazPlaces as p1 > join gazNamings as n1 using (gazPlaceID) > join gazContainers as c1 using (gazPlaceID) > group by p1.featureType, n1.placeNameID, c1.containerID > having count(*) = 1) as uniqs, > gazPlaces as p2 > join gazNamings as n2 using (gazPlaceID) > join gazContainers as c2 using (gazPlaceID) > where uniqs.featureType = p2.featureType > and uniqs.placeNameID = n2.placeNameID > and uniqs.containerID = c2.containerID; > > The basic idea is to compute featureType-placeNameID-containerID > combinations with a three-way join, determine which of those have a > count of 1, and then join that back to the same three-way join to get > the gazPlaceIDs corresponding to the unique combos (whew!). > > gazPlaces has about 6M entries, gazNamings and gazContainers each about > 10M. All of the fields above are integers, and I have indexes on > everything relevant, but the query still takes about eight hours. My > question is not (necessarily) how to improve the efficiency of this > query, but whether anyone can think of a faster way to compute the > uniques. Again, the goal is to find entries in gazPlaces that are the > only ones with their particular combination of feature type, name and > container. OK, let's assume that the basic part of it, before the group by, has been put into a view, so we can then do: select pkey1, field2, field3, field4 from view; And we know that pkey1 is unique, but we want the records where pkey1 is the only thing different between them, right? select v1.pkey1, v1.field2, v1.field3, v1.field4, v2.pkey1, v2.field2, v2.field3, v2.field4, from view v1 join view v2 on ( v1.field2=v2.field2 and v1.field3=v2.field3 and v1.field3=v2.field3 and v1.pkey1<>v2.pkey ) How does that work?
Scott Marlowe wrote: > OK, let's assume that the basic part of it, before the group by, has > been put into a view, so we can then do: > > select pkey1, field2, field3, field4 from view; > > And we know that pkey1 is unique, but we want the records where pkey1 > is > the only thing different between them, right? Hmm, I'm explaining this really badly :). I should have defined a view like you suggest to help simplify it. What I want is the pkeys (and the field values) where no other pkey has that triple of field values. That's why my earlier query does a group by the fields and then having count(*) = 1. Also, FWIW, pkey1 is unique in its original table, but not in the view, since some of the other tables are one-to-many. > select > v1.pkey1, > v1.field2, > v1.field3, > v1.field4, > v2.pkey1, > v2.field2, > v2.field3, > v2.field4, > from > view v1 > join > view v2 > on ( > v1.field2=v2.field2 and > v1.field3=v2.field3 and > v1.field3=v2.field3 and > v1.pkey1<>v2.pkey > ) > > How does that work? Won't this be a massive cross product of all pkey pairs that have the same field values? Here's what I'm currently using, in terms of your very helpful view: select v1.pkey1, v1.field2, v1.field3, v1.field4 from view as v1 join (select v2.field1, v2.field2, v2.field3 from view as v2 group by v2.field2, v2.field3, v2.field4 having count(*) = 1) using (field2, field3, field4); This is the one that takes eight hours. :( Another way to express what I want is this: select v1.pkey1, v1.field2, v1.field3, v1.field4 from view as v1 where not exists (select true from view as v2 where v1.field2 = v2.field2 and v1.field3 = v2.field3 and v1.field4 = v2.field4 and v1.pkey1 <> v2.pkey1); That looks like a horrible nested loop, but I suppose I should try it to make sure it is indeed slower then the previous query. - John Burger MITRE
On Wed, 2005-11-30 at 16:27, John D. Burger wrote: > Scott Marlowe wrote: > > > select > > v1.pkey1, > > v1.field2, > > v1.field3, > > v1.field4, > > v2.pkey1, > > v2.field2, > > v2.field3, > > v2.field4, > > from > > view v1 > > join > > view v2 > > on ( > > v1.field2=v2.field2 and > > v1.field3=v2.field3 and > > v1.field3=v2.field3 and > > v1.pkey1<>v2.pkey > > ) > > > > How does that work? > > Won't this be a massive cross product of all pkey pairs that have the > same field values? > Yes, assuming there are a lot of them. OTOH, if there are only a few duplicates you're looking for... How many are you expecting, percentage wise, to get back? > Here's what I'm currently using, in terms of your very helpful view: > > select v1.pkey1, v1.field2, v1.field3, v1.field4 > from view as v1 > join > (select v2.field1, v2.field2, v2.field3 > from view as v2 > group by v2.field2, v2.field3, v2.field4 > having count(*) = 1) > using (field2, field3, field4); > > This is the one that takes eight hours. :( Another way to express what > I want is this: > > select v1.pkey1, v1.field2, v1.field3, v1.field4 > from view as v1 > where not exists > (select true from view as v2 > where v1.field2 = v2.field2 > and v1.field3 = v2.field3 > and v1.field4 = v2.field4 > and v1.pkey1 <> v2.pkey1); > > That looks like a horrible nested loop, but I suppose I should try it > to make sure it is indeed slower then the previous query. If you can allocated enough shared memory for the set to fit in memory, you might be able to get a hash agg method, which is much faster than most other methods for this kind of thing, since it requires no sort. In a side point, I'm currently mushing 888,000,000 6 character codes up against each other to check for duplicates. I have 6 machines doing this, at 1 million codes compared to 1 million codes every 0.5 seconds aggregate. That gets me down to about 1 week. So, 8 hours is seeming quite fast. :)
On Wed, Nov 30, 2005 at 01:29:17PM -0500, John D. Burger wrote: > Jim C. Nasby wrote: > > >It will probably be a win to come up with a list of potential records > >from each table, instead of after doing the 3-way join. so something > >like: > > > >(SELECT gazPlaceID FROM gazPlaces GROUP BY featureType HAVING > >count(*)=1) > >JOIN > >(SELECT ...) > > Hmm, not sure I understand. Joining the uniques wrt each of the three > columns does not result in the unique triples, or even a superset of > them, at least in this case. So featureType is *unique* within gazPlaces? If you look at the query plan, it's estimating 120k to scan one of the tables, but if you look at the hash join that joins the 3rd table to the first two it's estimating 1.2M. That's because it's dealing with a monsterous dataset; it's joining every single row with every other possible row. What I was trying to do was eliminate rows from that join that would mean you couldn't have a count of only 1. But my example actually changes the output, so that won't work. > >If you post the output of explain (or explain analyze is even better) > >then people could probably make better suggestions. > > Okay, I just posted the query plan. I will try to run an EXPLAIN > ANALYZE tonight. > > Again, I'm also interested in completely different approaches to > discovering the entities with unique attribute combinations. > Intuitively, the query I posted is doing "too much work", because it's > computing the total count for each combo, when all I really need is to > know if the count is 1 or greater than 1. The problem I see is that as you add more criteria (which is what each JOIN does), you reduce the count that you might get for every combination. Of course adding more rows increases the potential count. Combine the two and I don't think there's any other way to come up with the list of what's truely unique any other way. What I'm not sure of is if knowing certain things about gazPlaceID would allow you to make some assumptions that could simplify the query. I can't think of anything that would help there. Basically, I think you're stuck with the ugly 3-way join that's the core of the slowness. But, there is a gain to be had: doing the second set of joins bumps you from 1.2M to over 4M. Someone else suggested adding gazPlaceID to the GROUP BY; I definately think you should do that. Because of the equality conditions I can't conceive of any way that it would change anything. What *might* make a difference is which table you pull gazPlaceID from; again, because of the equality condition it doesn't matter which table you get them from, but I don't know if the optimizer understands that. So changing which table you pull that from could drastically effect how the 3-way join is done, which could also drastically effect the cost involved. Ideally you'd like the first 2-way join to be the one that results in the smallest amount of overlap. Also, if you're on 8.1, you might try experimenting with different indexes on the 3 tables. I think this could be an ideal candidate for a bitmap scan; I believe it's possible to build the entire set you want with a scan of 6 bitmaps, but I don't know if there's that much brains in the bitmap code yet. Actually, I think what would be ideal would be to build 3 bitmaps for the 3 fields you really want to group by and compare those; the trick would be building those bitmaps using indexes of gazPlaceID and the field you care about from each table. BTW, it might be easier for someone to come up with a solution if you post a plain-english description of what you're trying to find out, along with any facts about the data (such as what's unique, etc). -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Scott Marlowe wrote: >> Won't this be a massive cross product of all pkey pairs that have the >> same field values? > > Yes, assuming there are a lot of them. OTOH, if there are only a few > duplicates you're looking for... I'm not looking for duplicates, I'm looking for uniques - note the Subject line :). Here's an analogous problem: everybody in the room has eye color, hair color, and skin color. I want to find the people whose particular three-way combination is unique - no one else has that eye-hair-skin combo. The analogue to my current query is then like this: select p1.personID, p1.eyeColor, p1.hairColor, p1.skinColor from persons as p1 join (select p2.eyeColor, p2.hairColor, p2.skinColor from persons as p2 group by p2.eyeColor, p2.hairColor, p2.skinColor having count(*) = 1) using (eyeColor, hairColor, skinColor); The inner select finds the unique combinations, the outer one goes back and finds the peopleID corresponding to each unique combo. And the persons table is actually a view on a big three-way join. Jim Nasby wrote: > Someone else suggested adding gazPlaceID to the GROUP BY; I definately > think you should do that. That changes the semantics of what I want. If I group by personID above, then every FOUR-way combo is of course unique. What I'd like to do is group by the three attributes, and select for personID as well. But of course you can't select for columns you haven't grouped by. Sorry, I can't think of any other ways to explain what I'm doing. But thank you for your replies. - John Burger MITRE
On Wed, Nov 30, 2005 at 20:44:30 -0500, "John D. Burger" <john@mitre.org> wrote: > > That changes the semantics of what I want. If I group by personID > above, then every FOUR-way combo is of course unique. What I'd like to > do is group by the three attributes, and select for personID as well. > But of course you can't select for columns you haven't grouped by. Assuming that personID is an ordered type, you can select max(personID) in the GROUP BY and save the join at the end.
Bruno Wolff III wrote: >> That changes the semantics of what I want. If I group by personID >> above, then every FOUR-way combo is of course unique. What I'd like >> to >> do is group by the three attributes, and select for personID as well. >> But of course you can't select for columns you haven't grouped by. > > Assuming that personID is an ordered type, you can select max(personID) > in the GROUP BY and save the join at the end. I'm not sure what this means - do you mean: select p2.eyeColor, p2.hairColor, p2.skinColor from persons as p2 group by max(p2.personID), p2.eyeColor, p2.hairColor, p2.skinColor having count(*) = 1; I don't know what that does. If you mean: select max(p2.personID), p2.eyeColor, p2.hairColor, p2.skinColor from persons as p2 group by p2.personID, p2.eyeColor, p2.hairColor, p2.skinColor having count(*) = 1; then I don't think that works either - if I include personID in the GROUP BY, then the COUNT doesn't do what I want, right? I just want uniques wrt the three attribute fields. If I group by personID, then personID counts towards uniqueness. Thanks for all the suggestions, folks. - John Burger MITRE
John D. Burger napisał(a): > > select v1.pkey1, v1.field2, v1.field3, v1.field4 > from view as v1 > join > (select v2.field1, v2.field2, v2.field3 > from view as v2 > group by v2.field2, v2.field3, v2.field4 > having count(*) = 1) > using (field2, field3, field4); > > This is the one that takes eight hours. :( Another way to express > what I want is this: > > select v1.pkey1, v1.field2, v1.field3, v1.field4 > from view as v1 > where not exists > (select true from view as v2 > where v1.field2 = v2.field2 > and v1.field3 = v2.field3 > and v1.field4 = v2.field4 > and v1.pkey1 <> v2.pkey1); > > That looks like a horrible nested loop, but I suppose I should try it > to make sure it is indeed slower then the previous query. > Hi! Did you try the second query? I guess I should take consirerably less time than the first one. Usualy I do "these things" like this... This is the only possibility for the planner to use indexes. The query plan you send us shows that are mostly seq scans are used. Regards, Marcin Inkielman
"John D. Burger" <john@mitre.org> writes: > I don't know what that does. If you mean: > > select max(p2.personID), p2.eyeColor, p2.hairColor, p2.skinColor > from persons as p2 > group by p2.personID, p2.eyeColor, p2.hairColor, p2.skinColor > having count(*) = 1; > > then I don't think that works either - if I include personID in the GROUP BY, select max(personid) as personid, eyecolor, haircolor, skincolor from persons group by eyecolor, haircolor, skincolor having count(*) = 1 -- greg
Greg Stark wrote: > select max(personid) as personid, eyecolor, haircolor, skincolor > from persons > group by eyecolor, haircolor, skincolor > having count(*) = 1 Aha, I understand Bruno's suggestion now! I was actually trying to think of some way of using an aggregate on personid, but couldn't figure it out. Of course max does just what I want in this case, since max on one value gives that value - min and some other aggregate functions would work too. Very clever! On my actual problem, where "persons" is actually three joined tables, my original query took eight hours. The new query, modeled after the above, runs in almost exactly a tenth of the time. Thanks for all the suggestions! - John D. Burger MITRE