Thread: SQL Help: Multiple LEFT OUTER JOINs
I need to generate a table of teachers, and the count of classes they taught in the past and are scheduled to teach in the future. id | last_name | total future_class_count | past_class_count -----+--------------+----------+--------------------+----------------- 3 | Smith | 12 | 3 | 9 8 | Jones | 0 | 0 | 0 table person id last_name table class id class_time table role id role_name -- for limiting to a type of teacher -- link tables table person_role person references person role references role -- This table ties a person to a class, thus making them an instructor table instructors person references person class references class I can easily get instructors and the total count of their classes: SELECT person.id AS id, last_name, count(instructors.class) FROM person LEFT OUTER JOIN instructors ON (person.id = instructors.person), person_role WHERE person_role.person = person.id AND person_role.role = 3 -- limit to this type of teacher GROUP BY id, last_name; Here's where I'm missing something. Trying to do an outer join on to bring in the class row with its class_time column: SELECT person.id AS id, last_name, count(instructors.class) as total, sum (CASE WHEN class_time > now() THEN 1 ELSE 0 END) as future_class_count, sum (CASE WHEN class_time <= now() THEN 1 ELSE 0 END) as past_class_count FROM (person LEFT OUTER JOIN instructors ON (person.id = instructors.person)) t LEFT OUTER JOIN class on ( t.class = class.id ), person_role WHERE person_role.person = person.id AND person_role.role = 3 GROUP BY person.id, last_name; -- Bill Moseley moseley@hank.org
On Mon, Nov 21, 2005 at 05:40:10 -0800, Bill Moseley <moseley@hank.org> wrote: > > Here's where I'm missing something. Trying to do an outer join on > to bring in the class row with its class_time column: You don't say exactly why you are having a problem with this, but I think you would be better off doing an inner join between instructors and class and then do an outer join of that result to person. > > > SELECT person.id AS id, last_name, > count(instructors.class) as total, > sum (CASE WHEN class_time > now() THEN 1 ELSE 0 END) as future_class_count, > sum (CASE WHEN class_time <= now() THEN 1 ELSE 0 END) as past_class_count > > > FROM (person LEFT OUTER JOIN instructors ON (person.id = instructors.person)) t > LEFT OUTER JOIN class on ( t.class = class.id ), > person_role > > WHERE person_role.person = person.id > AND person_role.role = 3 > > GROUP BY person.id, last_name; > > > > > > -- > Bill Moseley > moseley@hank.org > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly
On Mon, Nov 21, 2005 at 11:45:34AM -0600, Bruno Wolff III wrote: > On Mon, Nov 21, 2005 at 05:40:10 -0800, > Bill Moseley <moseley@hank.org> wrote: > > > > Here's where I'm missing something. Trying to do an outer join on > > to bring in the class row with its class_time column: > > You don't say exactly why you are having a problem with this, but I think you > would be better off doing an inner join between instructors and class and > then do an outer join of that result to person. Sorry, I thought I was so far off it might be obvious. I suspect I'm making the query harder than it really is. This query just eats CPU and doesn't seem to finish, but I didn't let it run more than a minute (which is forever as far as I'm concerned). The tables are not that big (10,000 people, 1500 classes) > > SELECT person.id AS id, last_name, > > count(instructors.class) as total, > > sum (CASE WHEN class_time > now() THEN 1 ELSE 0 END) as future_class_count, > > sum (CASE WHEN class_time <= now() THEN 1 ELSE 0 END) as past_class_count > > > > > > FROM (person LEFT OUTER JOIN instructors ON (person.id = instructors.person)) t > > LEFT OUTER JOIN class on ( t.class = class.id ), > > person_role > > > > WHERE person_role.person = person.id > > AND person_role.role = 3 > > > > GROUP BY person.id, last_name; Well, I'm stabbing in the dark now. You mean like: SELECT person.id AS id, first_name, last_name, count(instructors.class) as total_classes, sum (CASE WHEN class.id IS NULL THEN 0 ELSE 1 END) as total_class_count, -- which is better? sum (CASE WHEN class_time > now() THEN 1 ELSE 0 END) as future_class_count, sum (CASE WHEN class_time <= now() THEN 1 ELSE 0 END) as past_class_count FROM (class INNER JOIN instructors ON ( class.id = instructors.class )) t LEFT OUTER JOIN person ON ( person.id = t.person ), person_role WHERE person_role.person = person.id AND person_role.role = 3 GROUP BY person.id, first_name, last_name; Still eats CPU. GroupAggregate (cost=1750458.67..1890662.91 rows=10212 width=39) -> Sort (cost=1750458.67..1767958.67 rows=7000000 width=39) Sort Key: person.id, person.first_name, person.last_name -> Nested Loop (cost=111.27..140276.35 rows=7000000 width=39) -> Nested Loop (cost=91.27..256.35 rows=7000 width=35) -> Hash Join (cost=71.27..96.35 rows=7 width=31) Hash Cond: ("outer".id = "inner"."class") -> Seq Scan on "class" (cost=0.00..20.00 rows=1000 width=12) -> Hash (cost=71.25..71.25 rows=7 width=27) -> Nested Loop (cost=3.20..71.25 rows=7 width=27) -> Hash Join (cost=3.20..30.77 rows=7 width=12) Hash Cond: ("outer".person = "inner".person) -> Seq Scan on instructors (cost=0.00..20.00 rows=1000 width=8) -> Hash (cost=3.01..3.01 rows=75 width=4) -> Index Scan using person_role_role_index on person_role (cost=0.00..3.01rows=75 width=4) Index Cond: (role = 3) -> Index Scan using person_pkey on person (cost=0.00..5.77 rows=1 width=23) Index Cond: ("outer".person = person.id) -> Materialize (cost=20.00..30.00 rows=1000 width=4) -> Seq Scan on instructors (cost=0.00..20.00 rows=1000 width=4) -> Materialize (cost=20.00..30.00 rows=1000 width=4) -> Seq Scan on "class" (cost=0.00..20.00 rows=1000 width=4) (22 rows) > > > > > > > > > > > > -- > > Bill Moseley > > moseley@hank.org > > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 1: if posting/reading through Usenet, please send an appropriate > > subscribe-nomail command to majordomo@postgresql.org so that your > > message can get through to the mailing list cleanly > -- Bill Moseley moseley@hank.org
It looks to me like your problem is that weird area where you alias your inner join as "t" and thenn inner join based on this alias. You're getting a cartesian product somewhere, as evidenced by the "rows=7000000" in your explain. I already deleted the old mail with your table structure, but try changing that FROM section to: class INNER JOIN instructors ON class.id = instructors.class LEFT OUTER JOIN person ON person.id = (whatevertable).person Bill Moseley wrote: > > Well, I'm stabbing in the dark now. You mean like: > > SELECT person.id AS id, first_name, last_name, > count(instructors.class) as total_classes, > sum (CASE WHEN class.id IS NULL THEN 0 ELSE 1 END) as > total_class_count, -- which is better? > sum (CASE WHEN class_time > now() THEN 1 ELSE 0 END) as > future_class_count, > sum (CASE WHEN class_time <= now() THEN 1 ELSE 0 END) > as past_class_count > > > FROM (class INNER JOIN instructors ON ( class.id = > instructors.class )) t > LEFT OUTER JOIN person ON ( person.id = t.person ), > person_role > WHERE person_role.person = person.id > AND person_role.role = 3 > > GROUP BY person.id, first_name, last_name; > > Still eats CPU. > > GroupAggregate (cost=1750458.67..1890662.91 rows=10212 width=39) > -> Sort (cost=1750458.67..1767958.67 rows=7000000 width=39) > Sort Key: person.id, person.first_name, person.last_name > -> Nested Loop (cost=111.27..140276.35 rows=7000000 width=39) > -> Nested Loop (cost=91.27..256.35 rows=7000 width=35) > -> Hash Join (cost=71.27..96.35 rows=7 width=31) > Hash Cond: ("outer".id = "inner"."class") > -> Seq Scan on "class" (cost=0.00..20.00 > rows=1000 width=12) > -> Hash (cost=71.25..71.25 rows=7 width=27) > -> Nested Loop (cost=3.20..71.25 > rows=7 width=27) > -> Hash Join (cost=3.20..30.77 > rows=7 width=12) > Hash Cond: ("outer".person > = "inner".person) > -> Seq Scan on > instructors (cost=0.00..20.00 rows=1000 width=8) > -> Hash (cost=3.01..3.01 > rows=75 width=4) > -> Index Scan using > person_role_role_index on person_role (cost=0.00..3.01 rows=75 width=4) > Index Cond: > (role = 3) > -> Index Scan using person_pkey > on person (cost=0.00..5.77 rows=1 width=23) > Index Cond: > ("outer".person = person.id) > -> Materialize (cost=20.00..30.00 rows=1000 > width=4) > -> Seq Scan on instructors > (cost=0.00..20.00 rows=1000 width=4) > -> Materialize (cost=20.00..30.00 rows=1000 width=4) > -> Seq Scan on "class" (cost=0.00..20.00 > rows=1000 width=4) > (22 rows) > > > > >
I just noticed, also goofy is your ", person_role" in your from with no criteria. I would generally put the "person_role.person = person.id" as an INNER JOIN, and then only have the "person_role.role=3" in the where. It doesn't look like that's the specific problem, but I generally find that kind of mixed syntax muddles a query. John McCawley wrote: >> Well, I'm stabbing in the dark now. You mean like: >> >> SELECT person.id AS id, first_name, last_name, >> count(instructors.class) as total_classes, >> sum (CASE WHEN class.id IS NULL THEN 0 ELSE 1 END) as >> total_class_count, -- which is better? >> sum (CASE WHEN class_time > now() THEN 1 ELSE 0 END) >> as future_class_count, >> sum (CASE WHEN class_time <= now() THEN 1 ELSE 0 END) >> as past_class_count >> >> >> FROM (class INNER JOIN instructors ON ( class.id = >> instructors.class )) t >> LEFT OUTER JOIN person ON ( person.id = t.person ), >> person_role >> WHERE person_role.person = person.id >> AND person_role.role = 3 >> >> GROUP BY person.id, first_name, last_name; >> >> Still eats CPU. >> >> GroupAggregate (cost=1750458.67..1890662.91 rows=10212 width=39) >> -> Sort (cost=1750458.67..1767958.67 rows=7000000 width=39) >> Sort Key: person.id, person.first_name, person.last_name >> -> Nested Loop (cost=111.27..140276.35 rows=7000000 width=39) >> -> Nested Loop (cost=91.27..256.35 rows=7000 width=35) >> -> Hash Join (cost=71.27..96.35 rows=7 width=31) >> Hash Cond: ("outer".id = "inner"."class") >> -> Seq Scan on "class" (cost=0.00..20.00 >> rows=1000 width=12) >> -> Hash (cost=71.25..71.25 rows=7 width=27) >> -> Nested Loop (cost=3.20..71.25 >> rows=7 width=27) >> -> Hash Join >> (cost=3.20..30.77 rows=7 width=12) >> Hash Cond: >> ("outer".person = "inner".person) >> -> Seq Scan on >> instructors (cost=0.00..20.00 rows=1000 width=8) >> -> Hash >> (cost=3.01..3.01 rows=75 width=4) >> -> Index Scan >> using person_role_role_index on person_role (cost=0.00..3.01 rows=75 >> width=4) >> Index Cond: >> (role = 3) >> -> Index Scan using >> person_pkey on person (cost=0.00..5.77 rows=1 width=23) >> Index Cond: >> ("outer".person = person.id) >> -> Materialize (cost=20.00..30.00 rows=1000 >> width=4) >> -> Seq Scan on instructors >> (cost=0.00..20.00 rows=1000 width=4) >> -> Materialize (cost=20.00..30.00 rows=1000 width=4) >> -> Seq Scan on "class" (cost=0.00..20.00 >> rows=1000 width=4) >> (22 rows) > >
On Mon, Nov 21, 2005 at 03:25:56PM -0600, John McCawley wrote: > I just noticed, also goofy is your ", person_role" in your from with no > criteria. I would generally put the "person_role.person = person.id" as > an INNER JOIN, and then only have the "person_role.role=3" in the > where. It doesn't look like that's the specific problem, but I > generally find that kind of mixed syntax muddles a query. I need to read more about the FROM clause, as I can't seem to get what you are suggesting. The now working query (thanks to you!) is: SELECT person.id AS id, first_name, last_name, count(instructors.class) as total_classes, -- sum (CASE WHEN class.id IS NULL THEN 0 ELSE 1 END) as total_class_count, sum (CASE WHEN class_time > now() THEN 1 ELSE 0 END) as future_class_count, sum (CASE WHEN class_time <= now() THEN 1 ELSE 0 END) as past_class_count FROM class INNER JOIN instructors ON class.id = instructors.class LEFT OUTER JOIN person ON person.id = instructors.person, person_role WHERE person_role.person = person.id AND person_role.role = 2 GROUP BY person.id, first_name, last_name ORDER BY future_class_count; Not sure how to construct that. Not this, as it returns odd counts SELECT person.id AS id, count(instructors.class) as total_classes, sum (CASE WHEN class.id IS NULL THEN 0 ELSE 1 END) as total_class_count, -- which is better? sum (CASE WHEN class_time > now() THEN 1 ELSE 0 END) as future_class_count, sum (CASE WHEN class_time <= now() THEN 1 ELSE 0 END) as past_class_count FROM class INNER JOIN instructors ON class.id = instructors.class LEFT OUTER JOIN person ON person.id = instructors.person INNER JOIN person_role ON person_role.person = person.id WHERE person_role.role = 2 GROUP BY person.id, first_name, last_name ORDER BY total_classes; id | total_classes | total_class_count | future_class_count | past_class_count -----+---------------+-------------------+--------------------+------------------ 90 | 1 | 1 | 0 | 1 98 | 1 | 1 | 0 | 1 92 | 1 | 1 | 0 | 1 123 | 1 | 1 | 0 | 1 122 | 1 | 1 | 0 | 1 121 | 2 | 2 | 0 | 2 66 | 2 | 2 | 0 | 2 74 | 2 | 2 | 0 | 2 56 | 2 | 2 | 0 | 2 85 | 2 | 2 | 0 | 2 119 | 2 | 2 | 0 | 2 41 | 2 | 2 | 0 | 2 33 | 2 | 2 | 0 | 2 65 | 2 | 2 | 0 | 2 105 | 3 | 3 | 0 | 3 83 | 3 | 3 | 0 | 3 102 | 3 | 3 | 0 | 3 32 | 4 | 4 | 0 | 4 71 | 4 | 4 | 0 | 4 70 | 4 | 4 | 0 | 4 14 | 4 | 4 | 0 | 4 29 | 4 | 4 | 0 | 4 77 | 4 | 4 | 0 | 4 86 | 4 | 4 | 0 | 4 50 | 4 | 4 | 0 | 4 107 | 4 | 4 | 0 | 4 8 | 4 | 4 | 0 | 4 114 | 4 | 4 | 0 | 4 42 | 4 | 4 | 0 | 4 82 | 4 | 4 | 0 | 4 28 | 4 | 4 | 0 | 4 17 | 4 | 4 | 0 | 4 52 | 4 | 4 | 0 | 4 9 | 4 | 4 | 0 | 4 31 | 4 | 4 | 0 | 4 -- Bill Moseley moseley@hank.org
On Mon, Nov 21, 2005 at 02:06:35PM -0800, Bill Moseley wrote: > The now working query (thanks to you!) is: No that doesn't work. It's dropping the people that have never been assigned a class to teach (i.e. don't have a row in the "instructors" link table). > FROM class INNER JOIN instructors ON class.id = instructors.class > LEFT OUTER JOIN person ON person.id = instructors.person, > person_role I really seem to need the multiple left outer join. This works: SELECT person.id AS id, last_name, person_role.role AS role, count(instructors.class), sum (CASE WHEN class_time > now() THEN 1 ELSE 0 END) as future_class_count, sum (CASE WHEN class_time <= now() THEN 1 ELSE 0 END) as past_class_count FROM person LEFT OUTER JOIN instructors ON (person.id = instructors.person) LEFT OUTER JOIN class ON (instructors.class = class.id), person_role WHERE person_role.person = person.id -- AND person_role.role = 2 GROUP BY person.id, last_name, person_role.role; I'm not clear how to move that "person_role.person = person.id" into the FROM statement. Does it matter? -- Bill Moseley moseley@hank.org
> I'm not clear how to move that "person_role.person = person.id" into the FROM statement. Does it matter? This should work: FROM person INNER JOIN person_role ON person.id = person_role.person LEFT OUTER JOIN instructors ON (person.id = instructors.person) LEFT OUTER JOIN class ON (instructors.class = class.id), person_role GROUP BY person.id, last_name, person_role.role; The reason *I* think it matters is that I like to keep my "constant" join clauses in the from, and my variable criteria in the where. i.e. in your query, you always want the role that joins to the person...there are no cases (in this query) where you want other roles, therefore I wouldn't put it in the where. I leave my where clause free for criteria such as "where person.id = $variable" etc. I will occasionally put variables higher in my joins, but only if EXPLAIN ANALYZE tells me it's faster (which rarely happens for me), but I NEVER wait until my where to clarify the fundamental criteria whereby a table in the FROM is joined. It's just my preference for the sake of clarity. In other words I would: SELECT person.last_name, person_role.role FROM person_role INNER JOIN person ON person_role.person = person.id WHERE person.id = $variable rather than: SELECT person.last_name, person_role.role FROM person_role, person WHERE person_role.person = person.id AND person.id = $variable > >