Thread: SELECT too complex?
This is a rather ill-defined enquiry. The main reason behind it is to find out if I'm going down the right path as a Postgres newbie. I have a set of complex selects that I'm worried about from the point of view of 1) can't these be made simpler 2) performance 3) describing the selects to other developers in my documentation. One of the complex selects is shown below. The database schema consists of objects, which are shared on boards by being recorded in board_objects. Individuals can view things on a board they themselves are joined to (people are like objects). People are joined to specific board_objects as board_object_members, or may comment on specific board_objects. It is relatively straightforward doing a query on a board_object and discerning what object it refers to and what comments it has aggregated against it. The complex query below provides a list of objects which a person has commented on or joined, garnering summaries of the comments on those objects across boards the user is joined to, and sorting the objects on the basis of the last updated board_object instance of that object which, again, the user has access to. I realise life would be easier if there was more duplicated data. General comments on the nature of the select, how to optimise it or rewrite it, and how to document what it does would be most appreciated. Rory A rough sketch of the main tables: +----------+ +----------+ | people | +-----------+ | boards | +----+-----+ | objects | +----+-----+ | +-----+-----+ | | | | +----+---------------+--------------+-----+ | board objects | + ---------+--------------------+---------+ | | +-------+-------+ +------+-------+ | obj members | | comments | +---------------+ +--------------+ Example for a personal view as described above: (the query runs within a function): FOR resulter IN SELECT o.t_text_id as itemid, pers.t_nickname as itemauthor, o.n_type as itemtype, o.t_name as itemtitle, o.t_description as itembody, to_char(o.dt_created, 'DD Mon'') as itemcreated, to_char(botimer, ''DD on'') as itemupdated, COALESCE (c_count, 0) as itemcommentcount, p.n_id as imageid, COALESCE(p.t_path, ''none'') as imagesrc, p.n_width as imagewidth, p.n_height as imageheight FROM people pers, objects o LEFT OUTER JOIN photo p ON o.n_id_photo = p.n_id LEFT OUTER JOIN ( SELECT bo.n_object_id as boider, max(bo.dt_modified) as botimer FROM board_objects bo, board_object_members bom WHERE bom.n_person_id = personid AND bom.n_board_object_id = bo.n_id GROUP BY bo.n_object_id ) as timer ON o.n_id = timer.boider LEFT OUTER JOIN ( SELECT bo.n_object_id as c_bo_id, count(comm.n_id) as c_count FROM board_objects bo, comments comm WHERE comm.n_board_object_id = bo.n_id AND bo.n_board_id IN (SELECT n_board_id FROM board_objects WHERE n_object_id = persobjid ) GROUP BY bo.n_object_id, bo.n_object_id ) as counter ON o.n_id = counter.c_bo_id WHERE o.n_creator = pers.n_id AND o.n_id IN ( SELECT DISTINCT ON (bo.dt_modified, bo.n_object_id) bo.n_object_id as bo_obid FROM board_objects bo, board_object_members bom WHERE bo.n_id = bom.n_board_object_id AND bom.n_person_id = personid ) ORDER BY botimer DESC LIMIT locallimit OFFSET localoffset LOOP RETURN NEXT resulter; END LOOP; -- Rory Campbell-Lange <rory@campbell-lange.net> <www.campbell-lange.net>
Rory, > I have a set of complex selects that I'm worried about from the point of > view of 1) can't these be made simpler 2) performance 3) describing the > selects to other developers in my documentation. One of the complex > selects is shown below. Keep in mind that (2) is not in sync with (1) or (3); I have frequently run into situations where huge, messy-looking versions of queries run much faster than simpler-looking queries that produce the same results. i.e. "simpler" to a human is not necessarity "simpler" to the planner. > General comments on the nature of the select, how to optimise it or > rewrite it, and how to document what it does would be most appreciated. Sure. My general experience with complex queries has been that: 1) The more complex a query, the more possible ways there are to write it. 2) With a query which can be written many ways, the best way to "performance-tune" it is to try different query structures and keep re-executing it against the back end until you get something you're happy with (EXPLAIN ANALYZE can help) As for simplifying, I'm afraid that I don't really understand what you're trying to get with this query. If I'm clueless, send me a version of the query with each clause commented. > FOR resulter IN > SELECT > o.t_text_id as itemid, > pers.t_nickname as itemauthor, > o.n_type as itemtype, > o.t_name as itemtitle, > o.t_description as itembody, > to_char(o.dt_created, 'DD Mon'') as itemcreated, > to_char(botimer, ''DD on'') as itemupdated, > COALESCE (c_count, 0) as itemcommentcount, > p.n_id as imageid, > COALESCE(p.t_path, ''none'') as imagesrc, > p.n_width as imagewidth, > p.n_height as imageheight > FROM > people pers, > objects o > LEFT OUTER JOIN photo p ON o.n_id_photo = p.n_id > LEFT OUTER JOIN ( > SELECT > bo.n_object_id as boider, > max(bo.dt_modified) as botimer > FROM > board_objects bo, board_object_members bom > WHERE > bom.n_person_id = personid > AND > bom.n_board_object_id = bo.n_id > GROUP BY > bo.n_object_id > ) as timer You have a lot of left outer joins on aggregate subselects in this query. try putting the subselect in the SELECT clause, and using ORDER BY ____ DESC LIMIT 1 instead of MAX Also, I fail to see the reason you need to left outer join on Board_Objects, when as subselect to board_objects is in your WHERE criteria. Presumably, if there are no records in board_objects, you'll lose the row from person anyway. What about doing a straight join to board_objects and simplifing your outer joins thereby? > ON o.n_id = timer.boider > LEFT OUTER JOIN ( > SELECT > bo.n_object_id as c_bo_id, > count(comm.n_id) as c_count > FROM > board_objects bo, comments comm > WHERE > comm.n_board_object_id = bo.n_id > AND > bo.n_board_id IN You *definitely* don't want to use IN in this subquery, unless you're on 7.4. Use EXISTS instead. I'm also unclear on why you need this IN subquery at all. You're checking board_objects against itself; why not just put it in the main subselect? > (SELECT > n_board_id > FROM > board_objects > WHERE > n_object_id = persobjid > ) > GROUP BY > bo.n_object_id, bo.n_object_id > ) as counter > ON o.n_id = counter.c_bo_id > WHERE > o.n_creator = pers.n_id > AND > o.n_id IN ( Once again, through 7.3, use EXISTS, not IN. > SELECT DISTINCT ON (bo.dt_modified, bo.n_object_id) > bo.n_object_id as bo_obid Why do you need a SELECT DISTINCT if you're just checking for existence? > FROM > board_objects bo, board_object_members bom > WHERE > bo.n_id = bom.n_board_object_id > AND > bom.n_person_id = personid > ) > ORDER BY > botimer DESC > LIMIT > locallimit > OFFSET > localoffset LOOP Overall, yeah, I think you're making the query too complex. You have 3 subselects consulting the same table; surely at least 2 of these could be consolidated. -- Josh Berkus Aglio Database Solutions San Francisco
Hi Josh. Thanks very much for your input. On 27/06/03, Josh Berkus (josh@agliodbs.com) wrote: > As for simplifying, I'm afraid that I don't really understand what > you're trying to get with this query. If I'm clueless, send me a > version of the query with each clause commented. The aim of the select is to find all objects which a person has joined. Objects are shared on boards, recorded in board_objects. People join objects as board_object_members, which point to instances of board_objects. So, an object can show up on one or more boards. One can be a member of the object on one or more boards. One may also not have the ability to "see" a board_object record of an object because one isn't joined to that particular board. So I am also an object and my membership of a board depends on whether or not a board_objects record exists for that board. (See ***) Mmm. Clear as mud. > > FOR resulter IN > > SELECT > > o.t_text_id as itemid, > > pers.t_nickname as itemauthor, > > o.n_type as itemtype, > > o.t_name as itemtitle, > > o.t_description as itembody, > > to_char(o.dt_created, 'DD Mon'') as itemcreated, > > to_char(botimer, ''DD on'') as itemupdated, > > COALESCE (c_count, 0) as itemcommentcount, > > p.n_id as imageid, > > COALESCE(p.t_path, ''none'') as imagesrc, > > p.n_width as imagewidth, > > p.n_height as imageheight > > FROM > > people pers, > > objects o > > LEFT OUTER JOIN photo p ON o.n_id_photo = p.n_id -- join photo on objects. There may or may -- not be a photo. > > LEFT OUTER JOIN ( > > SELECT > > bo.n_object_id as boider, > > max(bo.dt_modified) as botimer > > FROM > > board_objects bo, board_object_members bom > > WHERE > > bom.n_person_id = personid *** > > AND > > bom.n_board_object_id = bo.n_id > > GROUP BY > > bo.n_object_id > > ) as timer -- here I grab the most recently updated instance of a -- board_ object to which I am attached. It is important to -- grab the latest as I may be attached to the same object -- on several boards, each which may have different -- update times. -- personid is my object id. > > You have a lot of left outer joins on aggregate subselects in this query. > try putting the subselect in the SELECT clause, and using ORDER BY ____ DESC > LIMIT 1 instead of MAX. OK. Sounds sensible. > Also, I fail to see the reason you need to left outer join on Board_Objects, > when as subselect to board_objects is in your WHERE criteria. Presumably, if > there are no records in board_objects, you'll lose the row from person > anyway. > What about doing a straight join to board_objects and simplifing your outer > joins thereby? Good point! Excellent analysis; thanks. Presumably reducing outer joins will in theory reduce the query time? > > > ON o.n_id = timer.boider > > LEFT OUTER JOIN ( > > SELECT > > bo.n_object_id as c_bo_id, > > count(comm.n_id) as c_count > > FROM > > board_objects bo, comments comm > > WHERE > > comm.n_board_object_id = bo.n_id > > AND > > bo.n_board_id IN > > You *definitely* don't want to use IN in this subquery, unless you're on 7.4. > Use EXISTS instead. I'm also unclear on why you need this IN subquery at all. > You're checking board_objects against itself; why not just put it in the main > subselect? Mmm. This bit of the query grabs the count of all comments on all versions of the object that I'm attached to. It disregards comments on board_objects I'm not joined to. To paraphrase: Get count of comments on board_objects where board_objects.object_id is this object id BUT only get objects on boards where I'm a member. I haven't really used EXISTS. Makes sense, thanks. > > > (SELECT > > n_board_id > > FROM > > board_objects > > WHERE > > n_object_id = persobjid > > ) > > GROUP BY > > bo.n_object_id, bo.n_object_id > > ) as counter > > ON o.n_id = counter.c_bo_id > > WHERE > > o.n_creator = pers.n_id > > AND > > o.n_id IN ( > > Once again, through 7.3, use EXISTS, not IN. Yup. > > > SELECT DISTINCT ON (bo.dt_modified, bo.n_object_id) > > bo.n_object_id as bo_obid > > Why do you need a SELECT DISTINCT if you're just checking for existence? I found that a normal query was finding more copies of an object because an object can be reflected several times in board_objects. > > > FROM > > board_objects bo, board_object_members bom > > WHERE > > bo.n_id = bom.n_board_object_id > > AND > > bom.n_person_id = personid > > ) > > ORDER BY > > botimer DESC > > LIMIT > > locallimit > > OFFSET > > localoffset LOOP > > Overall, yeah, I think you're making the query too complex. You have 3 > subselects consulting the same table; surely at least 2 of these could be > consolidated. Going through it I'm not sure the aims of the query subselects are all that wrong (i.e. most recent object I'm attached to; sum of comments on this object that I'm attached to but only in the context of the boards I belong to; ensuring that I'm WHEREing only objects that I can access). Although I could be horribly wrong :-). However your analysis has made me think much more deeply about how the joins work and whether or not I need left outer joins. More than that, the difficulty I've had explaining the database structure to you means that I should carefully reconsider it. Many thanks for the time you've taken to answer my question. Rory -- Rory Campbell-Lange <rory@campbell-lange.net> <www.campbell-lange.net>
Rory, > However your analysis has made me think much more deeply about how the > joins work and whether or not I need left outer joins. More than that, > the difficulty I've had explaining the database structure to you means > that I should carefully reconsider it. Sure. Although sometimes a very good, well-normalized structure is impossible to explain without a clearer diagram than you can do in ASCII -- and without a clear understanding of what the database is for. -- Josh Berkus Aglio Database Solutions San Francisco