Thread: improving a badly optimized query
A query has surprised me by the amount of time it takes, and it seems that PostgreSQL is performing insufficient optimization. To make the example simple, consider the following database: CREATE TABLE role_keys ( role SERIAL PRIMARY KEY ); CREATE TABLE role_person ( role INTEGER UNIQUE NOT NULL REFERENCES role_keys, person INTEGER NOT NULL ); CREATE INDEX role_person_index ON role_person (person); CREATE VIEW roles AS SELECT role_keys.role, person FROM role_keys NATURAL LEFT JOIN role_person; Having populated these tables, I attempted the following query: SELECT * FROM roles WHERE person = 28389; It turns out that this query - equivalent to query (a) shown below - takes more than ten times the amount of time required by query (b), despite being guaranteed to give exactly the same result! (a) (slow) SELECT * FROM role_keys NATURAL LEFT JOIN role_person WHERE person = 28389; (b) (fast) SELECT * FROM role_keys NATURAL JOIN role_person WHERE person = 28389; Apparently PostgreSQL does not realize that the rows created for unmatched role_keys rows by the LEFT JOIN are guaranteed to be thrown out by the WHERE clause (their `person' fields will be null). Because of this it reads through the entire role_keys table: (a) (when run with EXPLAIN) Merge Join (cost=0.00..3990.83 rows=67524 width=12) -> Index Scan using role_keys_pkey on role_keys (cost=0.00..1280.67 rows=67524 width=4) -> Index Scan using role_person_role_key on role_person (cost=0.00..1359.68 rows=67525 width=8) (b) (when run with EXPLAIN) Nested Loop (cost=0.00..6.91 rows=1 width=12) -> Index Scan using role_person_index on role_person (cost=0.00..3.02 rows=1 width=8) -> Index Scan using role_keys_pkey on role_keys (cost=0.00..3.01 rows=1 width=4) It is not obvious to me where in PostgreSQL's optimization routine to insert the intelligence to reduce this from a `LEFT JOIN' to a `JOIN'. Has anyone else had to deal with this case? The VIEW itself must be a LEFT JOIN because I need all roles to appear when I query the view; but I will frequently need to do queries like the above, and would like to avoid either (a) having to create a separate view for each combination of fields on which I might search, or (b) querying using the raw database tables since I would like the actualy design hidden from my business logic. Thanks for any ideas, -- Brandon Craig Rhodes http://www.rhodesmill.org/brandon Georgia Tech brandon@oit.gatech.edu
Brandon Craig Rhodes <brandon@oit.gatech.edu> writes: > (a) (slow) > SELECT * FROM role_keys NATURAL LEFT JOIN role_person > WHERE person = 28389; > (b) (fast) > SELECT * FROM role_keys NATURAL JOIN role_person > WHERE person = 28389; > Apparently PostgreSQL does not realize that the rows created for > unmatched role_keys rows by the LEFT JOIN are guaranteed to be thrown > out by the WHERE clause (their `person' fields will be null). > [ and hence the LEFT JOIN could be reduced to a JOIN ] Hmm ... you are right, there is no such logic in there. It seems like a useful optimization, but I have an uncomfortable feeling that there's something wrong with it. Can you point to a rigorous proof that this is okay in complicated contexts such as nested outer joins? regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > Brandon Craig Rhodes <brandon@oit.gatech.edu> writes: > > (a) (slow) > > SELECT * FROM role_keys NATURAL LEFT JOIN role_person > > WHERE person = 28389; > > > (b) (fast) > > SELECT * FROM role_keys NATURAL JOIN role_person > > WHERE person = 28389; > > > ... the rows created for unmatched role_keys rows by the LEFT JOIN > > are guaranteed to be thrown out by the WHERE clause ... [so] the > > LEFT JOIN could be reduced to a JOIN > > It seems like a useful optimization, but I have an uncomfortable > feeling that there's something wrong with it. Can you point to a > rigorous proof that this is okay in complicated contexts such as > nested outer joins? We can optimize the above query simply by observing that the result of a LEFT JOIN includes both the rows that would have been produced by a simple JOIN, and those rows of the left table that did not match any from the right. That is: SELECT * FROM role_keys NATURAL LEFT JOIN role_person WHERE person = 28389; is the equivalent of: SELECT * FROM role_keys NATURAL JOIN role_person WHERE person = 28389 UNION SELECT * FROM (SELECT role, CAST(NULL AS integer) AS person FROM role_keys) AS rp WHERE person = 28389 EXCEPT SELECT role, CAST(NULL AS integer) AS person FROM role_person; Since the parenthesized sub-select generates only NULL person fields, the WHERE condition will yield an empty result, so the entire query can be reduced to: SELECT * FROM role_keys NATURAL JOIN role_person WHERE person = 28389; Similar technique will yield transforms for RIGHT and OUTER JOINs that can reveal similar optimizations. I should state our original intention: We wanted to gain simplicity and save space by inserting rows into role_person only for those roles associated with people, leaving all other roles unmentioned; but obviously a view that contains all roles and mentions their people requires a LEFT JOIN, which kills query performance. There are three alternatives: (1) Mention every role in the role_person table, using NULLs for those roles without associated people, and then our view can be implemented with a simple JOIN rather than an expensive LEFT JOIN. (This wastes space and we find it logically unattractive, as well as more difficult to maintain.) (2) Build queries on top of the actual tables, instead of the view, choosing for each query the least sufficient JOIN. (This destroys the usefulness of views, and makes our queries more complex since we must hand-craft each JOIN to fit the query.) (3) Include the above optimization into at least our own copy of PostgreSQL. (This will be our first experience dealing directly with the query optimizer, and we are not sure where to begin.) Please advise, -- Brandon Craig Rhodes http://www.rhodesmill.org/brandon Georgia Tech brandon@oit.gatech.edu
Brandon Craig Rhodes <brandon@oit.gatech.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> It seems like a useful optimization, but I have an uncomfortable >> feeling that there's something wrong with it. Can you point to a >> rigorous proof that this is okay in complicated contexts such as >> nested outer joins? > We can optimize the above query simply by observing that the result of > a LEFT JOIN includes both the rows that would have been produced by a > simple JOIN, and those rows of the left table that did not match any > from the right. [snip] You didn't answer my question: when there are *nested* outer joins, how does this transformation apply? Can a clause from WHERE or an upper-level JOIN/ON clause be pushed down past one outer join and into another? regards, tom lane