Thread: improving a badly optimized query

improving a badly optimized query

From: Brandon Craig Rhodes
Date:
A query has surprised me by how long it takes, and it seems that
PostgreSQL is missing an optimization opportunity.  To make the
example simple, consider the following database:

        CREATE TABLE role_keys (
                role SERIAL PRIMARY KEY
        );
        CREATE TABLE role_person (
                role INTEGER UNIQUE NOT NULL REFERENCES role_keys,
                person INTEGER NOT NULL
        );
        CREATE INDEX role_person_index ON role_person (person);
        CREATE VIEW roles AS SELECT
                role_keys.role, person
                FROM role_keys NATURAL LEFT JOIN role_person;

Having populated these tables, I attempted the following query:

        SELECT * FROM roles WHERE person = 28389;

It turns out that this query - equivalent to query (a) shown below -
takes more than ten times as long as query (b), despite being
guaranteed to give exactly the same result!

(a) (slow)
        SELECT * FROM role_keys NATURAL LEFT JOIN role_person
                WHERE person = 28389;

(b) (fast)
        SELECT * FROM role_keys NATURAL JOIN role_person
                WHERE person = 28389;
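As a quick sanity check of the claimed equivalence, here is a small
reproduction - NOT the original setup, but SQLite via Python's sqlite3,
with SERIAL replaced by INTEGER PRIMARY KEY and a made-up three-row
data set:

```python
# A minimal sketch: SQLite stands in for PostgreSQL, sample data invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE role_keys (role INTEGER PRIMARY KEY);
    CREATE TABLE role_person (
        role INTEGER UNIQUE NOT NULL REFERENCES role_keys,
        person INTEGER NOT NULL
    );
    CREATE INDEX role_person_index ON role_person (person);
    INSERT INTO role_keys (role) VALUES (1), (2), (3);
    INSERT INTO role_person (role, person) VALUES (1, 28389), (2, 77);
    -- role 3 has no person, so the LEFT JOIN pads it with person = NULL
""")

slow = conn.execute("""
    SELECT * FROM role_keys NATURAL LEFT JOIN role_person
        WHERE person = 28389
""").fetchall()

fast = conn.execute("""
    SELECT * FROM role_keys NATURAL JOIN role_person
        WHERE person = 28389
""").fetchall()

# The NULL-padded row for role 3 fails the WHERE clause either way,
# so both queries return exactly the one matching row.
assert slow == fast == [(1, 28389)]
```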

Apparently PostgreSQL does not realize that the rows created for
unmatched role_keys rows by the LEFT JOIN are guaranteed to be thrown
out by the WHERE clause (their `person' fields will be null).  Because
of this it reads through the entire role_keys table:

(a) (when run with EXPLAIN)
        Merge Join  (cost=0.00..3990.83 rows=67524 width=12)
        ->  Index Scan using role_keys_pkey on role_keys
                (cost=0.00..1280.67 rows=67524 width=4)
        ->  Index Scan using role_person_role_key on role_person
                (cost=0.00..1359.68 rows=67525 width=8)

(b) (when run with EXPLAIN)
        Nested Loop  (cost=0.00..6.91 rows=1 width=12)
        ->  Index Scan using role_person_index on role_person
                (cost=0.00..3.02 rows=1 width=8)
        ->  Index Scan using role_keys_pkey on role_keys
                (cost=0.00..3.01 rows=1 width=4)

It is not obvious to me where in PostgreSQL's optimizer to insert the
intelligence to reduce this `LEFT JOIN' to a plain `JOIN'.  Has anyone
else had to deal with this case?

The VIEW itself must use a LEFT JOIN because I need all roles to appear
when I query the view; but I will frequently need to run queries like
the one above, and would like to avoid either creating a separate view
for each combination of fields on which I might search, or querying
the raw tables directly, since I would like the actual table design
hidden from my business logic.

Thanks for any ideas,
--
Brandon Craig Rhodes                         http://www.rhodesmill.org/brandon
Georgia Tech                                            brandon@oit.gatech.edu

Re: improving a badly optimized query

From: Tom Lane
Date:
Brandon Craig Rhodes <brandon@oit.gatech.edu> writes:
> (a) (slow)
>         SELECT * FROM role_keys NATURAL LEFT JOIN role_person
>                 WHERE person = 28389;

> (b) (fast)
>         SELECT * FROM role_keys NATURAL JOIN role_person
>                 WHERE person = 28389;

> Apparently PostgreSQL does not realize that the rows created for
> unmatched role_keys rows by the LEFT JOIN are guaranteed to be thrown
> out by the WHERE clause (their `person' fields will be null).
> [ and hence the LEFT JOIN could be reduced to a JOIN ]

Hmm ... you are right, there is no such logic in there.  It seems like
a useful optimization, but I have an uncomfortable feeling that there's
something wrong with it.  Can you point to a rigorous proof that this is
okay in complicated contexts such as nested outer joins?

            regards, tom lane

Re: improving a badly optimized query

From: Brandon Craig Rhodes
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Brandon Craig Rhodes <brandon@oit.gatech.edu> writes:
> > (a) (slow)
> >         SELECT * FROM role_keys NATURAL LEFT JOIN role_person
> >                 WHERE person = 28389;
>
> > (b) (fast)
> >         SELECT * FROM role_keys NATURAL JOIN role_person
> >                 WHERE person = 28389;
>
> > ... the rows created for unmatched role_keys rows by the LEFT JOIN
> > are guaranteed to be thrown out by the WHERE clause ...  [so] the
> > LEFT JOIN could be reduced to a JOIN
>
> It seems like a useful optimization, but I have an uncomfortable
> feeling that there's something wrong with it.  Can you point to a
> rigorous proof that this is okay in complicated contexts such as
> nested outer joins?

We can optimize the above query simply by observing that the result of
a LEFT JOIN includes both the rows that would have been produced by a
simple JOIN, and those rows of the left table that did not match any
from the right.  That is:

  SELECT * FROM role_keys NATURAL LEFT JOIN role_person WHERE person = 28389;

is the equivalent of:

  SELECT * FROM role_keys NATURAL JOIN role_person WHERE person = 28389
  UNION
  (SELECT * FROM
     (SELECT role, CAST(NULL AS integer) AS person FROM role_keys) AS rp
     WHERE person = 28389
   EXCEPT
   SELECT role, CAST(NULL AS integer) AS person FROM role_person);

(The parentheses matter: UNION and EXCEPT have equal precedence and
associate to the left, so without them the EXCEPT would apply to the
whole union rather than to the NULL-extended branch.)

Since the parenthesized sub-select generates only NULL person fields,
the WHERE condition will yield an empty result, so the entire query
can be reduced to:

  SELECT * FROM role_keys NATURAL JOIN role_person WHERE person = 28389;

A similar technique yields transforms for RIGHT and FULL OUTER JOINs
that can reveal similar optimizations.
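The decomposition argument can be checked mechanically; the sketch
below (again SQLite via Python's sqlite3, with made-up data, not the
original PostgreSQL database) verifies that the NULL-extended branch
is empty and that the two joins therefore agree:

```python
# A mechanical check of the decomposition above; SQLite, invented data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE role_keys (role INTEGER PRIMARY KEY);
    CREATE TABLE role_person (
        role INTEGER UNIQUE NOT NULL,
        person INTEGER NOT NULL
    );
    INSERT INTO role_keys (role) VALUES (1), (2), (3);
    INSERT INTO role_person (role, person) VALUES (1, 28389), (2, 77);
""")

# The unmatched-rows branch: NULL-extended role_keys rows, filtered by
# the WHERE clause, minus the roles that do have a person.
branch = conn.execute("""
    SELECT * FROM
        (SELECT role, CAST(NULL AS INTEGER) AS person FROM role_keys) AS rp
        WHERE person = 28389
    EXCEPT
    SELECT role, CAST(NULL AS INTEGER) AS person FROM role_person
""").fetchall()
assert branch == []  # NULL = 28389 is never true, so the branch is empty

# Hence LEFT JOIN and plain JOIN give the same answer under this WHERE.
left = conn.execute("""
    SELECT * FROM role_keys NATURAL LEFT JOIN role_person
        WHERE person = 28389
""").fetchall()
inner = conn.execute("""
    SELECT * FROM role_keys NATURAL JOIN role_person
        WHERE person = 28389
""").fetchall()
assert left == inner == [(1, 28389)]
```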

I should state our original intention:

We wanted to gain simplicity and save space by inserting rows into
role_person only for those roles associated with people, leaving all
other roles unmentioned; but obviously a view that contains all roles
and mentions their people requires a LEFT JOIN, which kills query
performance.  There are three alternatives:

  (1) Mention every role in the role_person table, using NULLs for
  those roles without associated people, and then our view can be
  implemented with a simple JOIN rather than an expensive LEFT JOIN.
  (This wastes space and we find it logically unattractive, as well as
  more difficult to maintain.)

  (2) Build queries on top of the actual tables, instead of the view,
  choosing for each query the cheapest JOIN that suffices.  (This
  destroys the usefulness of views, and makes our queries more complex
  since we must hand-craft each JOIN to fit the query.)

  (3) Include the above optimization into at least our own copy of
  PostgreSQL.  (This will be our first experience dealing directly
  with the query optimizer, and we are not sure where to begin.)

Please advise,
--
Brandon Craig Rhodes                         http://www.rhodesmill.org/brandon
Georgia Tech                                            brandon@oit.gatech.edu

Re: improving a badly optimized query

From: Tom Lane
Date:
Brandon Craig Rhodes <brandon@oit.gatech.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> It seems like a useful optimization, but I have an uncomfortable
>> feeling that there's something wrong with it.  Can you point to a
>> rigorous proof that this is okay in complicated contexts such as
>> nested outer joins?

> We can optimize the above query simply by observing that the result of
> a LEFT JOIN includes both the rows that would have been produced by a
> simple JOIN, and those rows of the left table that did not match any
> from the right. [snip]

You didn't answer my question: when there are *nested* outer joins, how
does this transformation apply?  Can a clause from WHERE or an
upper-level JOIN/ON clause be pushed down past one outer join and into
another?
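One observation that bears on the question: the reduction is only
defensible when the WHERE clause is strict, i.e. can never succeed on
the NULL-extended rows.  A small counterexample (a sketch in SQLite
via Python's sqlite3, with made-up data) shows a non-strict predicate
breaking the transformation:

```python
# A counterexample sketch: a predicate that accepts NULL, such as
# COALESCE(person, 28389) = 28389, keeps the NULL-padded rows, so the
# LEFT JOIN can no longer be replaced by a plain JOIN.  SQLite, not
# PostgreSQL; data invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE role_keys (role INTEGER PRIMARY KEY);
    CREATE TABLE role_person (
        role INTEGER UNIQUE NOT NULL,
        person INTEGER NOT NULL
    );
    INSERT INTO role_keys (role) VALUES (1), (2), (3);
    INSERT INTO role_person (role, person) VALUES (1, 28389), (2, 77);
""")

left = conn.execute("""
    SELECT * FROM role_keys NATURAL LEFT JOIN role_person
        WHERE COALESCE(person, 28389) = 28389 ORDER BY role
""").fetchall()
inner = conn.execute("""
    SELECT * FROM role_keys NATURAL JOIN role_person
        WHERE COALESCE(person, 28389) = 28389 ORDER BY role
""").fetchall()

assert left == [(1, 28389), (3, None)]  # unmatched role 3 survives
assert inner == [(1, 28389)]            # the reduction would lose it
```

So strictness of the pushed-down clause looks like a necessary
condition; whether it is also sufficient under arbitrary nesting is
exactly the open question.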

            regards, tom lane