I have encountered what seems to be a bad decision made by the query
planner, on a quite simple query. I have attached a shell script which
creates a large (180MB) sql script, to demonstrate the problem. To
reproduce, execute the bash script, and pipe the stdout into psql.
Redirect the stdout of psql somewhere, as it isn't very interesting. psql
produces several EXPLAIN ANALYSE results on its stderr. I have also
attached a copy of the results I get.
The test creates three tables:
jointest3_1 (keya int, infoa int)
jointest3_2 (keya int, keyb int)
jointest3_3 (keyb int, infob int)
It then fills tables jointest3_1 and jointest3_3 each with 1 million rows,
where the key value is sequential (0 to 999999), and the info value is a
random zero or positive integer below 10000.
It then fills table jointest3_2 with 2 million rows, each with random
6-digit integers in both of its columns.
Therefore, table jointest3_2 represents a many-to-many mapping between
jointest3_1 and jointest3_3, and the natural join of all three tables
should have 2 million rows.
The test then creates lots of indexes, clusters on the indexes, and does a
vacuum analyse.
The test does six EXPLAIN ANALYSE commands. There are actually three
distinct commands, each of which is performed twice, to get a
before-cached and after-cached performance indication. The first command
does a natural join of all three tables, and restricts the results on
jointest3_1.infoa. This is performed quickly, as there is an index on
infoa which returns approximately 100 rows.
The second distinct command also does a natural join of all three tables,
but restricts the results by jointest3_3.infob. The query planner appears
to make a bad decision, and the query takes a long time to return.
The third distinct command does the same query as the second distinct
command, but it forces the query planner to join tables jointest3_2 and
jointest3_3 together before joining with jointest3_1 by doing a subselect.
This performs as quickly as the first distinct command.
I would have expected the query planner to find the fast query plan, given
there are only three tables being joined together.
I am using PostgreSQL version 7.2.1, running on Debian unstable/testing,
reasonably up to date. If you need more details, please email back.
select version(); gives the following:
PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.4
PostgreSQL was installed as a Debian package.
The machine is a Athlon XP1900, with 512MB of RAM, a 40GB hard drive.
Kernel: Linux 2.4.19-686
libc6: 2.2.5-14.1
Matthew Wakeling