Thread: Inconsistant SQL results - Suspected error with query planing or query optimisation.


Hay all :)

Firstly, just wanted to say how much I appreceate all the hard work that has gorn into postgres. It really is a fine system.

Anyway, mybug: I have a test SELECT statement (Listing A - see "sql listings.txt") wich produces different results under two simular setups (Listing B and Listing C). Each setup should product the same result for the given SELECT statement.

In the first setup (Listing B) the table "items" has 10,000 records id'ed from 1 to 10,000.  The field "number" for every record except two records (id:500 and 600) has the value NULL., the two exceptions (id: 500 and 600) have the value 1.
This first setup produces the correct results for the given SELECT statement.

The SELECT statement is essentlually a dubble negitive - using two LEFT JOINS.
 - The most inner nested query selects WHERE number = 1
 -  Then next most inner selects everything but, and
 -  The most outer selects every thing but everything but WHERE number = 1

The second setup (Listing C) is identicle to the first execpt that the table "items" has an extra field and a primary key index. The goal of this setup is to produce a cirtian query plan that I beleive is broken, where it seems that the "Nested Loop Left Join" has forced the filter for "WHERE number = 1" outside or (perhaps after) a join one of the more nested joins causeing that  more nested join to cancel it self out.

Well.... that's what I make of it.   I really hope you are able to reproduce this, it took me ages to find a setup that would be reproduce what I was observing in my developement system.

See the attached output.txt for the results and the EXPLAIN statement for the SELECT statement under each setup.

Postgres:
        "PostgreSQL 8.2.3 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)"

Hardware:
        AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
        2GB Mem

OS:
        Kernel:  Linux version 2.6.18-4-amd64 (Debian 2.6.18.dfsg.1-12)
        OS: GNU/Linux Debian "Etch" - AMD64




Kind regards,
Adam Terrey

Data Systems - Marketing Strategy and Planning
University Relations Directorate
Australian Catholic University Limited (ABN 15 050 192 660)

42 Edward Street, North Sydney NSW 2060
PO Box 968 North Sydney NSW 2059
Telephone 02 9739 2517
Facsimile 02 9739 2960
Mobile    0421 430 824
Email: adam.terrey@acu.edu.au
http://www.acu.edu.au

Australian Catholic University and the courses offered by the University are
registered on the Commonwealth Register of Institutions and Courses for
Overseas Students (CRICOS). Provider registration codes: 00004G, 0012C,
Attachment
adam terrey wrote:

> The second setup (Listing C) is identicle to the first execpt that the
> table "items" has an extra field and a primary key index. The goal of this
> setup is to produce a cirtian query plan that I beleive is broken, where
> it seems that the "Nested Loop Left Join" has forced the filter for "WHERE
> number = 1" outside or (perhaps after) a join one of the more nested joins
> causeing that  more nested join to cancel it self out.

It's easy to confirm that the nested loop is the culprit here: if you
SET enable_nestloop to off, the query returns different results (the
expected two tuples).

8.1 seems to work OK, but both 8.2 and HEAD don't.

alvherre=# set enable_nestloop to off;
SET
alvherre=#  SELECT items.id
FROM items
LEFT JOIN (
        -- Query i.
        SELECT  items.id
        FROM items
        LEFT JOIN (
                -- Query ii.
                SELECT id FROM items WHERE number = 1
        ) AS moded_items USING (id)
        WHERE moded_items.id IS NULL
) AS sub_items USING (id)
WHERE sub_items.id IS NULL;
 id
-----
 500
 600
(2 rows)

alvherre=# set enable_nestloop to on;
SET
alvherre=#  SELECT items.id
FROM items
LEFT JOIN (
        -- Query i.
        SELECT  items.id
        FROM items
        LEFT JOIN (
                -- Query ii.
                SELECT id FROM items WHERE number = 1
        ) AS moded_items USING (id)
        WHERE moded_items.id IS NULL
) AS sub_items USING (id)
WHERE sub_items.id IS NULL;
 id
----
(0 rows)


--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
adam terrey <a.terrey@mackillop.acu.edu.au> writes at
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00187.php

> Anyway, mybug: I have a test SELECT statement (Listing A - see "sql 
> listings.txt") wich produces different results under two simular setups 
> (Listing B and Listing C). Each setup should product the same result for 
> the given SELECT statement.

The problem here is that 8.2 is incorrectly concluding that it can
rearrange the order of the two LEFT JOIN steps in the query:

SELECT a.id
FROM items a
LEFT JOIN (SELECT     b.idFROM items bLEFT JOIN (    SELECT c.id FROM items c WHERE number = 1) AS moded_items USING
(id)WHEREmoded_items.id IS NULL
 
) AS sub_items USING (id)
WHERE sub_items.id IS NULL;

The plan it comes up with is:
Nested Loop Left Join  (cost=469.00..1063.39 rows=1 width=4) (actual time=288.962..288.962 rows=0 loops=1)  Filter:
(c.idIS NULL)  ->  Hash Left Join  (cost=469.00..1063.00 rows=1 width=8) (actual time=288.946..288.946 rows=0 loops=1)
     Hash Cond: (a.id = b.id)        Filter: (b.id IS NULL)        ->  Seq Scan on items a  (cost=0.00..344.00
rows=10000width=4) (actual time=0.080..50.973 rows=10000 loops=1)        ->  Hash  (cost=344.00..344.00 rows=10000
width=4)(actual time=140.880..140.880 rows=10000 loops=1)              ->  Seq Scan on items b  (cost=0.00..344.00
rows=10000width=4) (actual time=0.046..69.395 rows=10000 loops=1)  ->  Index Scan using items_pkey on items c
(cost=0.00..0.38rows=1 width=4) (never executed)        Index Cond: (b.id = c.id)        Filter: (c.number = 1)
 

After reducing join_collapse_limit to 1, we get the right join order and
the right answers:
Hash Left Join  (cost=750.54..1132.05 rows=1 width=4) (actual time=409.712..409.740 rows=2 loops=1)  Hash Cond: (a.id =
b.id) Filter: (b.id IS NULL)  ->  Seq Scan on items a  (cost=0.00..344.00 rows=10000 width=4) (actual
time=0.100..51.052rows=10000 loops=1)  ->  Hash  (cost=750.52..750.52 rows=1 width=4) (actual time=264.978..264.978
rows=9998loops=1)        ->  Hash Left Join  (cost=369.01..750.52 rows=1 width=4) (actual time=30.074..192.023
rows=9998loops=1)              Hash Cond: (b.id = c.id)              Filter: (c.id IS NULL)              ->  Seq Scan
onitems b  (cost=0.00..344.00 rows=10000 width=4) (actual time=0.030..50.913 rows=10000 loops=1)              ->  Hash
(cost=369.00..369.00rows=1 width=4) (actual time=29.976..29.976 rows=2 loops=1)                    ->  Seq Scan on
itemsc  (cost=0.00..369.00 rows=1 width=4) (actual time=29.896..29.916 rows=2 loops=1)                          Filter:
(number= 1)
 

So there is something wrong with the rule used for deciding whether two
LEFT JOINs can commute.  Per the planner README:

: The planner's treatment of outer join reordering is based on the following
: identities:
: 
: 1.    (A leftjoin B on (Pab)) innerjoin C on (Pac)
:     = (A innerjoin C on (Pac)) leftjoin B on (Pab)
: 
: where Pac is a predicate referencing A and C, etc (in this case, clearly
: Pac cannot reference B, or the transformation is nonsensical).
: 
: 2.    (A leftjoin B on (Pab)) leftjoin C on (Pac)
:     = (A leftjoin C on (Pac)) leftjoin B on (Pab)
: 
: 3.    (A leftjoin B on (Pab)) leftjoin C on (Pbc)
:     = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
: 
: Identity 3 only holds if predicate Pbc must fail for all-null B rows
: (that is, Pbc is strict for at least one column of B).  If Pbc is not
: strict, the first form might produce some rows with nonnull C columns
: where the second form would make those entries null.

What we have here is an invocation of rule 3 in a situation where it's
not appropriate.  The difficulty is that the code is only paying
attention to the syntactical JOIN/ON clauses and has neglected the
intermediate-level WHERE clause.

After a bit of reflection it seems that a WHERE that is semantically
just below a left-join's right side can be treated as if it were part of
that left-join's ON clause.  It will have the same effect as if it had
been written there: any rows rejected by the WHERE will fail to be
joined to the left side and will contribute nothing to the result.
Had we been following this rule, we'd have concluded that c.id IS NULL
is part of the upper join qual, and therefore that it has a predicate
Pabc not just Pab and cannot be commuted with the lower join.

Teaching initsplan.c to do things this way seems possible but less than
trivial.  Before I start worrying about that, does anyone see any flaws
in the reasoning at this level of detail?
        regards, tom lane