Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem) - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)
Date
Msg-id 30756.1489951977@sss.pgh.pa.us
Whole thread Raw
In response to [HACKERS] Re: Improve OR conditions on joined columns (common star schemaproblem)  (Jim Nasby <jim.nasby@openscg.com>)
Responses Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [HACKERS] Re: Improve OR conditions on joined columns (commonstar schema problem)  (Jim Nasby <jim.nasby@openscg.com>)
List pgsql-hackers
Jim Nasby <jim.nasby@openscg.com> writes:
> On 3/16/17 11:54 AM, David Steele wrote:
>> On 2/14/17 4:03 PM, Tom Lane wrote:
>>> Well, the key point is whether it's really OK to de-dup on the basis
>>> of only the CTIDs that are not eliminated in any UNION arm.  I was
>>> feeling fairly good about that until I thought of the full-join-to-
>>> left-join-to-no-join conversion issue mentioned in the comment.

>> Jim, have you had time to think about this?  Any insights?

> Having thought about it, I share Tom's concerns. And now I'm worried 
> about what happens if there are multiple separate OR clauses. I guess 
> those would be handled by separate UNIONs?

As proposed, the patch would try to optimize by splitting each OR clause
independently, and would choose whichever way gave the best cost estimate.
I'm not sure it's possible to do better than that, and even if it is,
I think improving it could be left for later.

> Tom, could you expand the description some, especially with some examples?

Here's a counterexample --- admittedly rather artificial, but still a
counterexample --- to applying the optimization when there are FULL joins.
Consider
SELECT count(*)FROM a FULL JOIN b ON (a.id = b.id)WHERE (a.x = 42 OR b.y = 43);

and suppose that a and b have mutual FK constraints guaranteeing that
every non-null a.id value has exactly one match in b and vice versa.  (For
the purposes of this example, I think it doesn't matter whether or not we
allow there to be null id values.)  Also suppose that there are some join
rows satisfying both a.x = 42 and b.y = 43, while other join rows satisfy
only one of the OR arms.  The patch would try to implement this query as,
approximately,
SELECT count(*)FROM( SELECT FROM a FULL JOIN b ON (a.id = b.id) WHERE a.x = 42  UNION-using-ctids  SELECT FROM a FULL
JOINb ON (a.id = b.id) WHERE b.y = 43 )
 

Now in the first UNION arm, we'd observe that the strict a.x = 42
condition allows reduction of the join strength to LEFT JOIN, and then
we'd deduce from the FK on b that the join to b can be dropped altogether.
In the second arm, we'd similarly conclude that a can be dropped
altogether, leaving
SELECT count(*)FROM( SELECT FROM a WHERE a.x = 42  UNION-using-ctids  SELECT FROM b WHERE b.y = 43 )

But at this point there are no rels that are not eliminated in any UNION
arm, so that no de-duping would occur in the UNION, meaning that we'd
double-count any join rows in which both a.x = 42 and b.y = 43.

I'd also considered an approach of de-duping on the basis of all relation
ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
in which the rel was eliminated entirely.  But that doesn't fix it,
because in this example the first arm would return (a.ctid, NULL) while
the second arm would return (NULL, b.ctid), so that the UNION step would
still fail to detect any duplication.  To make it work, we'd have to not
eliminate the joins, which would pretty much defeat the usefulness of
the optimization for your original example case.

So full joins definitely break this whole optimization.  I think it's okay
with left joins though, because when starting from "a left join b" it will
never be possible to remove "a" so we'll always include a.ctid in the
UNION de-duping step.  If b is removed in some arm, then it must be true
that we get exactly one left-join output row per a row, regardless of the
contents of b, in that arm.  The argument for the patch being okay is
essentially that we must get exactly one left-join output row per a row,
regardless of the contents of b, in *every* arm, because the various
modified versions of the OR clause can't affect that conclusion.  In some
of the arms we might not remove b, and we might even be able to reduce the
left join to an inner join, but there should still be no more than one
join output row per a row.  That being the case, it should be sufficient
to de-dup using a.ctid while ignoring b.ctid.

Any clearer yet?
        regards, tom lane



pgsql-hackers by date:

Previous
From: Michael Banck
Date:
Subject: Re: [HACKERS] Create replication slot in pg_basebackup if requestedand not yet present
Next
From: Andreas Karlsson
Date:
Subject: Re: [HACKERS] Removing binaries (was: createlang/droplang deprecated)