Re: Optimization rules for semi and anti joins - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: Optimization rules for semi and anti joins
Date
Msg-id 4993CA7E.EE98.0025.0@wicourts.gov
Whole thread Raw
In response to Re: Optimization rules for semi and anti joins  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Optimization rules for semi and anti joins
List pgsql-hackers
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> (A semijoin B on (Pab)) antijoin C on (Pbc)
>> = A semijoin (B antijoin C on (Pbc)) on (Pab)
>  
>> I think this one is true, and it doesn't seem to be mentioned,
>> unless I'm missing something.  It seems potentially useful.
> 
> Hmm, it doesn't seem terribly well-defined --- the values of B are
> indeterminate above the semijoin in the first case, so having Pbc
> refer to them doesn't seem like a good idea.  In particular, it
> seems like in the first case the semijoin could randomly choose a B
> row that has a join partner in C, causing the A row to disappear
> from the result, when the same A row has another B partner that does
> not join to C --- and the second form would find that B partner and
> allow the A row to be output.
I was looking at it from the abstraction that A semijoin B could be
treated as the equivalent of an inner join with duplicate A rows from
the result removed before the final result of the enclosing query.  It
seems you've been interpreting it as meaning the inner join of A to
the first (arbitrarily chosen) row of B found.  It appears that these
two views of it generate the same results for the other identities,
but not this one.
The first case here could be implemented as an inner join which
included (in addition to any columns needed for other purposes) a row
identifier for A and all the columns of B which are needed for the Pbc
predicate.  The antijoin could be performed on that result, after
which duplicate A rows would be eliminated, as well as the row
identifier and the B columns.  A simplified (and only slightly
artificial) example of where this could buy orders of magnitude
improvement in run time follows.
Imagine that A is a Party table with ten million rows.  Imagine that B
and C are sets of rows within a hundred million row CaseHist table
which records events on cases, some of which are associated with
parties.  B represents warrant issuing events, each of which is
related to a party.  C represents warrant disposition events, each of
which is related to a warrant issuing event.  Both tables are indexed
on case number and a sequence number.  Party has a name index.
You've got a name, and you want a list of outstanding warrants for
parties with a matching name.
The second case above would be the natural way to write the query. 
The first case, implemented as I describe, would be orders of
magnitude faster.
-Kevin


pgsql-hackers by date:

Previous
From: Magnus Hagander
Date:
Subject: Re: mingw check hung
Next
From: "BogDan Vatra"
Date:
Subject: Re: SE-PostgreSQL and row level security