Re: [INTERFACES] Re: [HACKERS] changes in 6.4 - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: [INTERFACES] Re: [HACKERS] changes in 6.4
Date
Msg-id 199808290344.XAA28089@candle.pha.pa.us
Whole thread Raw
In response to Re: [INTERFACES] Re: [HACKERS] changes in 6.4  (David Hartwig <daybee@bellatlantic.net>)
List pgsql-hackers
>
>
> Bruce Momjian wrote:
>
> > >
> > > Hannu Krosing wrote:
> > >
> > > > > The days where every release fixed server crashes, or added a feature
> > > > > that users were 'screaming for' may be a thing of the past.
> > > >
> > > > Is anyone working on fixing the exploding optimisations for many OR-s,
> > > > at least the canonic case used by access?
> > > >
> > > > My impression is that this has fallen somewhere between
> > > > insightdist and Vadim.
> > >
> > > This is really big for the ODBCers. (And I suspect for JDBCers too.)  Many
> > > desktop libraries and end-user tools depend on this "record set" strategy to
> > > operate effectively.
> > >
> > > I have put together a workable hack that runs just before cnfify().  The
> > > option is activated through the SET command.  Once activated, it identifies
> > > queries with this particular multi-OR pattern generated by these RECORD SET
> > > strategies.  Qualified query trees are rewritten as multiple UNIONs.   (One
> > > for each OR grouping).
> > >
> > > The results are profound.    Queries that used to scan tables because of the
> > > ORs, now make use of any indexes.   Thus, the size of the table has virtually
> > > no effect on performance.  Furthermore, queries that used to crash the
> > > backend, now run in under a second.
> > >
> > > Currently the down sides are:
> > >     1. If there is no usable index, performance is significantly worse.  The
> > > patch does not check to make sure that there is a usable index.  I could use
> > > some pointers on this.
> > >
> > >     2. Small tables are actually a bit slower than without the patch.
> > >
> > >     3.  Not very elegant.    I am looking for a more generalized solution.
> > > I have lots of ideas, but I would need to know the backend much better before
> > > attempting any of them.   My favorite idea is before cnfify(), to factor the
> > > OR terms and pull out the constants into a virtual (temporary) table spaces.
> > > Then rewrite the query as a join.   The optimizer will (should) treat the new
> > > query accordingly.  This assumes that an efficient factoring algorithm exists
> > > and that temporary tables can exist in the heap.
> > >
> > > Illustration:
> > > SELECT ... FROM tab WHERE
> > > (var1 = const1 AND var2 = const2) OR
> > > (var1 = const3 AND var2 = const4) OR
> > > (var1 = const5 AND var2 = const6)
> > >
> > > SELECT ... FROM tab, tmp WHERE
> > > (var1 = var_x AND var2 = var_y)
> > >
> > > tmp
> > > var_x  | var_y
> > > --------------
> > > const1|const2
> > > const3|const4
> > > const5|const6
> >
> > David, where are we on this?  I know we have OR's using indexes.  Do we
> > still need to look this as a fix, or are we OK.   I have not gotten far
> > enough in the optimizer to know how to fix the
>
> Bruce,
>
> If the question is, have I come up with a solution for the cnf'ify problem:  No
>
> If the question is, is it still important:  Very much yes.
>
> It is essential for many RAD tools using remote data objects which make use of key
> sets.  Your recent optimization of the OR list goes a long way, but inevitably
> users are confronted with multi-part keys.
>
> When I look at the problem my head spins.   I do not have the experience (yet?)
> with the backend to be mucking around in the optimizer.  As I see it, cnf'ify is
> doing just what it is supposed to do.  Boundless boolean logic.
>
> I think hope may lay though, in identifying each AND'ed group associated with a key
> and tagging it as a special sub-root node which cnf'ify does not penetrate.   This
> node would be allowed to pass to the later stages of the optimizer where it will be
> used to plan index scans.  Easy for me to say.
>
> In the meantime, I still have the patch that I described in prior email.  It has
> worked well for us.  Let me restate that.   We could not survive without it!
> However, I do not feel that is a sufficiently functional approach that should be
> incorporated as a final solution.     I will submit the patch if you, (anyone) does
> not come up with a better solution.  It is coded to be activated by a SET KSQO to
> minimize its reach.
>
>

OK, let me try this one.

Why is the system cnf'ifying the query.  Because it  wants to have a
list of qualifications that are AND'ed, so it can just pick the most
restrictive/cheapest, and evaluate that one first.

If you have:

    (a=b and c=d) or e=1

In this case, without cnf'ify, it has to evaluate both of them, because
if one is false, you can't be sure another would be true.  In the
cnf'ify case,

    (a=b or e=1) and (c=d or e=1)

In this case, it can choose either, and act on just one, if a row fails
to meet it, it can stop and not evaluate it using the other restriction.

The fact is that it is only going to use fancy join/index in one of the
two cases, so it tries to pick the best one, and does a brute-force
qualification test on the remaining item if the first one tried is true.

The problem is of course large where clauses can exponentially expand
this.  What it really trying to do is to pick a cheapest restriction,
but the memory explosion and query failure are serious problems.

The issue is that it thinks it is doing something to help things, while
it is actually hurting things.

In the ODBC case of:

    (x=3 and y=4) or
    (x=3 and y=5) or
    (x=3 and y=6) or ...

it clearly is not going to gain anything by choosing any CHEAPEST path,
because they are all the same in terms of cost, and the use by ODBC
clients is hurting reliability.

I am inclined to agree with David's solution of breaking apart the query
into separate UNION queries in certain cases.  It seems to be the most
logical solution, because the cnf'ify code is working counter to its
purpose in these cases.

Now, the question is how/where to implement this.  I see your idea of
making the OR a join to a temp table that holds all the constants.
Another idea would be to do actual UNION queries:

    SELECT * FROM tab
    WHERE (x=3 and y=4)
    UNION
    SELECT * FROM tab
    WHERE (x=3 and y=5)
    UNION
    SELECT * FROM tab
    WHERE (x=3 and y=6) ...

This would work well for tables with indexes, but for a sequential scan,
you are doing a sequential scan for each UNION.  Another idea is
subselects.  Also, you have to make sure you return the proper rows,
keeping duplicates where they are in the base table, but not returning
them when the meet more than one qualification.

    SELECT * FROM tab
    WHERE (x,y) IN (SELECT 3, 4
            UNION
            SELECT 3, 5
            UNION
            SELECT 3, 6)

I believe we actually support this.  This is not going to use an index
on tab, so it may be slow if x and y are indexed.

Another more bizarre solution is:

    SELECT * FROM tab
    WHERE (x,y) = (SELECT 3, 4) OR
          (x,y) = (SELECT 3, 5) OR
          (x,y) = (SELECT 3, 6)

Again, I think we do this too.  I don't think cnf'ify does anything with
this.  I also believe "=" uses indexes on subselects, while IN does not
because IN could return lots of rows, and an index is slower than a
non-index join on lots of rows.  Of course, now that we index OR's.

Let me ask another question.  If I do:

    SELECT * FROM tab WHERE x=3 OR x=4

it works, and uses indexes.  Why can't the optimizer just not cnf'ify
things sometimes, and just do:

    SELECT * FROM tab
    WHERE    (x=3 AND y=4) OR
        (x=3 AND y=5) OR
        (x=3 AND y=6)

Why can it handle x=3 OR x=4, but not the more complicated case above,
without trying to be too smart?  If x,y is a multi-key index, it could
use that quite easily.  If not, it can do a sequentail scan and run the
tests.

Another issue.  To the optimizer, x=3 and x=y are totally different.  In
x=3, it is a column compared to a constant, while in x=y, it is a join.
That makes a huge difference.

In the case of (a=b and c=d) or e=1, you pick the best path and do the
a=b join, and throw in the e=1 entries.  You can't easily do both joins,
because you also need the e=1 stuff.

I wounder what would happen if we prevent cnf'ifying of cases where the
OR represent only column = constant restrictions.

I meant to really go through the optimizer this month, but other backend
items took my time.

Can someone run some tests on disabling the cnf'ify calls.  It is my
understanding that with the non-cnf-ify'ed query, it can't choose an
optimial path, and starts to do either straight index matches,
sequential scans, or cartesian products where it joins every row to
every other row looking for a match.

Let's say we turn off cnf-ify just for non-join queries.  Does that
help?

I am not sure of the ramifications of telling the optimizer it no longer
has a variety of paths to choose for evaluating the query.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [HACKERS] odd pg_dump output?
Next
From: Bruce Momjian
Date:
Subject: Re: PostgreSQL under BSD/OS