Re: index problems (again) - Mailing list pgsql-general

From Peter J. Holzer
Subject Re: index problems (again)
Date
Msg-id 20160312184338.GA20873@rorschach.hjp.at
In response to Re: index problems (again)  (Geoff Winkless <pgsqladmin@geoff.dj>)
Responses Re: index problems (again)  (Geoff Winkless <pgsqladmin@geoff.dj>)
List pgsql-general
On 2016-03-08 10:16:57 +0000, Geoff Winkless wrote:
> On 7 March 2016 at 20:40, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > As Tom wrote, the estimate of having to read only about 140 rows is only
> > valid if sc_id and sc_date are uncorrelated. In reality your query has
> > to read a lot more than 140 rows, so it is much slower.
>
> But as I've said previously, even if I do select from scdate values
> that I know to be in the first 1% of the data (supposedly the perfect
> condition) the scan method is insignificantly quicker than the index
> (scdate,scid) method.

Actually the planner expects to find a match within the first 0.0035 %, so
to find out how fast that would be you would have to use a value from
that range.

> Even with the absolute perfect storm (loading in the entire index for
> the full range) it's still not too bad (1.3 seconds or so).
>
> The point is that to assume, knowing nothing about the data, that the
> data is in an even distribution is only a valid strategy if the worst
> case (when that assumption turns out to be wildly incorrect) is not
> catastrophic. That's not the case here.

True. The fundamental problem here is that the planner doesn't have any
notion of a worst case. It only knows "cost", and that is a single
number for each operation. For many operations, both the best case and
the worst case are unusable as cost - the first would almost always
underestimate the time and choose a plan which is far from optimal, and
the second would almost always overestimate it and reject an optimal
plan. The art of programming a planner (which I've dabbled with in a
previous (not postgresql-related) project but certainly can't claim any
expertise in) lies in choosing a cost function which is quite close most
of the time and catastrophically wrong only very rarely. It is clear
that PostgreSQL hasn't succeeded in the latter category: correlated
columns do occur, and the current cost function, which assumes that all
columns are uncorrelated, can catastrophically underestimate the cost in
that case.

The question is what can be done to improve the situation.

Tom thinks that correlation statistics would help. That seems plausible
to me.

You claim that no statistics are needed.

That may or may not be true: You haven't proposed an alternate method
yet.

I feel fairly certain that using the worst case (the cost for scanning
the whole table) would be just as bad and would cause inferior plans
to be used in many instances.

Maybe computing the cost as a weighted average of the best, average and
worst case (e.g. cost = cost_best*0.05 + cost_avg*0.90 + cost_worst*0.05)
would penalize methods with a large spread between best and worst case
enough - but that still leaves the problem of determining the weights
and determining what the "average" is. So it's the same black magic as
now, just a little more complicated (on the plus side, this would
probably be a relatively simple patch).

If we assume that we could revamp the planner completely, other
possibilities come to mind:

For example, since I think that the core problem is having a single
number for the cost, the planner could instead compute a distribution
(in the simplest case just best and worst case, but ideally many
values). Then the planner could say something like: I have two plans A
and B, and A is at most 20 % faster in almost all cases. But in the worst
case, A is 1000 times slower. Being 20 % faster most of the time is nice
but doesn't outweigh the risk of being 1000 times slower sometimes, so
I'll use B anyway.
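Again just a toy sketch of what such a comparison might look like; the
CostDist struct, the prefer() rule and the 20x risk limit are all
invented for illustration, not proposed as the right numbers:

    /* Toy sketch (not PostgreSQL code): compare two plans by a small
     * cost distribution instead of a single number. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct
    {
        double best;
        double typical;
        double worst;
    } CostDist;

    /* Prefer plan a over plan b only if it is faster in the typical case
     * AND its worst case is not catastrophically worse than b's. */
    static bool
    prefer(const CostDist *a, const CostDist *b)
    {
        if (a->typical >= b->typical)
            return false;
        return a->worst <= 20.0 * b->worst;  /* arbitrary risk limit */
    }

    int
    main(void)
    {
        CostDist a = { 3.0, 120.0, 1500000.0 };  /* fast on average, awful worst case */
        CostDist b = { 150.0, 150.0, 150.0 };    /* predictable */

        printf("choose %s\n", prefer(&a, &b) ? "A" : "B");  /* prints "choose B" */
        return 0;
    }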

Another possibility I've been considering for some time is feeding back
the real execution times into the planner, but that sounds like a major
research project. (Actually, I think Oracle has done something like this
since version 12.)
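To make the idea at least half concrete, the feedback loop could be as
simple as the sketch below - purely illustrative, and not how PostgreSQL
or Oracle actually work; every name and constant in it is an assumption:

    /* Toy sketch (not real planner code): remember the ratio of actual
     * to estimated rows for a plan shape and use it to correct the next
     * estimate for that shape. */
    #include <stdio.h>

    typedef struct
    {
        double correction;  /* running ratio actual/estimated */
    } FeedbackEntry;

    static void
    record_execution(FeedbackEntry *e, double estimated, double actual)
    {
        double ratio = actual / estimated;

        /* exponential moving average so old executions fade out */
        e->correction = 0.8 * e->correction + 0.2 * ratio;
    }

    static double
    corrected_estimate(const FeedbackEntry *e, double estimated)
    {
        return estimated * e->correction;
    }

    int
    main(void)
    {
        FeedbackEntry e = { 1.0 };

        record_execution(&e, 140.0, 2000000.0);  /* the estimate was way off */
        printf("next estimate: %.0f rows\n", corrected_estimate(&e, 140.0));
        return 0;
    }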

        hp

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/
