Correlation in cost_index() - Mailing list pgsql-hackers

From Manfred Koizar
Subject Correlation in cost_index()
Date
Msg-id n6cmpug13b9rk1srebjvhphg0lm8dou1kn@4ax.com
Whole thread Raw
Responses Re: Correlation in cost_index()  ("scott.marlowe" <scott.marlowe@ihs.com>)
Re: Correlation in cost_index()  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
You all know this FAQ: "Why does Postgres not use my index?"  Half of
the time this problem can easily be solved by casting a literal to the
type of the respective column;  this is not my topic here.

In many other cases it turns out that the planner over-estimates the
cost of an index scan.  Sometimes this can be worked around by
lowering random_page_cost.  "Of course, that's a hack that is quite
unrelated to the real problem."  I strongly agree ;-)

AFAICS (part of) the real problem is in costsize.c:cost_index() where
IO_cost is calculated from min_IO_cost, pages_fetched,
random_page_cost, and indexCorrelation.  The current implementation
uses indexCorrelation^2 to interpolate between min_IO_cost and
max_IO_cost, which IMHO gives results that are too close to
max_IO_cost.  This conjecture is supported by the fact, that often
actual run times are much lower than estimated, when seqscans are
disabled.

So we have to find a cost function, so that
 . min_IO_cost <= cost <= max_IO_cost                        for  -1 <= indexCorrelation <= 1 . cost --> min_IO_cost
for indexCorrelation --> +/- 1 . cost --> max_IO_cost  for  indexCorrelation --> 0 . cost tends more towards
min_IO_costthan current implementation
 

After playing around a bit I propose three functions satisfying above
conditions.  All proposals use absC = abs(indexCorrelation).

Proposal 1:  Use absC for interpolation.
IO_cost = absC * min_IO_cost + (1 - absC) * max_IO_cost;


Proposal 2:  First calculate estimates for numbers of pages and cost
per page, then multiply the results.
estPages = absC * minPages + (1 - absC) * maxPages;estPCost = absC * 1 + (1 - absC) * random_page_cost;              /*
^                  sequential_page_cost */IO_cost = estPages * estPCost;
 


Proposal 3:  Interpolate "geometrically", using absC.
IO-cost = exp(   absC    * ln(min_IO_Cost) +              (1 - absC) * ln(max_IO_cost));


Here are some numbers for   seq_page_cost = 1   (constant)   random_page_cost = 4  (GUC)   minPages = 61   maxPages =
1440
       corr  current      p1        p2        p3       0     5760.00   5760.00   5760.00   5760.00       0.1   5703.01
5190.10   4817.77   3655.22       0.2   5532.04   4620.20   3958.28   2319.55       0.3   5247.09   4050.30   3181.53
1471.96      0.4   4848.16   3480.40   2487.52    934.08       0.5   4335.25   2910.50   1876.25    592.76       0.6
3708.36  2340.60   1347.72    376.16       0.7   2967.49   1770.70    901.93    238.70       0.8   2112.64   1200.80
538.88   151.48       0.9   1143.81    630.90    258.57     96.13       0.95   616.65    345.95    149.44     76.57
 0.99   174.41    117.99     77.03     63.84       0.995  117.85     89.50     68.91     62.40       0.999   72.39
66.70    62.57     61.28       1       61.00     61.00     61.00     61.00
 

Another example for   seq_page_cost = 1   (constant)   random_page_cost = 10  (GUC)   minPages = 20   maxPages =
938.58
       corr  current      p1        p2        p3       0     9385.79   9385.79   9385.79   9385.79       0.1   9292.14
8449.21   7705.17   5073.72       0.2   9011.16   7512.64   6189.88   2742.73       0.3   8542.87   6576.06   4839.94
1482.65      0.4   7887.27   5639.48   3655.34    801.48       0.5   7044.35   4702.90   2636.09    433.26       0.6
6014.11  3766.32   1782.19    234.21       0.7   4796.56   2829.74   1093.62    126.61       0.8   3391.69   1893.16
570.40    68.44       0.9   1799.50    956.58    212.53     37.00       0.95   933.16    488.29     95.60     27.20
 0.99   206.38    113.66     31.81     21.27       0.995  113.42     66.83     25.70     20.62       0.999   38.72
29.37    21.11     20.12       1       20.00     20.00     20.00     20.00
 

(If you want to play around with your own numbers, I can send my OOo
spreadsheet privately or to the list.)

The second example shows that especially with proposal 3 we could
afford to set random_page_cost to a *higher* value, which in contrast
to previous recommendations seems to be appropriate, IIRC that
benchmark results posted here showed values of up to 60.

As nobody knows how each of these proposals performs in real life
under different conditions, I suggest to leave the current
implementation in, add all three algorithms, and supply a GUC variable
to select a cost function.

Comments?  Ideas?  Objections?

ServusManfred


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: (Fwd) Re: Any Oracle 9 users? A test please...
Next
From: "scott.marlowe"
Date:
Subject: Re: Correlation in cost_index()