Re: BRIN cost estimate - Mailing list pgsql-hackers

From Emre Hasegeli
Subject Re: BRIN cost estimate
Date
Msg-id CAE2gYzzJvzPy-1cSgZjJyH69izSa13SEgFC4i4r2z0qBQ2P8Uw@mail.gmail.com
Whole thread Raw
In response to BRIN cost estimate  (Alvaro Herrera <alvherre@2ndQuadrant.com>)
Responses Re: BRIN cost estimate  (Emre Hasegeli <emre@hasegeli.com>)
List pgsql-hackers
> Somebody wrote to me a few days ago that the BRIN cost estimation is
> rather poor.  One immediately obvious issue which I think is easily
> fixed is the correlation estimate, which is currently hardcoded to 1.
>
> Since a BRIN scan always entails a scan of the relation in physical
> order, it's simple to produce a better estimate for that, per the
> attached patch.  (Note: trying to run this on expression indexes will
> cause a crash.  I need to complete that part ...)
>
> There are other improvements we've been discussing, but I'm unclear that
> I will have time to study them to get a useful patch quickly enough.  If
> somebody else wants to mess with this, please get in touch.
>
> Thoughts?

As we have discusses, the correlation is not used for bitmap index
scan cost estimation.  I think the best we can do in here is to somehow
increase the selectivity.  I suggest something like this:

/*
* Index selectivity is important for the planner to calculate the cost
* of the bitmap heap scan.  Unfortunately, we don't have a robust way to
* estimate selectivity of BRIN.  It can dependent on many things.  This
* is a long rationale about the incomplete calculation we have at the
* moment.
*
* Our starting point is that BRIN selectivity has to be less than the
* selectivity of the btree.  We are using a product of logical and
* physical selectivities to achieve this.  The equality of
*
* (1 + logical_selectivity) * (1 + physical_selectivity) - 1
*
* is used to make sure the result is not less than any of the values.
*
* The logical selectivity is calculated using the indexable expressions
* of the WHERE clause.  The physical selectivity is calculated using
* the block proportion and the maximum correlation.  The block
* proportion is a comparable value with selectivity.  It is the
* selectivity of the smallest unit of the index.  The final selectivity
* can never be less than that.
*
* Using the contrary of the correlation by subtracting it from 1 is not
* not really a comparable value with the selectivity.  It is just a
* value between 0 and 1.  On the other hand, it is the only value
* related to the BRIN quality, we have available right now.  We are
* using the arithmetic of it with the block proportion to normalise it.
* This part of the physical selectivity is likely to be more effective
* than the block proportion in many circumstances as there would be
* many blocks on big tables.
*
* Using the contrary of the correlation of a column as selectivity of
* the index is wrong in many ways.  First of all, it cannot be applied
* to all BRIN operator classes.  It makes sense for the main built-in
* operator class "minmax", and makes a little sense for the other one
* "inclusion".  It wouldn't probably make any sense for a bloom filter
* implementation, if there would be any.  Maybe, we should push down
* this function to the operator class, but there is not enough reason
* to do it right now.
*
* Second, correlation is not dependent to any indexed expression.  It
* probably doesn't make any sense for the complicated operators.  It
* would probably effect basic comparison operators differently than
* equality operator.  The effect would even differ by count of those
* expressions.  For example, x IN (10, 20, 30) would be effected from
* correlation more than x = 15, even when their selectivities are the
* same.
*
* Last but not least, the correlation is a single value for the whole
* range.  The indexed table can partly be very well correlated, but
* the correlation value can still be very low.  For example, if a
* perfectly correlated table is copied 4 times, the correlation would
* be 0.25, although the index would be almost as good as the version on
* the initial table.  Or the expression can match the better correlated
* part of the table.  It is not hard to imagine more scenarios where
* the correlation is a bad value to use as the selectivity.  We should
* probably improve this by collecting more statistics, one day.
*
* Another problem in here is that the caller assumes the selectivity
* by tuples.  It might have been better, if we had a way to return it
* as some number of pages.  On the other hand, even though we know about
* the index, it is not too much easier for us to estimate the number of
* matching pages then it is for the caller.  We are likely to make too
* much mistake by relying on the correlation, anyway.  We are at least
* not making it worse in here.
*/
blockSelectivity = (blockProportion + 1 - *indexCorrelation) / 2;
selec = (1 + qualSelectivity) * (1 + blockSelectivity) - 1;
CLAMP_PROBABILITY(selec);

The patch is attached.



pgsql-hackers by date:

Previous
From: Fornaroli Christophe
Date:
Subject: Patch: pg_trgm: gin index scan performance for similarity search
Next
From: Tomas Vondra
Date:
Subject: Re: WIP: bloom filter in Hash Joins with batches