Thread: BUG #8598: Row count estimates of partial indexes

BUG #8598: Row count estimates of partial indexes

From
marko@joh.to
Date:
The following bug has been logged on the website:

Bug reference:      8598
Logged by:          Marko Tiikkaja
Email address:      marko@joh.to
PostgreSQL version: 9.1.9
Operating system:   Linux
Description:

Hi,


We have the following partial index on a small subset of a larger table:


  "index_transactions_transaction_balance_details" btree (transactionid)
WHERE NOT processed AND accountbalancesdailyid IS NOT NULL


However, querying with the WHERE clause completely ignores the
pg_class.reltuples value for the index:


=# EXPLAIN ANALYZE SELECT * FROM transactions WHERE NOT processed AND
accountbalancesdailyid IS NOT NULL;

      QUERY PLAN


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using index_transactions_transaction_balance_details on
transactions  (cost=0.00..3883160.47 rows=66259403 width=130) (actual
time=0.033..18.268 rows=13962 loops=1)
 Total runtime: 18.874 ms
(2 rows)


.. which makes for some silly joins when this index is part of a larger
query.


Is this expected on 9.1?  Has this been fixed in more recent versions?

Re: BUG #8598: Row count estimates of partial indexes

From
Kevin Grittner
Date:
"marko@joh.to" <marko@joh.to> wrote:=0A=0A>=A0=A0 "index_transactions_trans=
action_balance_details" btree=0A> (transactionid)=0A> WHERE NOT processed A=
ND accountbalancesdailyid IS NOT NULL=0A>=0A> However, querying with the WH=
ERE clause completely ignores the=0A> pg_class.reltuples value for the inde=
x:=0A=0AWhat is the pg_class.reltuples value for the index?=0A=0A> =3D# EXP=
LAIN ANALYZE SELECT * FROM transactions WHERE NOT processed=0A> AND account=
balancesdailyid IS NOT NULL;=0A=0A> Index Scan using index_transactions_tra=
nsaction_balance_details=0A> on transactions=A0 (cost=3D0.00..3883160.47 ro=
ws=3D66259403 width=3D130)=0A> (actual time=3D0.033..18.268 rows=3D13962 lo=
ops=3D1)=0A=0A> .. which makes for some silly joins when this index is part=
 of a=0A> larger query.=0A>=0A> Is this expected on 9.1?=A0 Has this been f=
ixed in more recent=0A> versions?=0A=0APlease provide a little more informa=
tion:=0A=0ASELECT version();=0ASELECT name, current_setting(name), source=
=0A=A0 FROM pg_settings=0A=A0 WHERE source NOT IN ('default', 'override');=
=0A=0AWhat VACUUM or ANALYZE commands are run outside of autovacuum?=0A=0A-=
-=0AKevin Grittner=0AEDB: http://www.enterprisedb.com=0AThe Enterprise Post=
greSQL Company

Re: BUG #8598: Row count estimates of partial indexes

From
Tom Lane
Date:
marko@joh.to writes:
> We have the following partial index on a small subset of a larger table:
>   "index_transactions_transaction_balance_details" btree (transactionid)
> WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
> However, querying with the WHERE clause completely ignores the
> pg_class.reltuples value for the index:

Yup.  Row count estimates are derived by estimating the selectivity of the
given WHERE clauses and multiplying by the (estimated) current table size.
In the particular case you show here, with a partial index that *exactly*
matches the WHERE clause, we could get a better answer by looking at the
index size --- but that doesn't scale to any less simplistic case, such
as a query with additional WHERE clauses.

It's also important to realize that reltuples for an index is a whole lot
less trustworthy than it is for a table; ANALYZE doesn't update the
former, for example.  And scaling from the last-reported VACUUM stats
to current reality is going to be shakier.

So on the whole, I don't think this would be a good idea.

            regards, tom lane

Re: BUG #8598: Row count estimates of partial indexes

From
Marko Tiikkaja
Date:
On 11/17/13, 5:29 PM, Tom Lane wrote:
> marko@joh.to writes:
>> We have the following partial index on a small subset of a larger table:
>>    "index_transactions_transaction_balance_details" btree (transactionid)
>> WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
>> However, querying with the WHERE clause completely ignores the
>> pg_class.reltuples value for the index:
>
> Yup.  Row count estimates are derived by estimating the selectivity of the
> given WHERE clauses and multiplying by the (estimated) current table size.
> In the particular case you show here, with a partial index that *exactly*
> matches the WHERE clause, we could get a better answer by looking at the
> index size --- but that doesn't scale to any less simplistic case, such
> as a query with additional WHERE clauses.
>
> It's also important to realize that reltuples for an index is a whole lot
> less trustworthy than it is for a table; ANALYZE doesn't update the
> former, for example.  And scaling from the last-reported VACUUM stats
> to current reality is going to be shakier.
>
> So on the whole, I don't think this would be a good idea.

Any suggestions for a workaround?  When reading this index as a part of
a bigger query the horrible estimate ensures that nobody's having fun.
I currently have something like:

   SELECT * FROM
   (SELECT * FROM transactions WHERE <partial index> LIMIT 25000)
transactions
   JOIN ..

And I *really* don't like that as a workaround.


Regards,
Marko Tiikkaja

Re: BUG #8598: Row count estimates of partial indexes

From
Tom Lane
Date:
Marko Tiikkaja <marko@joh.to> writes:
> Any suggestions for a workaround?  When reading this index as a part of
> a bigger query the horrible estimate ensures that nobody's having fun.

Why is the estimate so bad?  I suppose the answer is that those two
columns are very strongly correlated.  Maybe you could refactor your
data representation to avoid that?

The long-term answer as far as Postgres is concerned is to learn about
cross-column correlations, but that's not happening in the near future.

            regards, tom lane

Re: BUG #8598: Row count estimates of partial indexes

From
Jeff Janes
Date:
On Sun, Nov 17, 2013 at 11:55 AM, Marko Tiikkaja <marko@joh.to> wrote:

> On 11/17/13, 5:29 PM, Tom Lane wrote:
>
>> marko@joh.to writes:
>>
>>> We have the following partial index on a small subset of a larger table:
>>>    "index_transactions_transaction_balance_details" btree
>>> (transactionid)
>>> WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
>>> However, querying with the WHERE clause completely ignores the
>>> pg_class.reltuples value for the index:
>>>
>>
>> Yup.  Row count estimates are derived by estimating the selectivity of the
>> given WHERE clauses and multiplying by the (estimated) current table size.
>> In the particular case you show here, with a partial index that *exactly*
>> matches the WHERE clause, we could get a better answer by looking at the
>> index size --- but that doesn't scale to any less simplistic case, such
>> as a query with additional WHERE clauses.
>>
>> It's also important to realize that reltuples for an index is a whole lot
>> less trustworthy than it is for a table; ANALYZE doesn't update the
>> former, for example.  And scaling from the last-reported VACUUM stats
>> to current reality is going to be shakier.
>>
>> So on the whole, I don't think this would be a good idea.
>>
>
> Any suggestions for a workaround?  When reading this index as a part of a
> bigger query the horrible estimate ensures that nobody's having fun. I
> currently have something like:
>

Define a new column which is true iff the where condition is true?  It
sounds like that one magic combination has a meaning all of its own, so it
would make sense to encode it in one column.


>
>   SELECT * FROM
>   (SELECT * FROM transactions WHERE <partial index> LIMIT 25000)
> transactions
>   JOIN ..
>
> And I *really* don't like that as a workaround.
>

I've wanted a function that always returns true, but which the planner
things returns false most of the time, for use in such situations.  It
looks like you can make one of these with a compiled module (by creating an
operator and then wrapping that in a function), but I have not found a way
to do it without using C.  (CREATE FUNCTION takes a COST and ROWS, but not
a SELECTIVITY.)

 Cheers,

Jeff

Re: BUG #8598: Row count estimates of partial indexes

From
Marko Tiikkaja
Date:
On 11/17/13 9:18 PM, Tom Lane wrote:
> Marko Tiikkaja <marko@joh.to> writes:
>> Any suggestions for a workaround?  When reading this index as a part of
>> a bigger query the horrible estimate ensures that nobody's having fun.
>
> Why is the estimate so bad?  I suppose the answer is that those two
> columns are very strongly correlated.  Maybe you could refactor your
> data representation to avoid that?

I'll look into that, thanks.

> The long-term answer as far as Postgres is concerned is to learn about
> cross-column correlations, but that's not happening in the near future.

I'm completely clueless about how the planner works, but wouldn't it be
easier to have some kind of separate stats for the conditions in partial
indexes?  It seems better in all cases than trying infer the stats from
cross-column correlations, even if we had that.


Regards,
Marko Tiikkaja

Re: BUG #8598: Row count estimates of partial indexes

From
Tom Lane
Date:
Marko Tiikkaja <marko@joh.to> writes:
> On 11/17/13 9:18 PM, Tom Lane wrote:
>> The long-term answer as far as Postgres is concerned is to learn about
>> cross-column correlations, but that's not happening in the near future.

> I'm completely clueless about how the planner works, but wouldn't it be
> easier to have some kind of separate stats for the conditions in partial
> indexes?  It seems better in all cases than trying infer the stats from
> cross-column correlations, even if we had that.

There's been some discussion of providing a way to hint to ANALYZE about
which combinations of columns are worth gathering cross-column statistics
for.  But partial index predicates seem like a pretty bad mechanism for
that.

            regards, tom lane

Re: BUG #8598: Row count estimates of partial indexes

From
Claudio Freire
Date:
On Mon, Nov 18, 2013 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Marko Tiikkaja <marko@joh.to> writes:
>> On 11/17/13 9:18 PM, Tom Lane wrote:
>>> The long-term answer as far as Postgres is concerned is to learn about
>>> cross-column correlations, but that's not happening in the near future.
>
>> I'm completely clueless about how the planner works, but wouldn't it be
>> easier to have some kind of separate stats for the conditions in partial
>> indexes?  It seems better in all cases than trying infer the stats from
>> cross-column correlations, even if we had that.
>
> There's been some discussion of providing a way to hint to ANALYZE about
> which combinations of columns are worth gathering cross-column statistics
> for.  But partial index predicates seem like a pretty bad mechanism for
> that.
>
>                         regards, tom lane


Why?

If there's a partial index on some predicate, it does mean the
predicate is of common occurence or at least important and it's quite
expectable that more precise estimations regarding those queries
valuable.

Analyze should simply record the selectivity of partial index
predicates as it would the MFV of the boolean variable equal to the
predicate's result, and modifying the MFV estimation code to look up
for those specific stats doesn't seem too difficult.