Re: Setting Statistics on Functional Indexes - Mailing list pgsql-performance

From Tom Lane
Subject Re: Setting Statistics on Functional Indexes
Date
Msg-id 3040.1351285698@sss.pgh.pa.us
In response to Re: Setting Statistics on Functional Indexes  (Shaun Thomas <sthomas@optionshouse.com>)
List pgsql-performance
Shaun Thomas <sthomas@optionshouse.com> writes:
> But I just noticed the lag in your response. :) It turns out, even
> though I was substituting 2012-10-24 or 2012-10-25, what I really meant
> was current_date. That does make all the difference, actually.

Ah.  [ pokes at that for awhile... ]  OK, this has nothing to do with
functional indexes, and everything to do with the edge-case behavior of
scalarltsel.  What you've got is a histogram whose last entry
(corresponding to the highest observed value of the date) is
current_date, and the question is what we should assume when estimating
how many rows have a value >= that.  The answer of course is "one, plus
any duplicates" ... but we don't know how many duplicates there are,
and what we do know is it's not a particularly large number because the
value isn't present in the most-common-values stats.  So the code there
assumes there aren't any dups.
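
For concreteness, here's roughly the shape of the problem (the table,
column, and row counts below are invented for illustration, not taken
from Shaun's schema):

    CREATE TABLE activity (action_date date NOT NULL);
    -- ~3 years of roughly uniform history; "today" is the histogram endpoint
    INSERT INTO activity
      SELECT current_date - (random() * 1000)::int
      FROM generate_series(1, 1000000);
    ANALYZE activity;
    -- The last histogram bound is current_date, and since that value
    -- doesn't make it into most_common_vals, the planner assumes it has
    -- no duplicates, so this tends to be estimated at about one row even
    -- though on the order of a thousand rows actually match:
    EXPLAIN SELECT * FROM activity WHERE action_date >= current_date;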

Once you have enough histogram resolution for current_date to show up
as the next-to-last as well as the last histogram entry, then of course
the estimate gets a lot better, since we can now tell that there's at
least one histogram bin's worth of duplicates.
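
One way to get there (again using the made-up names from the sketch
above; the target of 1000 is just a placeholder) is to raise the
column's statistics target, re-analyze, and check whether the endpoint
value now shows up in more than one of the histogram bounds:

    ALTER TABLE activity ALTER COLUMN action_date SET STATISTICS 1000;
    ANALYZE activity;

    SELECT histogram_bounds
      FROM pg_stats
     WHERE tablename = 'activity' AND attname = 'action_date';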

Interestingly, this is a case where the get_actual_variable_range patch
(commit 40608e7f, which appeared in 9.0) makes the results worse.
Before that, there was a (very arbitrary) lower bound on what we'd
believe as the selectivity of a >= condition, but now, when we know the
actual upper limit of the variable, we don't clamp the result that way.
I think the clamp must have been saving you in your previous version,
because it more-or-less-accidentally accounted for the end value not
being unique.

So the bottom line is that this is a case where you need a lot of
resolution in the histogram.  I'm not sure there's anything good
we can do to avoid that.  I spent a bit of time thinking about whether
we could use n_distinct to get some idea of how many duplicates there
might be for the endpoint value, but n_distinct is unreliable enough
that I can't develop a lot of faith in such a thing.  Or we could just
arbitrarily assume some fraction-of-a-histogram-bin's worth of
duplicates, but that would make the results worse for some people.
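
For anyone wanting to see what the planner actually has to work with,
the relevant numbers are visible in pg_stats (same made-up names as
above); n_distinct here is the estimate being described as too
unreliable to lean on:

    SELECT n_distinct, most_common_vals, histogram_bounds
      FROM pg_stats
     WHERE tablename = 'activity' AND attname = 'action_date';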

            regards, tom lane

