Thread: Updating histogram_bounds after a delete

Updating histogram_bounds after a delete

From
Derrick Rice
Date:
Greetings.

I recently ran into a problem with a planner opting for a sequential scan rather than a bitmap heap scan because the stats suggested that my delete query was going to affect 33% of the rows, rather than the 1% it really was.  I was able to follow the planner's logic and came to the realization that it was a result of the histogram_bounds for that column being out of date.

The table is regularly purged of some of it's oldest data, and new data is constantly added.  It seems to me that PostgreSQL *should* be able to identify a query which is going to delete all rows within a histogram bucket, and could possibly react by updating the histogram_bounds at commit-time, rather than needing an additional analyze or needing auto-analyze settings jacked way up.

Alternatively, it might be nice to be able to manually describe the table (I've been following the "no hints" discussion) by providing information along the lines of "always assume that column event_date is uniformly distributed".  This would be provided as schema information, not additional SQL syntax for hints.

Is this something that is remotely feasible, has the suggestion been made before, or am I asking for something where a solution already exists?

Thanks,

Derrick

Re: Updating histogram_bounds after a delete

From
Derrick Rice
Date:
Oh, I'm using 8.2

On Wed, Mar 16, 2011 at 3:40 PM, Derrick Rice <derrick.rice@gmail.com> wrote:
Greetings.

I recently ran into a problem with a planner opting for a sequential scan rather than a bitmap heap scan because the stats suggested that my delete query was going to affect 33% of the rows, rather than the 1% it really was.  I was able to follow the planner's logic and came to the realization that it was a result of the histogram_bounds for that column being out of date.

The table is regularly purged of some of it's oldest data, and new data is constantly added.  It seems to me that PostgreSQL *should* be able to identify a query which is going to delete all rows within a histogram bucket, and could possibly react by updating the histogram_bounds at commit-time, rather than needing an additional analyze or needing auto-analyze settings jacked way up.

Alternatively, it might be nice to be able to manually describe the table (I've been following the "no hints" discussion) by providing information along the lines of "always assume that column event_date is uniformly distributed".  This would be provided as schema information, not additional SQL syntax for hints.

Is this something that is remotely feasible, has the suggestion been made before, or am I asking for something where a solution already exists?

Thanks,

Derrick

Re: Updating histogram_bounds after a delete

From
"Kevin Grittner"
Date:
Derrick Rice <derrick.rice@gmail.com> wrote:

> I recently ran into a problem with a planner opting for a
> sequential scan rather than a bitmap heap scan because the stats
> suggested that my delete query was going to affect 33% of the
> rows, rather than the 1% it really was.

> could possibly react by updating the histogram_bounds at
> commit-time, rather than needing an additional analyze or needing
> auto-analyze settings jacked way up.

I recommend you try version 9.0 with default autovacuum settings and
see how things go.  If you still have an issue, let's talk then.
Besides numerous autovacuum improvements, which make it more
reliable and less likely to noticeably affect runtime of your
queries, there is a feature to probe the end of an index's range in
situations where data skew was often causing less than optimal plans
to be chosen.

From what you've told us, I suspect you won't see this problem in
9.0 unless you shoot yourself in the foot by crippling autovacuum.

-Kevin

Re: Updating histogram_bounds after a delete

From
Derrick Rice
Date:
On Wed, Mar 16, 2011 at 5:56 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
there is a feature to probe the end of an index's range in
situations where data skew was often causing less than optimal plans
to be chosen.

Was this introduced in 9.0 or was it earlier?  My company hasn't introduced integrated support for 9.0 yet, but I can go to 8.4.

It was suggested that I change my SQL from:

delete from my_table where event_date < now() - interval '12 hours';

to:

delete from my_table where event_date < now() - interval '12 hours'
and event_date >= (select min(event_date) from my_table);

Which, even if the stats are out of date, will be more accurate as it will not consider the histogram buckets that are empty due to previous deletes.  Seems like exactly what the feature you mentioned would do, no?

Thanks for the help,

Derrick

Re: Updating histogram_bounds after a delete

From
"Kevin Grittner"
Date:
Derrick Rice <derrick.rice@gmail.com> wrote:
> Kevin Grittner <Kevin.Grittner@wicourts.gov wrote:
>
>> there is a feature to probe the end of an index's range in
>> situations where data skew was often causing less than optimal
>> plans to be chosen.
>
> Was this introduced in 9.0 or was it earlier?

I don't remember when it was added.  I took a stab at searching for
it, but didn't get it figured out; if nobody who knows off-hand
jumps in, I'll try again when I have more time.

> It was suggested that I change my SQL from:
>
> delete from my_table where event_date < now() - interval '12
> hours';
>
> to:
>
> delete from my_table where event_date < now() - interval '12
> hours' and event_date >= (select min(event_date) from my_table);

That seems like a reasonable workaround.

> Seems like exactly what the feature you mentioned would do, no?

I know it helps with inserts off the end of the range; I'm less
certain about deletes.  I *think* that's covered, but I'd have to
dig into the code or do some testing to confirm.

-Kevin

Re: Updating histogram_bounds after a delete

From
Kenneth Marshall
Date:
On Thu, Mar 17, 2011 at 09:49:45AM -0500, Kevin Grittner wrote:
> Derrick Rice <derrick.rice@gmail.com> wrote:
> > Kevin Grittner <Kevin.Grittner@wicourts.gov wrote:
> >
> >> there is a feature to probe the end of an index's range in
> >> situations where data skew was often causing less than optimal
> >> plans to be chosen.
> >
> > Was this introduced in 9.0 or was it earlier?
>
> I don't remember when it was added.  I took a stab at searching for
> it, but didn't get it figured out; if nobody who knows off-hand
> jumps in, I'll try again when I have more time.
>

I think this is it:

http://archives.postgresql.org/pgsql-committers/2010-01/msg00021.php

Regards,
Ken

Re: Updating histogram_bounds after a delete

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Derrick Rice <derrick.rice@gmail.com> wrote:
>> Kevin Grittner <Kevin.Grittner@wicourts.gov wrote:
>>> there is a feature to probe the end of an index's range in
>>> situations where data skew was often causing less than optimal
>>> plans to be chosen.

>> Was this introduced in 9.0 or was it earlier?

> I don't remember when it was added.  I took a stab at searching for
> it, but didn't get it figured out; if nobody who knows off-hand
> jumps in, I'll try again when I have more time.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL9_0_BR [40608e7f9] 2010-01-04 02:44:40 +0000

    When estimating the selectivity of an inequality "column > constant" or
    "column < constant", and the comparison value is in the first or last
    histogram bin or outside the histogram entirely, try to fetch the actual
    column min or max value using an index scan (if there is an index on the
    column).  If successful, replace the lower or upper histogram bound with
    that value before carrying on with the estimate.  This limits the
    estimation error caused by moving min/max values when the comparison
    value is close to the min or max.  Per a complaint from Josh Berkus.

    It is tempting to consider using this mechanism for mergejoinscansel as well,
    but that would inject index fetches into main-line join estimation not just
    endpoint cases.  I'm refraining from that until we can get a better handle
    on the costs of doing this type of lookup.

            regards, tom lane

Re: Updating histogram_bounds after a delete

From
"Kevin Grittner"
Date:
Kenneth Marshall <ktm@rice.edu> wrote:

> I think this is it:
>
>
http://archives.postgresql.org/pgsql-committers/2010-01/msg00021.php

Looks like it.  Based on the commit date, that would be a 9.0
change.  Based on the description, I'm not sure it fixes Derrick's
problem; the workaround of explicitly using min() for the low end of
a range may need to be a long-term approach.

It does seem odd, though, that the statistics would be off by that
much.  Unless the query is run immediately after a mass delete,
autovacuum should be fixing that.  Perhaps the autovacuum
improvements in later releases will solve the problem.  If not, an
explicit ANALYZE (or perhaps better, VACUUM ANALYZE) immediately
after a mass delete would be wise.

-Kevin