Thread: Updating histogram_bounds after a delete
Greetings.
I recently ran into a problem with a planner opting for a sequential scan rather than a bitmap heap scan because the stats suggested that my delete query was going to affect 33% of the rows, rather than the 1% it really was. I was able to follow the planner's logic and came to the realization that it was a result of the histogram_bounds for that column being out of date.
The table is regularly purged of some of it's oldest data, and new data is constantly added. It seems to me that PostgreSQL *should* be able to identify a query which is going to delete all rows within a histogram bucket, and could possibly react by updating the histogram_bounds at commit-time, rather than needing an additional analyze or needing auto-analyze settings jacked way up.
Alternatively, it might be nice to be able to manually describe the table (I've been following the "no hints" discussion) by providing information along the lines of "always assume that column event_date is uniformly distributed". This would be provided as schema information, not additional SQL syntax for hints.
Is this something that is remotely feasible, has the suggestion been made before, or am I asking for something where a solution already exists?
Thanks,
Derrick
I recently ran into a problem with a planner opting for a sequential scan rather than a bitmap heap scan because the stats suggested that my delete query was going to affect 33% of the rows, rather than the 1% it really was. I was able to follow the planner's logic and came to the realization that it was a result of the histogram_bounds for that column being out of date.
The table is regularly purged of some of it's oldest data, and new data is constantly added. It seems to me that PostgreSQL *should* be able to identify a query which is going to delete all rows within a histogram bucket, and could possibly react by updating the histogram_bounds at commit-time, rather than needing an additional analyze or needing auto-analyze settings jacked way up.
Alternatively, it might be nice to be able to manually describe the table (I've been following the "no hints" discussion) by providing information along the lines of "always assume that column event_date is uniformly distributed". This would be provided as schema information, not additional SQL syntax for hints.
Is this something that is remotely feasible, has the suggestion been made before, or am I asking for something where a solution already exists?
Thanks,
Derrick
Oh, I'm using 8.2
On Wed, Mar 16, 2011 at 3:40 PM, Derrick Rice <derrick.rice@gmail.com> wrote:
Greetings.
I recently ran into a problem with a planner opting for a sequential scan rather than a bitmap heap scan because the stats suggested that my delete query was going to affect 33% of the rows, rather than the 1% it really was. I was able to follow the planner's logic and came to the realization that it was a result of the histogram_bounds for that column being out of date.
The table is regularly purged of some of it's oldest data, and new data is constantly added. It seems to me that PostgreSQL *should* be able to identify a query which is going to delete all rows within a histogram bucket, and could possibly react by updating the histogram_bounds at commit-time, rather than needing an additional analyze or needing auto-analyze settings jacked way up.
Alternatively, it might be nice to be able to manually describe the table (I've been following the "no hints" discussion) by providing information along the lines of "always assume that column event_date is uniformly distributed". This would be provided as schema information, not additional SQL syntax for hints.
Is this something that is remotely feasible, has the suggestion been made before, or am I asking for something where a solution already exists?
Thanks,
Derrick
Derrick Rice <derrick.rice@gmail.com> wrote: > I recently ran into a problem with a planner opting for a > sequential scan rather than a bitmap heap scan because the stats > suggested that my delete query was going to affect 33% of the > rows, rather than the 1% it really was. > could possibly react by updating the histogram_bounds at > commit-time, rather than needing an additional analyze or needing > auto-analyze settings jacked way up. I recommend you try version 9.0 with default autovacuum settings and see how things go. If you still have an issue, let's talk then. Besides numerous autovacuum improvements, which make it more reliable and less likely to noticeably affect runtime of your queries, there is a feature to probe the end of an index's range in situations where data skew was often causing less than optimal plans to be chosen. From what you've told us, I suspect you won't see this problem in 9.0 unless you shoot yourself in the foot by crippling autovacuum. -Kevin
On Wed, Mar 16, 2011 at 5:56 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
there is a feature to probe the end of an index's range in
situations where data skew was often causing less than optimal plans
to be chosen.
Was this introduced in 9.0 or was it earlier? My company hasn't introduced integrated support for 9.0 yet, but I can go to 8.4.
It was suggested that I change my SQL from:
delete from my_table where event_date < now() - interval '12 hours';
to:
delete from my_table where event_date < now() - interval '12 hours'
and event_date >= (select min(event_date) from my_table);
Which, even if the stats are out of date, will be more accurate as it will not consider the histogram buckets that are empty due to previous deletes. Seems like exactly what the feature you mentioned would do, no?
Thanks for the help,
Derrick
Derrick Rice <derrick.rice@gmail.com> wrote: > Kevin Grittner <Kevin.Grittner@wicourts.gov wrote: > >> there is a feature to probe the end of an index's range in >> situations where data skew was often causing less than optimal >> plans to be chosen. > > Was this introduced in 9.0 or was it earlier? I don't remember when it was added. I took a stab at searching for it, but didn't get it figured out; if nobody who knows off-hand jumps in, I'll try again when I have more time. > It was suggested that I change my SQL from: > > delete from my_table where event_date < now() - interval '12 > hours'; > > to: > > delete from my_table where event_date < now() - interval '12 > hours' and event_date >= (select min(event_date) from my_table); That seems like a reasonable workaround. > Seems like exactly what the feature you mentioned would do, no? I know it helps with inserts off the end of the range; I'm less certain about deletes. I *think* that's covered, but I'd have to dig into the code or do some testing to confirm. -Kevin
On Thu, Mar 17, 2011 at 09:49:45AM -0500, Kevin Grittner wrote: > Derrick Rice <derrick.rice@gmail.com> wrote: > > Kevin Grittner <Kevin.Grittner@wicourts.gov wrote: > > > >> there is a feature to probe the end of an index's range in > >> situations where data skew was often causing less than optimal > >> plans to be chosen. > > > > Was this introduced in 9.0 or was it earlier? > > I don't remember when it was added. I took a stab at searching for > it, but didn't get it figured out; if nobody who knows off-hand > jumps in, I'll try again when I have more time. > I think this is it: http://archives.postgresql.org/pgsql-committers/2010-01/msg00021.php Regards, Ken
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Derrick Rice <derrick.rice@gmail.com> wrote: >> Kevin Grittner <Kevin.Grittner@wicourts.gov wrote: >>> there is a feature to probe the end of an index's range in >>> situations where data skew was often causing less than optimal >>> plans to be chosen. >> Was this introduced in 9.0 or was it earlier? > I don't remember when it was added. I took a stab at searching for > it, but didn't get it figured out; if nobody who knows off-hand > jumps in, I'll try again when I have more time. Author: Tom Lane <tgl@sss.pgh.pa.us> Branch: master Release: REL9_0_BR [40608e7f9] 2010-01-04 02:44:40 +0000 When estimating the selectivity of an inequality "column > constant" or "column < constant", and the comparison value is in the first or last histogram bin or outside the histogram entirely, try to fetch the actual column min or max value using an index scan (if there is an index on the column). If successful, replace the lower or upper histogram bound with that value before carrying on with the estimate. This limits the estimation error caused by moving min/max values when the comparison value is close to the min or max. Per a complaint from Josh Berkus. It is tempting to consider using this mechanism for mergejoinscansel as well, but that would inject index fetches into main-line join estimation not just endpoint cases. I'm refraining from that until we can get a better handle on the costs of doing this type of lookup. regards, tom lane
Kenneth Marshall <ktm@rice.edu> wrote: > I think this is it: > > http://archives.postgresql.org/pgsql-committers/2010-01/msg00021.php Looks like it. Based on the commit date, that would be a 9.0 change. Based on the description, I'm not sure it fixes Derrick's problem; the workaround of explicitly using min() for the low end of a range may need to be a long-term approach. It does seem odd, though, that the statistics would be off by that much. Unless the query is run immediately after a mass delete, autovacuum should be fixing that. Perhaps the autovacuum improvements in later releases will solve the problem. If not, an explicit ANALYZE (or perhaps better, VACUUM ANALYZE) immediately after a mass delete would be wise. -Kevin