Thread: Default Stats Revisited
Folks, Early on in the default_stats thread, I made a proposal that got dropped without discussion. I'd like to revisit it, because I still think it's a good idea. The Issue: The low default_stats_target of 10 is not sufficient for many complex queries involving multi-column correlation or oddly distributed data. Yet modestly increasing the stats target for *all* columns, as demonstrated, substantially increases the time required for Analyze, without gain on most queries. If only there were a way to automatically increas the default stats on only "important" columns, and not on other columns! Yet if we burden the DBA with flagging important colummns all over the database, we haven't saved him/her any work. Ah, but there is a way! Most "important" columns are already indicated ... because they are indexed. If we implemented a system where indexed columns would have a significantly higher stats_target than non-indexed columns, this might improve our default behavior without overburdening Analyze. Proposal: That we consider:-- adding a new GUC default_stats_indexed-- that this GUC be set initially to 100 if stats_targetis 10-- that the system be adjusted to that indexed columns take their stats_target from default_stats_indexedand not default_stats_target-- that expressional indexes be ignored for this purpose, as implementation would be too complex, and they have their own stats anyway If this proposal is worth considering, I will spend some time building up a test case to demonstrate the cost and utility of the plan. With Neil's help, of course! -- -Josh BerkusAglio Database SolutionsSan Francisco
Scott, > I like it. Would a multiplier be acceptable? > default_stats_index_multiplier = 10 Yeah, I thought about that, but a multiplier would be harder to manage for most people. I mean, what if your default_stats are at 25 and you want your index_stats at 40? PITA. Also, if you wanted to increase the default stats but *forgot* that the index_stats were a multiplier ... I think a straight number is less confusing. -- -Josh BerkusAglio Database SolutionsSan Francisco
On Wed, 10 Mar 2004, Josh Berkus wrote: > Folks, > > Early on in the default_stats thread, I made a proposal that got dropped > without discussion. I'd like to revisit it, because I still think it's a > good idea. > > The Issue: The low default_stats_target of 10 is not sufficient for many > complex queries involving multi-column correlation or oddly distributed data. > Yet modestly increasing the stats target for *all* columns, as demonstrated, > substantially increases the time required for Analyze, without gain on most > queries. > > If only there were a way to automatically increas the default stats on only > "important" columns, and not on other columns! Yet if we burden the DBA with > flagging important colummns all over the database, we haven't saved him/her > any work. > > Ah, but there is a way! Most "important" columns are already indicated ... > because they are indexed. If we implemented a system where indexed columns > would have a significantly higher stats_target than non-indexed columns, this > might improve our default behavior without overburdening Analyze. > > Proposal: That we consider: > -- adding a new GUC default_stats_indexed > -- that this GUC be set initially to 100 if stats_target is 10 > -- that the system be adjusted to that indexed columns take their > stats_target from default_stats_indexed and not default_stats_target > -- that expressional indexes be ignored for this purpose, as implementation > would be too complex, and they have their own stats anyway > > If this proposal is worth considering, I will spend some time building up a > test case to demonstrate the cost and utility of the plan. With Neil's help, > of course! I like it. Would a multiplier be acceptable? default_stats_index_multiplier = 10 Of course it would automatically be capped at 1000, etc... I like the idea that if I'm increasing the default statistics target the stats for indexed columns goes up too.
On Thu, 11 Mar 2004, Josh Berkus wrote: > Scott, > > > I like it. Would a multiplier be acceptable? > > default_stats_index_multiplier = 10 > > Yeah, I thought about that, but a multiplier would be harder to manage for > most people. I mean, what if your default_stats are at 25 and you want > your index_stats at 40? PITA. Also, if you wanted to increase the default > stats but *forgot* that the index_stats were a multiplier ... > > I think a straight number is less confusing. But possible more error prone. If you crank up the default statistics to 50, but the index default is still 25... OTOH, you could always have the setting of used for index default be whichever is greater... hmmm.
Scott, > But possible more error prone. If you crank up the default statistics to > 50, but the index default is still 25... OTOH, you could always have the > setting of used for index default be whichever is greater... hmmm. Well, I'm not 100% opposed to a multiplier. I'd like to take a poll of DBAs to find out which they would find more accessable. But since most people seem to be ignoring this thread, I'm not sure we'll get much response ... -- -Josh BerkusAglio Database SolutionsSan Francisco
Josh Berkus <josh@agliodbs.com> writes: > Well, I'm not 100% opposed to a multiplier. I'd like to take a poll of DBAs > to find out which they would find more accessable. But since most people > seem to be ignoring this thread, I'm not sure we'll get much response ... Maybe you should ask on -admin or -general. Personally I thought there wasn't anything to say until someone did some experiments to show whether an indexed-column differential is really worthwhile and what a plausible default value would be. The idea sounds good in the abstract, but will it really help or just be another useless setting? regards, tom lane
Tom, > Maybe you should ask on -admin or -general. Personally I thought there > wasn't anything to say until someone did some experiments to show > whether an indexed-column differential is really worthwhile and what a > plausible default value would be. The idea sounds good in the abstract, > but will it really help or just be another useless setting? Yeah, that's our next step, a test. On Monday, I hope to have comparative stats for a difficult database on the 3 solutions (leaving things as-is, raising the general default stats, and doing index_stats). -- -Josh BerkusAglio Database SolutionsSan Francisco
On Thursday 11 March 2004 14:17, Josh Berkus wrote: > Tom, > > > Maybe you should ask on -admin or -general. Personally I thought there > > wasn't anything to say until someone did some experiments to show > > whether an indexed-column differential is really worthwhile and what a > > plausible default value would be. The idea sounds good in the abstract, > > but will it really help or just be another useless setting? > > Yeah, that's our next step, a test. > > On Monday, I hope to have comparative stats for a difficult database on > the 3 solutions (leaving things as-is, raising the general default stats, > and doing index_stats). Do you plan on handeling primary key columns differently (since they are likely to be unique and indexed) ? Also how will you handle column that are part of expressional indexes (where foo is true for example) ? Final thought... I'm a DBA and I think the straight number is simpler, though could be convinced to go with whichever is higher... Robert Treat -- Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Robert, > Do you plan on handeling primary key columns differently (since they are > likely to be unique and indexed) ? The same as any other indexed column. > Also how will you handle column that are part of expressional indexes (where > foo is true for example) ? See my original proposal. These columns will be ignored. Expressions have their own stats. > Final thought... I'm a DBA and I think the straight number is simpler, though > could be convinced to go with whichever is higher... Thanks. -- -Josh BerkusAglio Database SolutionsSan Francisco
Josh Berkus <josh@agliodbs.com> writes: >> Also how will you handle column that are part of expressional indexes (where >> foo is true for example) ? > See my original proposal. These columns will be ignored. Expressions have > their own stats. Yeah, I see no particular need to increase the stats allocation for a column merely because it is used in an expression index. The mechanism Josh is proposing should cause the default amount of stats collected *for the expression index* to go up, though. BTW, there's an as-yet-undone bit of business associated with collecting stats for expression indexes. The ANALYZE code will honor an explicit attstattarget setting for an expressional index column, but there's no clean way to get that setting into the system. What works in CVS tip is regression=# create table foo (f1 int, f2 int); CREATE TABLE regression=# create index fooi on foo ((f1 + f2)); CREATE INDEX regression=# alter table fooi alter column pg_expression_1 set statistics 100; ALTER TABLE but I don't much care for this; the arbitrary names that are used for expressional columns shouldn't be embedded into SQL commands, and doing "alter table" on an index looks a bit funny as well. I want to make pg_dump dump these settings, but I'm not feeling comfortable with having it dump commands that look like the above. That would nail down the current method of assigning expression column names as something we could never change without breaking dump scripts. Can anyone think of a better way? regards, tom lane
Robert Treat wrote: > On Thursday 11 March 2004 14:17, Josh Berkus wrote: > > Tom, > > > > > Maybe you should ask on -admin or -general. Personally I thought there > > > wasn't anything to say until someone did some experiments to show > > > whether an indexed-column differential is really worthwhile and what a > > > plausible default value would be. The idea sounds good in the abstract, > > > but will it really help or just be another useless setting? > > > > Yeah, that's our next step, a test. > > > > On Monday, I hope to have comparative stats for a difficult database on > > the 3 solutions (leaving things as-is, raising the general default stats, > > and doing index_stats). > > Do you plan on handeling primary key columns differently (since they are > likely to be unique and indexed) ? Another idea is whether a foreign key column should get extra statistics? > Also how will you handle column that are part of expressional indexes (where > foo is true for example) ? > > Final thought... I'm a DBA and I think the straight number is simpler, though > could be convinced to go with whichever is higher... Agreed. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Another idea is whether a foreign key column should get extra > statistics? In practice, both ends of an FK relationship have to be indexed, so I don't see that we need any extra special case for that. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Another idea is whether a foreign key column should get extra > > statistics? > > In practice, both ends of an FK relationship have to be indexed, > so I don't see that we need any extra special case for that. Do they? We don't create an index automatically when using REFERENCES. We do create an index for PRIMARY KEY. I was just wondering if the REFERENCES column is more sensitive to join usage and would benefit from more accurate statistics even if it doesn't have an index. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce, > Do they? We don't create an index automatically when using REFERENCES. > We do create an index for PRIMARY KEY. > > I was just wondering if the REFERENCES column is more sensitive to join > usage and would benefit from more accurate statistics even if it doesn't > have an index. I don't think so. If one does not create an index on an FK column, this is usually because the column does not have enough of a range of values to deserve indexing (for example, 4 potential values across 100,000 records). In that case, we would not want to up the statistics either. And we haven't made it a practice to hand-hold for database designers who don't know when to index, so I don't think we should start now. -- -Josh BerkusAglio Database SolutionsSan Francisco
>Josh Berkus > > But possible more error prone. If you crank up the default statistics > to > > 50, but the index default is still 25... OTOH, you could always have > the > > setting of used for index default be whichever is greater... hmmm. > > Well, I'm not 100% opposed to a multiplier. I'd like to take a poll of > DBAs > to find out which they would find more accessable. But since most people > seem to be ignoring this thread, I'm not sure we'll get much response ... ...very interesting thoughts overall. I note that your idea has a very strong basis and is pretty much suggested to us, since it forms part of the allowable specification of TPC-H: - so let's do it! I enclose the following TPC-H text. Section 5.2.8, p.103, TPC Benchmark H (Decision Support), Standard Specification, revision 2.1.0: 5.2.8 The gathering of statistics is part of the database load (see Clause 4.3) but it also serves as an important configuration vehicle, particularly for the query optimizer. In order to satisfy the requirements of Clause 5.2.7, it is desirable to collect the same quality of statistics for every column of every table. However, in order to reduce processing requirements, it is permissible to segment columns into distinct classes and base the level of statistics collection for a particular column on class membership. Class definitions must rely solely on schema-related attributes of a column and must be applied consistently across all tables. For example: . membership in an index; . leading or other position in an index; . use in a constraint (including a primary or foreign key relationships). Statistics that operate in sets, such as distribution statistics, should employ a fixed set appropriate to the scale factor used. Knowledge of the cardinality, values or distribution of a non-key column as specified in Clause 4 cannot be used to tailor statistics gathering. Based upon that, might we add slightly to your index stats suggestion slightly and include constraint-member columns also? Overall, the problem you highlighted is: - if default stats is 10 and we don't think that is very useful, then clearly that should change, yet we are performance constrained I would note that DB2 uses default 20, as does Teradata. Oracle uses 75 histogram buckets as default. Clearly, 10 is not the "accepted" view...but I'm sure I'll be shot down for such thinking. You set me thinking about another possible solution: My understanding is that the default for stats on PostgreSQL is "collect", whereas on other systems it is "don't collect" (until instructed). i.e. we collect stats on all columns by default. That's good, but the end result is that it is slower than the other default. [That might be important because dynamic sampling is not yet implemented, not sure] If we had the option not to collect stats at all on most columns, then that would speed things up, wouldn't it (just as TPC-H pretty much says). Perhaps it might be better to offer an option to alter that default? It seems that DEFAULT_STATISTICS_TARGET cannot be set to "no thanks". If you could turn off the collection of what we seem to be agreeing is a relatively pointless collection of statistics, would that not improve stats collection performance? If set to 0, only collect number of nulls, number of distinct values, max and min. Turning off stats-by-default and yet ramping up the collection target where it is needed sounds good for TPC-H/DBT-3 environments - and will likely even improve perf numbers on the actual tests! ..this might also allow us to rename the two parameters, just as was recently done with work_mem etc.. statistics_target_default 0 -- i.e. don't collect by default statistics_target_index 100 I would also suggest a further class of columns for statistics collection: statistics_target_text which would include all fields with length > 32 (pick a limit...), since these are very frequently all unique. Setting these to -1 would mean they use the statistics_target_default value, which would be their default setting. You might even provide a mechanism for defining statistics collection classes based upon their datatype...e.g. CREATE STATISTICS CLASS <CLASSNAME> STATISTICS COLLECTION TARGET 100 APPLIES TO (LEADING n) INDEX COLUMNS, CONSTRAINT COLUMNS; CREATE STATISTICS CLASS <CLASSNAME>STATISTICS COLLECTION TARGET 0 APPLIES TO <COMMAS SEPARATED LIST OF DATATYPES>; <anything not specifically identified, would then use system parameter default> That would be great, since we now have the ability to collects stats on user defined datatypes (don't we?). Back to the multiplier: Yes please, but only if it worked like this: Normal and index stats are settable differently. Index stats DEFAULT is a multiple of normal stats, unless specifically set. (You may wish to set it down as well as up, remember). That way, the default behaviour improves even when the index stats parameter is not actually set, yet is still controllable when you do. Best Regards, Simon Riggs