Re: Default Stats Revisited - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Re: Default Stats Revisited |
Date | |
Msg-id | 006501c4081f$47fabe80$509d87d9@LaptopDellXP Whole thread Raw |
In response to | Re: Default Stats Revisited (Josh Berkus <josh@agliodbs.com>) |
List | pgsql-hackers |
>Josh Berkus > > But possible more error prone. If you crank up the default statistics > to > > 50, but the index default is still 25... OTOH, you could always have > the > > setting of used for index default be whichever is greater... hmmm. > > Well, I'm not 100% opposed to a multiplier. I'd like to take a poll of > DBAs > to find out which they would find more accessable. But since most people > seem to be ignoring this thread, I'm not sure we'll get much response ... ...very interesting thoughts overall. I note that your idea has a very strong basis and is pretty much suggested to us, since it forms part of the allowable specification of TPC-H: - so let's do it! I enclose the following TPC-H text. Section 5.2.8, p.103, TPC Benchmark H (Decision Support), Standard Specification, revision 2.1.0: 5.2.8 The gathering of statistics is part of the database load (see Clause 4.3) but it also serves as an important configuration vehicle, particularly for the query optimizer. In order to satisfy the requirements of Clause 5.2.7, it is desirable to collect the same quality of statistics for every column of every table. However, in order to reduce processing requirements, it is permissible to segment columns into distinct classes and base the level of statistics collection for a particular column on class membership. Class definitions must rely solely on schema-related attributes of a column and must be applied consistently across all tables. For example: . membership in an index; . leading or other position in an index; . use in a constraint (including a primary or foreign key relationships). Statistics that operate in sets, such as distribution statistics, should employ a fixed set appropriate to the scale factor used. Knowledge of the cardinality, values or distribution of a non-key column as specified in Clause 4 cannot be used to tailor statistics gathering. Based upon that, might we add slightly to your index stats suggestion slightly and include constraint-member columns also? Overall, the problem you highlighted is: - if default stats is 10 and we don't think that is very useful, then clearly that should change, yet we are performance constrained I would note that DB2 uses default 20, as does Teradata. Oracle uses 75 histogram buckets as default. Clearly, 10 is not the "accepted" view...but I'm sure I'll be shot down for such thinking. You set me thinking about another possible solution: My understanding is that the default for stats on PostgreSQL is "collect", whereas on other systems it is "don't collect" (until instructed). i.e. we collect stats on all columns by default. That's good, but the end result is that it is slower than the other default. [That might be important because dynamic sampling is not yet implemented, not sure] If we had the option not to collect stats at all on most columns, then that would speed things up, wouldn't it (just as TPC-H pretty much says). Perhaps it might be better to offer an option to alter that default? It seems that DEFAULT_STATISTICS_TARGET cannot be set to "no thanks". If you could turn off the collection of what we seem to be agreeing is a relatively pointless collection of statistics, would that not improve stats collection performance? If set to 0, only collect number of nulls, number of distinct values, max and min. Turning off stats-by-default and yet ramping up the collection target where it is needed sounds good for TPC-H/DBT-3 environments - and will likely even improve perf numbers on the actual tests! ..this might also allow us to rename the two parameters, just as was recently done with work_mem etc.. statistics_target_default 0 -- i.e. don't collect by default statistics_target_index 100 I would also suggest a further class of columns for statistics collection: statistics_target_text which would include all fields with length > 32 (pick a limit...), since these are very frequently all unique. Setting these to -1 would mean they use the statistics_target_default value, which would be their default setting. You might even provide a mechanism for defining statistics collection classes based upon their datatype...e.g. CREATE STATISTICS CLASS <CLASSNAME> STATISTICS COLLECTION TARGET 100 APPLIES TO (LEADING n) INDEX COLUMNS, CONSTRAINT COLUMNS; CREATE STATISTICS CLASS <CLASSNAME>STATISTICS COLLECTION TARGET 0 APPLIES TO <COMMAS SEPARATED LIST OF DATATYPES>; <anything not specifically identified, would then use system parameter default> That would be great, since we now have the ability to collects stats on user defined datatypes (don't we?). Back to the multiplier: Yes please, but only if it worked like this: Normal and index stats are settable differently. Index stats DEFAULT is a multiple of normal stats, unless specifically set. (You may wish to set it down as well as up, remember). That way, the default behaviour improves even when the index stats parameter is not actually set, yet is still controllable when you do. Best Regards, Simon Riggs
pgsql-hackers by date: