Re: Default Stats Revisited - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Default Stats Revisited
Date
Msg-id 006501c4081f$47fabe80$509d87d9@LaptopDellXP
Whole thread Raw
In response to Re: Default Stats Revisited  (Josh Berkus <josh@agliodbs.com>)
List pgsql-hackers
>Josh Berkus
> > But possible more error prone.  If you crank up the default
statistics
> to
> > 50, but the index default is still 25...  OTOH, you could always
have
> the
> > setting of used for index default be whichever is greater... hmmm.
> 
> Well, I'm not 100% opposed to a multiplier.   I'd like to take a poll
of
> DBAs
> to find out which they would find more accessable.   But since most
people
> seem to be ignoring this thread, I'm not sure we'll get much response
...

...very interesting thoughts overall. I note that your idea has a very
strong basis and is pretty much suggested to us, since it forms part of
the allowable specification of TPC-H: - so let's do it!
I enclose the following TPC-H text. Section 5.2.8, p.103, TPC Benchmark
H (Decision Support), Standard Specification, revision 2.1.0:
5.2.8 The gathering of statistics is part of the database load (see
Clause 4.3) but it also serves as an important configuration
vehicle, particularly for the query optimizer. In order to satisfy the
requirements of Clause 5.2.7, it is desirable
to collect the same quality of statistics for every column of every
table. However, in order to reduce processing
requirements, it is permissible to segment columns into distinct classes
and base the level of statistics collection for
a particular column on class membership. Class definitions must rely
solely on schema-related attributes of a column
and must be applied consistently across all tables. For example:
. membership in an index;
. leading or other position in an index;
. use in a constraint (including a primary or foreign key
relationships).
Statistics that operate in sets, such as distribution statistics, should
employ a fixed set appropriate to the scale factor
used. Knowledge of the cardinality, values or distribution of a non-key
column as specified in Clause 4 cannot be
used to tailor statistics gathering.

Based upon that, might we add slightly to your index stats suggestion
slightly and include constraint-member columns also?

Overall, the problem you highlighted is:
- if default stats is 10 and we don't think that is very useful, then
clearly that should change, yet we are performance constrained

I would note that DB2 uses default 20, as does Teradata. Oracle uses 75
histogram buckets as default. Clearly, 10 is not the "accepted"
view...but I'm sure I'll be shot down for such thinking.

You set me thinking about another possible solution:
My understanding is that the default for stats on PostgreSQL is
"collect", whereas on other systems it is "don't collect" (until
instructed). i.e. we collect stats on all columns by default. That's
good, but the end result is that it is slower than the other default.
[That might be important because dynamic sampling is not yet
implemented, not sure]

If we had the option not to collect stats at all on most columns, then
that would speed things up, wouldn't it (just as TPC-H pretty much
says). Perhaps it might be better to offer an option to alter that
default? It seems that DEFAULT_STATISTICS_TARGET cannot be set to "no
thanks". If you could turn off the collection of what we seem to be
agreeing is a relatively pointless collection of statistics, would that
not improve stats collection performance? If set to 0, only collect
number of nulls, number of distinct values, max and min.

Turning off stats-by-default and yet ramping up the collection target
where it is needed sounds good for TPC-H/DBT-3 environments - and will
likely even improve perf numbers on the actual tests! 

..this might also allow us to rename the two parameters, just as was
recently done with work_mem etc..

statistics_target_default    0  -- i.e. don't collect by default
statistics_target_index     100

I would also suggest a further class of columns for statistics
collection:

statistics_target_text

which would include all fields with length > 32 (pick a limit...), since
these are very frequently all unique.
Setting these to -1 would mean they use the statistics_target_default
value, which would be their default setting.

You might even provide a mechanism for defining statistics collection
classes based upon their datatype...e.g.

CREATE STATISTICS CLASS <CLASSNAME>
STATISTICS COLLECTION TARGET 100
APPLIES TO (LEADING n) INDEX COLUMNS, CONSTRAINT COLUMNS;

CREATE STATISTICS CLASS <CLASSNAME>STATISTICS COLLECTION TARGET 0
APPLIES TO <COMMAS SEPARATED LIST OF DATATYPES>;

<anything not specifically identified, would then use system parameter
default>

That would be great, since we now have the ability to collects stats on
user defined datatypes (don't we?).

Back to the multiplier: Yes please, but only if it worked like this:
Normal and index stats are settable differently.
Index stats DEFAULT is a multiple of normal stats, unless specifically
set. (You may wish to set it down as well as up, remember).
That way, the default behaviour improves even when the index stats
parameter is not actually set, yet is still controllable when you do.

Best Regards, Simon Riggs




pgsql-hackers by date:

Previous
From: "Simon Riggs"
Date:
Subject: Re: PITR Functional Design v2 for 7.5
Next
From: Tatsuo Ishii
Date:
Subject: Re: client side syntax error localisation for psql (v1)