Thread: Default Stats Revisited

Default Stats Revisited

From
Josh Berkus
Date:
Folks,

Early on in the default_stats thread, I made a proposal that got dropped 
without discussion.   I'd like to revisit it, because I still think it's a 
good idea.

The Issue:   The low default_stats_target of 10 is not sufficient for many 
complex queries involving multi-column correlation or oddly distributed data.  
Yet modestly increasing the stats target for *all* columns, as demonstrated, 
substantially increases the time required for Analyze, without gain on most 
queries.  

If only there were a way to automatically increas the default stats on only 
"important" columns, and not on other columns!  Yet if we burden the DBA with 
flagging important colummns all over the database, we haven't saved him/her 
any work.

Ah, but there is a way!   Most "important" columns are already indicated ... 
because they are indexed.    If we implemented a system where indexed columns 
would have a significantly higher stats_target than non-indexed columns, this 
might improve our default behavior without overburdening Analyze.

Proposal:  That we consider:-- adding a new GUC default_stats_indexed-- that this GUC be set initially to 100 if
stats_targetis 10-- that the system be adjusted to that indexed columns take their     stats_target from
default_stats_indexedand not default_stats_target-- that expressional indexes be ignored for this purpose, as
implementation   would be too complex, and they have their own stats anyway
 

If this proposal is worth considering, I will spend some time building up a 
test case to demonstrate the cost and utility of the plan.  With Neil's help, 
of course!

-- 
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Default Stats Revisited

From
Josh Berkus
Date:
Scott,

> I like it.  Would a multiplier be acceptable?  
> default_stats_index_multiplier = 10

Yeah, I thought about that, but a multiplier would be harder to manage for 
most people.    I mean, what if your default_stats are at 25 and you want 
your index_stats at 40?   PITA.   Also, if you wanted to increase the default 
stats but *forgot* that the index_stats were a multiplier ...

I think a straight number is less confusing.

-- 
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Default Stats Revisited

From
"scott.marlowe"
Date:
On Wed, 10 Mar 2004, Josh Berkus wrote:

> Folks,
> 
> Early on in the default_stats thread, I made a proposal that got dropped 
> without discussion.   I'd like to revisit it, because I still think it's a 
> good idea.
> 
> The Issue:   The low default_stats_target of 10 is not sufficient for many 
> complex queries involving multi-column correlation or oddly distributed data.  
> Yet modestly increasing the stats target for *all* columns, as demonstrated, 
> substantially increases the time required for Analyze, without gain on most 
> queries.  
> 
> If only there were a way to automatically increas the default stats on only 
> "important" columns, and not on other columns!  Yet if we burden the DBA with 
> flagging important colummns all over the database, we haven't saved him/her 
> any work.
> 
> Ah, but there is a way!   Most "important" columns are already indicated ... 
> because they are indexed.    If we implemented a system where indexed columns 
> would have a significantly higher stats_target than non-indexed columns, this 
> might improve our default behavior without overburdening Analyze.
> 
> Proposal:  That we consider:
>     -- adding a new GUC default_stats_indexed
>     -- that this GUC be set initially to 100 if stats_target is 10
>     -- that the system be adjusted to that indexed columns take their 
>         stats_target from default_stats_indexed and not default_stats_target
>     -- that expressional indexes be ignored for this purpose, as implementation
>         would be too complex, and they have their own stats anyway
> 
> If this proposal is worth considering, I will spend some time building up a 
> test case to demonstrate the cost and utility of the plan.  With Neil's help, 
> of course!

I like it.  Would a multiplier be acceptable?  
default_stats_index_multiplier = 10

Of course it would automatically be capped at 1000, etc...

I like the idea that if I'm increasing the default statistics target the 
stats for indexed columns goes up too.



Re: Default Stats Revisited

From
"scott.marlowe"
Date:
On Thu, 11 Mar 2004, Josh Berkus wrote:

> Scott,
> 
> > I like it.  Would a multiplier be acceptable?  
> > default_stats_index_multiplier = 10
> 
> Yeah, I thought about that, but a multiplier would be harder to manage for 
> most people.    I mean, what if your default_stats are at 25 and you want 
> your index_stats at 40?   PITA.   Also, if you wanted to increase the default 
> stats but *forgot* that the index_stats were a multiplier ...
> 
> I think a straight number is less confusing.

But possible more error prone.  If you crank up the default statistics to 
50, but the index default is still 25...  OTOH, you could always have the 
setting of used for index default be whichever is greater... hmmm.



Re: Default Stats Revisited

From
Josh Berkus
Date:
Scott,

> But possible more error prone.  If you crank up the default statistics to 
> 50, but the index default is still 25...  OTOH, you could always have the 
> setting of used for index default be whichever is greater... hmmm.

Well, I'm not 100% opposed to a multiplier.   I'd like to take a poll of DBAs 
to find out which they would find more accessable.   But since most people 
seem to be ignoring this thread, I'm not sure we'll get much response ...

-- 
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Default Stats Revisited

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Well, I'm not 100% opposed to a multiplier.   I'd like to take a poll of DBAs
> to find out which they would find more accessable.   But since most people 
> seem to be ignoring this thread, I'm not sure we'll get much response ...

Maybe you should ask on -admin or -general.  Personally I thought there
wasn't anything to say until someone did some experiments to show
whether an indexed-column differential is really worthwhile and what a
plausible default value would be.  The idea sounds good in the abstract,
but will it really help or just be another useless setting?
        regards, tom lane


Re: Default Stats Revisited

From
Josh Berkus
Date:
Tom,

> Maybe you should ask on -admin or -general.  Personally I thought there
> wasn't anything to say until someone did some experiments to show
> whether an indexed-column differential is really worthwhile and what a
> plausible default value would be.  The idea sounds good in the abstract,
> but will it really help or just be another useless setting?

Yeah, that's our next step, a test.

On Monday,  I hope to have comparative stats for a difficult database on the 3 
solutions (leaving things as-is, raising the general default stats, and doing 
index_stats).

-- 
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Default Stats Revisited

From
Robert Treat
Date:
On Thursday 11 March 2004 14:17, Josh Berkus wrote:
> Tom,
>
> > Maybe you should ask on -admin or -general.  Personally I thought there
> > wasn't anything to say until someone did some experiments to show
> > whether an indexed-column differential is really worthwhile and what a
> > plausible default value would be.  The idea sounds good in the abstract,
> > but will it really help or just be another useless setting?
>
> Yeah, that's our next step, a test.
>
> On Monday,  I hope to have comparative stats for a difficult database on
> the 3 solutions (leaving things as-is, raising the general default stats,
> and doing index_stats).

Do you plan on handeling primary key columns differently (since they are 
likely to be unique and indexed) ?

Also how will you handle column that are part of expressional indexes (where 
foo is true for example) ?

Final thought... I'm a DBA and I think the straight number is simpler, though 
could be convinced to go with whichever is higher... 

Robert Treat
-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL


Re: Default Stats Revisited

From
Josh Berkus
Date:
Robert,

> Do you plan on handeling primary key columns differently (since they are 
> likely to be unique and indexed) ?

The same as any other indexed column.

> Also how will you handle column that are part of expressional indexes (where 
> foo is true for example) ?

See my original proposal.  These columns will be ignored.  Expressions have 
their own stats.

> Final thought... I'm a DBA and I think the straight number is simpler, 
though 
> could be convinced to go with whichever is higher... 

Thanks.

-- 
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Default Stats Revisited

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> Also how will you handle column that are part of expressional indexes (where
>> foo is true for example) ?

> See my original proposal.  These columns will be ignored.  Expressions have 
> their own stats.

Yeah, I see no particular need to increase the stats allocation for a
column merely because it is used in an expression index.  The mechanism
Josh is proposing should cause the default amount of stats collected
*for the expression index* to go up, though.


BTW, there's an as-yet-undone bit of business associated with collecting
stats for expression indexes.  The ANALYZE code will honor an explicit
attstattarget setting for an expressional index column, but there's no
clean way to get that setting into the system.  What works in CVS tip is

regression=# create table foo (f1 int, f2 int);
CREATE TABLE
regression=# create index fooi on foo ((f1 + f2));
CREATE INDEX
regression=# alter table fooi alter column pg_expression_1 set statistics 100;
ALTER TABLE

but I don't much care for this; the arbitrary names that are used for
expressional columns shouldn't be embedded into SQL commands, and doing
"alter table" on an index looks a bit funny as well.

I want to make pg_dump dump these settings, but I'm not feeling
comfortable with having it dump commands that look like the above.
That would nail down the current method of assigning expression column
names as something we could never change without breaking dump scripts.

Can anyone think of a better way?
        regards, tom lane


Re: Default Stats Revisited

From
Bruce Momjian
Date:
Robert Treat wrote:
> On Thursday 11 March 2004 14:17, Josh Berkus wrote:
> > Tom,
> >
> > > Maybe you should ask on -admin or -general.  Personally I thought there
> > > wasn't anything to say until someone did some experiments to show
> > > whether an indexed-column differential is really worthwhile and what a
> > > plausible default value would be.  The idea sounds good in the abstract,
> > > but will it really help or just be another useless setting?
> >
> > Yeah, that's our next step, a test.
> >
> > On Monday,  I hope to have comparative stats for a difficult database on
> > the 3 solutions (leaving things as-is, raising the general default stats,
> > and doing index_stats).
> 
> Do you plan on handeling primary key columns differently (since they are 
> likely to be unique and indexed) ?

Another idea is whether a foreign key column should get extra
statistics?

> Also how will you handle column that are part of expressional indexes (where 
> foo is true for example) ?
> 
> Final thought... I'm a DBA and I think the straight number is simpler, though 
> could be convinced to go with whichever is higher... 

Agreed.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Default Stats Revisited

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Another idea is whether a foreign key column should get extra
> statistics?

In practice, both ends of an FK relationship have to be indexed,
so I don't see that we need any extra special case for that.
        regards, tom lane


Re: Default Stats Revisited

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Another idea is whether a foreign key column should get extra
> > statistics?
> 
> In practice, both ends of an FK relationship have to be indexed,
> so I don't see that we need any extra special case for that.

Do they?  We don't create an index automatically when using REFERENCES.
We do create an index for PRIMARY KEY.

I was just wondering if the REFERENCES column is more sensitive to join
usage and would benefit from more accurate statistics even if it doesn't
have an index.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: Default Stats Revisited

From
Josh Berkus
Date:
Bruce,

> Do they?  We don't create an index automatically when using REFERENCES.
> We do create an index for PRIMARY KEY.
> 
> I was just wondering if the REFERENCES column is more sensitive to join
> usage and would benefit from more accurate statistics even if it doesn't
> have an index.

I don't think so.   If one does not create an index on an FK column, this is 
usually because the column does not have enough of a range of values to 
deserve indexing (for example, 4 potential values across 100,000 records).  
In that case, we would not want to up the statistics either.

And we haven't made it a practice to hand-hold for database designers who 
don't know when to index, so I don't think we should start now.

-- 
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Default Stats Revisited

From
"Simon Riggs"
Date:
>Josh Berkus
> > But possible more error prone.  If you crank up the default
statistics
> to
> > 50, but the index default is still 25...  OTOH, you could always
have
> the
> > setting of used for index default be whichever is greater... hmmm.
> 
> Well, I'm not 100% opposed to a multiplier.   I'd like to take a poll
of
> DBAs
> to find out which they would find more accessable.   But since most
people
> seem to be ignoring this thread, I'm not sure we'll get much response
...

...very interesting thoughts overall. I note that your idea has a very
strong basis and is pretty much suggested to us, since it forms part of
the allowable specification of TPC-H: - so let's do it!
I enclose the following TPC-H text. Section 5.2.8, p.103, TPC Benchmark
H (Decision Support), Standard Specification, revision 2.1.0:
5.2.8 The gathering of statistics is part of the database load (see
Clause 4.3) but it also serves as an important configuration
vehicle, particularly for the query optimizer. In order to satisfy the
requirements of Clause 5.2.7, it is desirable
to collect the same quality of statistics for every column of every
table. However, in order to reduce processing
requirements, it is permissible to segment columns into distinct classes
and base the level of statistics collection for
a particular column on class membership. Class definitions must rely
solely on schema-related attributes of a column
and must be applied consistently across all tables. For example:
. membership in an index;
. leading or other position in an index;
. use in a constraint (including a primary or foreign key
relationships).
Statistics that operate in sets, such as distribution statistics, should
employ a fixed set appropriate to the scale factor
used. Knowledge of the cardinality, values or distribution of a non-key
column as specified in Clause 4 cannot be
used to tailor statistics gathering.

Based upon that, might we add slightly to your index stats suggestion
slightly and include constraint-member columns also?

Overall, the problem you highlighted is:
- if default stats is 10 and we don't think that is very useful, then
clearly that should change, yet we are performance constrained

I would note that DB2 uses default 20, as does Teradata. Oracle uses 75
histogram buckets as default. Clearly, 10 is not the "accepted"
view...but I'm sure I'll be shot down for such thinking.

You set me thinking about another possible solution:
My understanding is that the default for stats on PostgreSQL is
"collect", whereas on other systems it is "don't collect" (until
instructed). i.e. we collect stats on all columns by default. That's
good, but the end result is that it is slower than the other default.
[That might be important because dynamic sampling is not yet
implemented, not sure]

If we had the option not to collect stats at all on most columns, then
that would speed things up, wouldn't it (just as TPC-H pretty much
says). Perhaps it might be better to offer an option to alter that
default? It seems that DEFAULT_STATISTICS_TARGET cannot be set to "no
thanks". If you could turn off the collection of what we seem to be
agreeing is a relatively pointless collection of statistics, would that
not improve stats collection performance? If set to 0, only collect
number of nulls, number of distinct values, max and min.

Turning off stats-by-default and yet ramping up the collection target
where it is needed sounds good for TPC-H/DBT-3 environments - and will
likely even improve perf numbers on the actual tests! 

..this might also allow us to rename the two parameters, just as was
recently done with work_mem etc..

statistics_target_default    0  -- i.e. don't collect by default
statistics_target_index     100

I would also suggest a further class of columns for statistics
collection:

statistics_target_text

which would include all fields with length > 32 (pick a limit...), since
these are very frequently all unique.
Setting these to -1 would mean they use the statistics_target_default
value, which would be their default setting.

You might even provide a mechanism for defining statistics collection
classes based upon their datatype...e.g.

CREATE STATISTICS CLASS <CLASSNAME>
STATISTICS COLLECTION TARGET 100
APPLIES TO (LEADING n) INDEX COLUMNS, CONSTRAINT COLUMNS;

CREATE STATISTICS CLASS <CLASSNAME>STATISTICS COLLECTION TARGET 0
APPLIES TO <COMMAS SEPARATED LIST OF DATATYPES>;

<anything not specifically identified, would then use system parameter
default>

That would be great, since we now have the ability to collects stats on
user defined datatypes (don't we?).

Back to the multiplier: Yes please, but only if it worked like this:
Normal and index stats are settable differently.
Index stats DEFAULT is a multiple of normal stats, unless specifically
set. (You may wish to set it down as well as up, remember).
That way, the default behaviour improves even when the index stats
parameter is not actually set, yet is still controllable when you do.

Best Regards, Simon Riggs