raising the default default_statistics_target

From
Neil Conway
Date:
 From time to time, people on IRC ask for help with performance 
problems, and the cause of the difficulty is ultimately traced to a 
poor query plan that is chosen because default_statistics_target is 
too low. While there will always need to be *some* tuning of the 
statistics target by advanced users, I wanted to see what the 
performance penalty would be to increase the default stats target out 
of the box -- that way, more users will get good query plans without 
needing to manually tweak the configuration.
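
For reference, the manual tweaking in question looks roughly like this (a sketch; 'abc' is the test table described below, but 'some_col' is a hypothetical column name):

```sql
-- Raise the default for the current session before running ANALYZE ...
SET default_statistics_target = 25;

-- ... or raise it persistently for one problem column:
ALTER TABLE abc ALTER COLUMN some_col SET STATISTICS 100;

-- Either way, new statistics are only gathered at the next ANALYZE:
ANALYZE abc;
```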

In the simple test I performed, raising the default_statistics_target 
from 10 to 25 resulted in a 40% increase in the time to ANALYZE a 
large table. (I picked 25 more or less at random -- would 15 or 20 be 
better?)

That's a larger hit than I was hoping to see; however, it can be 
argued that ANALYZE isn't really performance-critical anyway (since it 
doesn't hold the same kind of locks that VACUUM and especially VACUUM 
FULL hold). Also, I only have anecdotal evidence that this is actually 
a problem. It may also be the case that for those people for whom 10 
is an insufficient stats target, 25 is also insufficient.

Any comments on whether increasing the default stats target is a good 
idea for 7.5? (Details on the test I performed are included below)

-Neil

I created a 2.1 GB table with 3 columns (int, varchar, and float):

nconway=# select relpages from pg_class where relname = 'abc';
 relpages
----------
   279621
(1 row)

nconway=# select reltuples from pg_class where relname = 'abc';
  reltuples
-------------
 3.35545e+07
(1 row)

I tested two default_statistics_target settings: 10 (the current 
default) and 25. The test machine is a P4 1.8 GHz with 768 MB of RAM 
and a pretty mediocre 7200 RPM IDE disk running Linux 2.6.3. I 
rebooted the machine before and between tests.

ANALYZE w/ stats target = 10: 51.643 seconds
ANALYZE w/ stats target = 25: 71.969 seconds

(Additional tests performed w/o rebooting seem to be consistent with 
these numbers.)


Re: raising the default default_statistics_target

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> Any comments on whether increasing the default stats target is a good 
> idea for 7.5? (Details on the test I performed are included below)

This is something we need to consider, but we'll need more evidence
before making a choice.  One thing that we have very little data about
is how much difference it makes in the quality of planner choices.
(There's no point in slowing down ANALYZE unless the plans get better.)

Also, I would expect that larger stats targets would slow down the parts
of the planner that look at the stats, since there are more data values
to examine.  I do not have any numbers about this cost though --- do you
want to try to get some?
        regards, tom lane


Re: raising the default default_statistics_target

From
Neil Conway
Date:
Tom Lane wrote:
> This is something we need to consider, but we'll need more evidence
> before making a choice.  One thing that we have very little data about
> is how much difference it makes in the quality of planner choices.

Right, but is there a practical way to actually get this data?

If the distribution of data in the table is irregular, a higher stats 
target is needed to allow good planning choices. Therefore, the effect 
that the stats target has on planner choices depends on the regularity 
of the distribution of data at installations, and there is no way to 
know that in general AFAICS.

> Also, I would expect that larger stats targets would slow down the parts
> of the planner that look at the stats, since there are more data values
> to examine.  I do not have any numbers about this cost though --- do you
> want to try to get some?

Given the magnitude of the change (25 data elements versus 10), I 
wouldn't expect this to produce a major change in the total runtime of 
the optimizer. However, I don't know the optimizer that well, so I'll 
do some benchmarks when I get a chance.

-Neil


Re: raising the default default_statistics_target

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> Tom Lane wrote:
>> This is something we need to consider, but we'll need more evidence
>> before making a choice.  One thing that we have very little data about
>> is how much difference it makes in the quality of planner choices.

> Right, but is there a practical way to actually get this data?

I haven't thought of one yet, but perhaps someone will have an idea.


>> Also, I would expect that larger stats targets would slow down the parts
>> of the planner that look at the stats, since there are more data values
>> to examine.  I do not have any numbers about this cost though --- do you
>> want to try to get some?

> Given the magnitude of the change (25 data elements versus 10), I 
> wouldn't expect this to produce a major change in the total runtime of 
> the optimizer.

I wouldn't either, but if we need to raise the stats target to 100 or
1000 to make a meaningful difference, then the question becomes more
urgent.
        regards, tom lane


Re: raising the default default_statistics_target

From
Josh Berkus
Date:
Neil,

> In the simple test I performed, raising the default_statistics_target 
> from 10 to 25 resulted in a 40% increase in the time to ANALYZE a 
> large table. (I picked 25 more or less at random -- would 15 or 20 be 
> better?)

I find that very interesting, since I haven't found much higher settings to 
impose a proportionate penalty.  For example, on an 11-column table, raising 3 
columns to statistics = 250 merely doubled the ANALYZE time.  I have not done 
exact timing, but would be happy to ....

> It may also be the case that for those people for whom 10 
> is an insufficient stats target, 25 is also insufficient.

It is.   I've found that "problem" queries, especially those caused by real, 
uneven distribution of data, require raising statistics to 150-400 in order 
to fix them.  That is much too high a level to assign as a default.

> Any comments on whether increasing the default stats target is a good 
> idea for 7.5? (Details on the test I performed are included below)

No.   I don't think it's a good idea to raise the default for *all* columns; 
for one thing, I'd really hate to think what, say, a default stats of 100 
would do to a TEXT column with an average of 8K of data per row.

Further, in 7.5 we'll be introducing correlated stats for multi-column indexes 
(unless something's gone off with that?), which should help a lot of problem 
queries and shift our whole emphasis away from brute-forcing ANALYZE by 
raising stats into the hundreds.

If you really want to tackle this issue, though, here's what I suggest:

1) Add a GUC called default_statistics_indexed, which starts at, say, 100 or 50.
2) Whenever the user indexes a column, automatically increase its stats target
to the level in default_statistics_indexed, if it is currently at the level in 
default_statistics_target.

This will then give indexed columns "automatically" a somewhat higher level of 
stats analysis than other columns.   This should help a lot of "slow query" 
problems, yet effectively leave the selection of "important" columns in the 
hands of the DBA.   Make sense?
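
A DBA can approximate that by hand today; an untested sketch (the catalog idiom may need adjustment depending on the release) that lists a table's indexed columns, whose targets one would then raise individually:

```sql
-- List the indexed columns of table 'abc':
SELECT a.attname
FROM pg_index i
JOIN pg_attribute a
  ON a.attrelid = i.indrelid
 AND a.attnum = ANY (i.indkey)
WHERE i.indrelid = 'abc'::regclass;

-- Then, for each column returned:
-- ALTER TABLE abc ALTER COLUMN <column> SET STATISTICS 100;
```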

Also, another great feature in this department would be to extend the 
multi-column correlation statistics to cover foreign keys, as a way of 
improving cross-table estimates.

Anyway, keep me in the loop on this, I have a lot of very complex databases I 
can test such issues on.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco



Re: raising the default default_statistics_target

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> It is.   I've found that "problem" queries, especially those caused by real, 
> uneven distribution of data, require raising statistics to 150-400 in order 
> to fix them.  That is much too high a level to assign as a default.

That's basically what's bothering me about the suggestion to increase to
25 --- I'm dubious that it will do any good.

> Further, in 7.5 we'll be introducing correlated stats for multi-column indexes 
> (unless something's gone off with that?)

News to me.  It's certainly not there now.

> This will then give indexed columns "automatically" a somewhat higher
> level of stats analysis than other columns.

That is potentially a good idea.  There's still the question of what is
a reasonable default, though.
        regards, tom lane


Re: raising the default default_statistics_target

From
Bruce Momjian
Date:
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > It is.   I've found that "problem" queries, especially those caused by real, 
> > uneven distribution of data, require raising statistics to 150-400 in order 
> > to fix them.  That is much too high a level to assign as a default.
> 
> That's basically what's bothering me about the suggestion to increase to
> 25 --- I'm dubious that it will do any good.
> 
> > Further, in 7.5 we'll be introducing correlated stats for multi-column indexes 
> > (unless something's gone off with that?)
> 
> News to me.  It's certainly not there now.
> 
> > This will then give indexed columns "automatically" a somewhat higher
> > level of stats analysis than other columns.
> 
> That is potentially a good idea.  There's still the question of what is
> a reasonable default, though.

Do all the columns have to have the same number of statistics buckets? 
Could the stats collector adjust the number of buckets based on the
data somehow?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: raising the default default_statistics_target

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Do all the columns have to have the same number of statistics buckets? 

They do not, but the effort spent by ANALYZE is proportional to the
largest stats target among all the columns of the table.
        regards, tom lane
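
[In practical terms, per-column targets can differ freely, but the lowest one buys nothing back on the ANALYZE side; a sketch with hypothetical column names:]

```sql
-- Per-column targets can differ ...
ALTER TABLE abc ALTER COLUMN a SET STATISTICS 10;
ALTER TABLE abc ALTER COLUMN b SET STATISTICS 250;

-- ... but ANALYZE's sampling effort follows the largest target
-- (column b's 250 here), so this costs roughly as much as if
-- every column were at 250:
ANALYZE abc;
```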


Re: raising the default default_statistics_target

From
markw@osdl.org
Date:
On  7 Mar, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
>> Tom Lane wrote:
>>> This is something we need to consider, but we'll need more evidence
>>> before making a choice.  One thing that we have very little data about
>>> is how much difference it makes in the quality of planner choices.
> 
>> Right, but is there a practical way to actually get this data?
> 
> I haven't thought of one yet, but perhaps someone will have an idea.

If DBT-3 is an appropriate test, we can get those EXPLAIN ANALYZE
changes I have working in the STP version and queued up tests varying
the default_statistics_target parameter.

Mark


Re: raising the default default_statistics_target

From
Josh Berkus
Date:
Tom,

> > Further, in 7.5 we'll be introducing correlated stats for multi-column 
> > indexes (unless something's gone off with that?)

This was discussed on Hackers in October, a complete implementation was shown, 
I thought it was committed at that time.   If not, what happened to it?

Dammit, it's impossible to find anything in the archives if you don't have 
some good keywords or at least the author.  Is the author reading this?  Will 
you speak up?

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco



Re: raising the default default_statistics_target

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Further, in 7.5 we'll be introducing correlated stats for multi-column 
> indexes (unless something's gone off with that?)

> This was discussed on Hackers in October, a complete implementation was shown, 
> I thought it was committed at that time.   If not, what happened to it?

Are you sure you're not thinking of stats for functional indexes?
        regards, tom lane


Re: raising the default default_statistics_target

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Do all the columns have to have the same number of statistics buckets? 
> 
> They do not, but the effort spent by ANALYZE is proportional to the
> largest stats target among all the columns of the table.

Could we use previous stats to determine how many buckets to use when
running ANALYZE?  Also, if columns have a different number of buckets,
does that mean that we don't have the same per-query overhead for a
larger stats target?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: raising the default default_statistics_target

From
Josh Berkus
Date:
Tom,

> Are you sure you're not thinking of stats for functional indexes?

Positive.    I even remember seeing that the patch was accepted.

The patch specifically had to do with a multi-column correlation algorithm for 
improving the selectivity of multi-column indexes.

Problem is, with 1400 posts per month August to October, I can't find it, and 
the keywords that I think are obvious don't turn anything up.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco



Re: raising the default default_statistics_target

From
Robert Treat
Date:
I think the thread you're thinking of is on or about this post:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00039.php

Manfred Koizar produced a patch that modified index correlation by
sorting equal key values based on item pointers. The patch went as far
as getting accepted into the patch queue, but Tom raised some doubts
about it and it was subsequently removed. 

Robert Treat

On Mon, 2004-03-08 at 14:41, Josh Berkus wrote:
> Tom,
> 
> > Are you sure you're not thinking of stats for functional indexes?
> 
> Positive.    I even remember seeing that the patch was accepted.
> 
> The patch specifically had to do with a multi-column correlation algorithm for 
> improving the selectivity of multi-column indexes.
> 
> Problem is, with 1400 posts per month August to October, I can't find it, and 
> the keywords that I think are obvious don't turn anything up.
> 
> -- 
> -Josh Berkus
>  Aglio Database Solutions
>  San Francisco
> 
> 

-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL



Re: raising the default default_statistics_target

From
Tom Lane
Date:
Robert Treat <xzilla@users.sourceforge.net> writes:
> I think the thread you're thinking of is on or about this post:
> http://archives.postgresql.org/pgsql-patches/2003-12/msg00039.php
> Manfred Koizar produced a patch that modified index correlation by
> sorting equal key values based on item pointers. The patch went as far
> as getting accepted into the patch queue, but Tom raised some doubts
> about it and it was subsequently removed. 

Hm, that had nothing to do with multi-column correlation though.
I'm at a loss to think of any work that matches with Josh's
recollection.
        regards, tom lane


Re: raising the default default_statistics_target

From
Josh Berkus
Date:
Guys,

> Hm, that had nothing to do with multi-column correlation though.
> I'm at a loss to think of any work that matches with Josh's
> recollection.

Hmmmm .... it's possible that early e-mails about Manfred's patch claimed to 
improve performance for multi-column indexes.    

But it's also possible I'm remembering something else.

Darn it, though!  'cause multi-column correlation is one of our big issues on 
estimates for complex queries.

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco



Re: raising the default default_statistics_target

From
"scott.marlowe"
Date:
On Sun, 7 Mar 2004, Tom Lane wrote:

> Neil Conway <neilc@samurai.com> writes:
> > Tom Lane wrote:
> >> This is something we need to consider, but we'll need more evidence
> >> before making a choice.  One thing that we have very little data about
> >> is how much difference it makes in the quality of planner choices.
> 
> > Right, but is there a practical way to actually get this data?
> 
> I haven't thought of one yet, but perhaps someone will have an idea.

Hi Tom.  I ran some very simple tests on analyze times and query plan 
times on a very simple table, with data randomly distributed.  The index 
was on a date field, since that's what I was testing last.

This was all done on my 512Meg memory 1.1GHz celeron workstation with an 
IDE drive.  I'd love more input on better testing methodologies here...

with 100k or 1M rows that look kinda like this:  (I'll test 10M rows 
later, which means the dataset won't fit in memory, so there'll be lots of 
access going on.  Right now the 1M row table is 80 meg)
select * from test2 limit 5;
                    info                     |         dt          |   id
---------------------------------------------+---------------------+---------
 Francize perfectible swirling fluctuates    | 2004-05-20 20:12:04 | 2721995
 Fields chauffeur attentionality grandmother | 2004-04-07 14:36:02 | 2721996
 Belgium bilked explosively defendant        | 2004-09-16 16:27:22 | 2721997
 perspectives Buenos Pollux discriminates    | 2004-11-11 12:28:31 | 2721998
 Victorianize Savonarola blackmails sufficed | 2004-02-27 21:17:20 | 2721999
(5 rows)

here's what I get with different statistics targets for analyze times:
          100k       1M         1M
          analyze    analyze    plan
target    ms         ms         ms
10        250        875        2
20        350        1250
30        430        1500
40        520        1725
50        580        1900
60        690        2100
70        775        2175
80        850        2300
90        950        2400
100       1000       2600       2.5
200       1806       3700
300       2600       4800
400       2600       5900
500       2600       7200
700       2600       9500
1000      2600       13000      5

Since this data is randomly distributed, I didn't bother doing a lot of 
testing to see how accurate each target setting was.  If that would be 
useful to know I'd gladly test it, but I was only setting out to test the 
time to analyze and the time to plan.

Note that I only tested 3 targets for planning time, as it didn't seem to 
make a very big difference.  The query was:

select * from test2 where dt between 'march 11, 2004' and 'march 13, 2004';

I also ran some quick tests on smaller tables (1000 and 10k rows) and 
there, the plateau that we see in the 100k analyze shows up much quicker, 
at something like 50 or so.  I.e. the analyze time flattened out quickly 
and higher numbers cost very little if anything.

Since this query was quite an easy plan, I'd expect to need a much more 
complex one to test the increase in planning time, say something that has 
to look at a lot of statistics.  Any particular join type or something 
that's likely to do that?




Re: raising the default default_statistics_target

From
Tom Lane
Date:
"scott.marlowe" <scott.marlowe@ihs.com> writes:
> Hi Tom.  I ran some very simple tests on analyze times and query plan 
> times on a very simple table, with data randomly distributed.  The index 
> was on a date field, since that's what I was testing last.

Thanks.

> I also ran some quick tests on smaller tables (1000 and 10k rows) and 
> there, the plateau that we see in the 100k analyze shows up much quicker, 
> at something like 50 or so.  I.e. the analyze time flattened out quickly 
> and higher numbers cost very little if anything.

The sample size is (IIRC) 300 times stats_target rows, so the "plateau"
that you're seeing occurs when the sample size becomes the entire table.
It would be useful to note how large the ANALYZE process got to be during
these runs.

> Since this query was quite an easy plan, I'd expect to need a much more 
> complex one to test the increase in planning time, say something that has 
> to look at a lot of statistics.  Any particular join type or something 
> that's likely to do that?

I'd say try a join on any reasonably plausible foreign-key relationship
(unique key on one side, not-unique data on the other).  That's probably
the most common situation.  As for making it complicated, just stack up
a bunch of such joins ...
        regards, tom lane