Thread: Simple postgresql.conf wizard

Simple postgresql.conf wizard

From
Greg Smith
Date:
One of the long-terms goals I'm working toward is wrapping a "wizard" 
interface around the tuning guidelines described by 
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server now that 
those have gone through a fair amount of peer review.  Attached is a first 
simple version of such a wizard, implemented in Python.  Right now what it 
does is look at the amount of memory in your system and adjust shared_buffers 
and effective_cache_size.  So if we started with this stock configuration:

shared_buffers = 32MB                   # min 128kB or 
#effective_cache_size = 128MB

And run it like this (from my system with 8GB of RAM):
./pg-generate-conf /home/gsmith/data/guc/postgresql.conf

You'd get a new file with these lines in it printed to standard out:

# shared_buffers = 32MB                 # min 128kB or 
#effective_cache_size = 128MB
shared_buffers 1931MB # pg_generate_conf wizard 2008-11-01
effective_cache_size 5793MB # pg_generate_conf wizard 2008-11-01
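
For illustration, the core of what it does so far is not much more than this 
sketch (not the script itself; the os.sysconf keys vary by platform, and the 
1/4 and 3/4 ratios are the ones visible in the output above):

import os

def total_memory_bytes():
    # POSIX-only guess at physical RAM; returns None where os.sysconf
    # doesn't expose these keys (e.g. Windows).
    try:
        return os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
    except (ValueError, OSError, AttributeError):
        return None

mem = total_memory_bytes()
if mem:
    mem_mb = mem // (1024 * 1024)
    print("shared_buffers = %dMB" % (mem_mb // 4))            # ~1/4 of RAM
    print("effective_cache_size = %dMB" % (mem_mb * 3 // 4))  # ~3/4 of RAM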

While I've got a pretty clear vision for what I'm doing with this next and 
will kick off a pgfoundry project real soon, I wanted to throw this out as 
a WIP for feedback at this point.  I was ultimately hoping to one day have 
something like this shipped as a contrib/ module to address the constant 
requests for such a thing.  I know it would be unreasonable to expect 
something in this state to make it into the 8.4 contrib at this late hour. 
But since it's independent of the core database stuff I figured I'd make 
it available right at the wire here today in the off chance that did seem 
a reasonable proposition to anybody.  It is already a big improvement over 
no tuning at all, and since it's a simple-to-change script I will rev this 
based on feedback pretty fast now that the most boring guts are done.

Possible feedback topics:

-Setting the next round of values requires asking the user for some input 
before making recommendations.  Is it worth building a curses-based 
interface to updating the values?  That would be really helpful for people 
with only ssh access to their server, but barring something like WCurses 
I'm not sure it would help on Windows.

-How about a GUI one with Python's Tkinter interface?  Now Windows isn't a 
problem, but people using ssh aren't going to be as happy.

-I'm not sure if there's any useful replacement for the os.sysconf 
interface I'm using to grab the memory information on the popular Windows 
Python ports.  Some of the other projects I looked at that tried to 
abstract that OS interaction more didn't seem much better here (e.g. the 
PSI library, which doesn't support Windows either).

-Stepping back a bit from this particular code, is something in Python 
like this ever going to be appropriate to ship as a contrib module? 
There seems to be a bit more traction in this community for using Perl for 
such things; I might do a Perl port of this one day but that's not going 
to happen soon.

I think that's enough flametastic material now, and I do plan to join in 
on patch review in penance for the disruption I've introduced here.  The 
next specific things I'm doing with this regardless is making it read and 
respect the min/max values for settings as well as the rest of the 
information available from pg_settings.  I eventually want to support all 
the syntax suggested for pg_generate_config described at 
http://wiki.postgresql.org/wiki/GUCS_Overhaul but for the moment I'm not 
being so ambitious.  Some of that is aimed at making a pg_generate_conf 
that is capable of replacing the sample postgresql.conf file, which is a 
couple of steps away from where I'm at right now.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Greg Smith wrote:
> While I've got a pretty clear vision for what I'm doing with this next 
> and will kick off a pgfoundry project real soon, I wanted to throw 
> this out as a WIP for feedback at this point.  I was ultimately hoping 
> to one day have something like this shipped as a contrib/ module to 
> address the constant requests for such a thing.  I know it would be 
> unreasonable to expect something in this state to make it into the 8.4 
> contrib at this late hour. But since it's independant of the core 
> database stuff I figured I'd make it available right at the wire here 
> today in the off chance that did seem a reasonable proposition to 
> anybody.  It is already a big improvement over no tuning at all, and 
> since it's a simple to change script I will rev this based on feedback 
> pretty fast now that the most boring guts are done.
I've picked this up to look at it.  Will give feedback once I have Python 
installed.

> -Setting the next round of values requires asking the user for some 
> input before making recommendations.  Is it worth building a 
> curses-based interface to updating the values?  That would be really 
> helpful for people with only ssh access to their server, but barring 
> something like WCurses I'm not sure it would help on Windows.

I'd start with command-line switches, e.g.

config --memory=32GB --type=DW --size=500GB --connections=20
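
Just to sketch the shape of such an interface with the standard optparse
module (the switch names below simply mirror the example and aren't from
any actual tool):

from optparse import OptionParser

parser = OptionParser(usage="%prog [options]")
parser.add_option("-M", "--memory", help="Total system memory, e.g. 32GB")
parser.add_option("-T", "--type", default="Mixed",
                  help="Workload type: Web, OLTP, DW, Mixed, Desktop")
parser.add_option("-S", "--size", help="Expected database size, e.g. 500GB")
parser.add_option("-c", "--connections", type="int",
                  help="Maximum number of expected connections")
(options, args) = parser.parse_args()
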
> -I'm not sure if there's any useful replacement for the os.sysconf 
> interface I'm using to grab the memory information on the popular 
> Windows Python ports.  Some of the other projects I looked at that 
> tried to abstract that OS interaction more didn't seem much better 
> here (i.e. the PSI library which doesn't support Windows either)
I think in initial versions we should just get the info from the admin.  
I've explored the issues around getting OS-independent accurate system 
stats, and they are many.

> -Stepping back a bit from this particular code, is something in Python 
> like this ever going to be appropriate to ship as a contrib module? 
> There seems to be a bit more traction in this community for using Perl 
> for such things; I might do a Perl port of this one day but that's not 
> going to happen soon.

If you do a python version, others will write versions in other 
languages.  I personally don't really care; Perl's main advantage is 
that it's pre-installed on more OSes than Python is.

Anyway, more feedback once I get it running.

--Josh



Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Sun, 2 Nov 2008, Josh Berkus wrote:

> I'd start with command-line switches, e.g.
> config --memory=32GB --type=DW --size=500GB --connections=20

That seems reasonable; I think I'll push a fancier UI onto the back burner 
then and just spec out an options interface like this one.

> I think in initial versions we should just get the info from the admin.  I've 
> explored the issues around getting OS-independant accurate system stats, and 
> they are many.

I'm aware how thorny a perfect solution is here.  One thing that's nice 
about Python is that there are two interfaces for getting system 
information built-in, the os.sysconf I used already and the 
distutils.sysconfig one, aimed more at C-level module writers.  Far as 
I've been able to tell it's not that hard to get something that works on 
most platforms to auto-detect memory and architecture, and I've realized 
the latter is kind of important because it determines how big you can make 
some things on 32-bit platforms.

After some digging I see there isn't any good interface built-in for 
Windows, but it's not hard to ask a DLL for the information.  I think it's 
reasonable to try to detect total memory and bit width, allow overriding 
that, and to error out if detection fails and nothing was specified. 
Should make a good balance of reasonable automatic behavior in a lot of 
cases, while still allowing the admin to adjust.  There's a completely 
valid use-case for that even not considering detection failures, where 
you're building a config file on a system other than the one it's being 
deployed onto.
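
Roughly the flow I have in mind, as a sketch (not the attached code; the
function and option names here are placeholders):

import os
import sys
import platform

def detect_memory_and_bits(memory_override=None):
    """Return (total memory in bytes, architecture bit width)."""
    bits = 64 if platform.architecture()[0] == '64bit' else 32
    if memory_override is not None:
        return memory_override, bits
    try:
        mem = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
    except (ValueError, OSError, AttributeError):
        mem = None
    if mem is None:
        # Detection failed and nothing was specified on the command line
        sys.exit("Could not detect total memory; please pass --memory")
    return mem, bits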

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Hannu Krosing
Date:
On Sun, 2008-11-02 at 17:10 -0800, Josh Berkus wrote:
> Greg Smith wrote:
>
> -Stepping back a bit from this particular code, is something in Python 
> like this ever going to be appropriate to ship as a contrib module?

I think python should be as good as any relatively widespread language.

Unless we require all contrib to be in C.

> There seems to be a bit more traction in this community for using Perl 
> for such things; I might do a Perl port of this one day but that's not 
> going to happen soon.

The advantage of python over perl is that anybody who knows at least one
programming language is able to learn basic python in an afternoon.

> If you do a python version, others will write versions in other 
> languages.  

Yeah, if python is not accepted as contrib, then it can probably be
rewritten in C once it has stabilized enough.

> I personally don't really care; Perl's main advantage is 
> that it's pre-installed on more OSes than Python is.

I think most (if not all) modern OSes' standard setups include both perl
and python. Except of course Windows, which probably includes neither.


------------------------------------------
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training



Re: Simple postgresql.conf wizard

From
Hannu Krosing
Date:
On Sat, 2008-11-01 at 22:02 -0400, Greg Smith wrote:

> Possible feedback topics:
> 
> -Setting the next round of values requires asking the user for some input 
> before making recommendations.  Is it worth building a curses-based 
> interface to updating the values?  That would be really helpful for people 
> with only ssh access to their server, but barring something like WCurses 
> I'm not sure it would help on Windows.
> 
> -How about a GUI one with Python's Tkinter interface?  Now Windows isn't a 
> problem, but people using ssh aren't going to be as happy.

You could have it in two parts - an agent running on the server, started
over ssh and GUI interface running on users workstation which talk to
said agent.

And I'd suggest you use wxPython for GUI part if you want a relatively
modern look.


------------------------------------------
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training



Re: Simple postgresql.conf wizard

From
Tino Wildenhain
Date:
Greg Smith wrote:
> One of the long-terms goals I'm working toward is wrapping a "wizard" 
> interface around the tuning guidelines described by 
> http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server now that 
> those have gone through a fair amount of peer review.  Attached is a 
> first simple version of such a wizard, implemented in Python.  Right now 
> what it does is look the amount of memory in your system and adjust 
> shared_buffers and effective_cache_size.  So if we started with this 
> stock configuration:

That's really nice. I have been playing with that idea also (and in
python) but haven't really had time to do more than some simple tests.
So it's nice to see we have something to base development on.

Thx for starting!
Regards
Tino

Re: Simple postgresql.conf wizard

From
Magnus Hagander
Date:
Greg Smith wrote:
> On Sun, 2 Nov 2008, Josh Berkus wrote:
> 
>> I'd start with command-line switches, e.g.
>> config --memory=32GB --type=DW --size=500GB --connections=20
> 
> That seems reasonable, I think I'll push a fancier UI on the backburner
> then and just spec out an options interface like this one.
> 
>> I think in initial versions we should just get the info from the
>> admin.  I've explored the issues around getting OS-independant
>> accurate system stats, and they are many.
> 
> I'm aware how thorny a perfect solution is here.  One thing that's nice
> about Python is that there are two interfaces for getting system
> information built-in, the os.sysconf I used already and the the
> distutils.sysconfig one, aimed more at C-level module writers.  Far as
> I've been able to tell it's not that hard to get something that works on
> most platforms to auto-detect memory and architecture, and I've realized
> the latter is kind of important because it determines how big you can
> make some things on 32-bit platforms.
> 
> After some digging I see there isn't any good interface built-in for
> Windows, but it's not hard to ask a DDL for the information.  I think
> it's reasonable to try and detect total memory+bit width, allow
> overriding that, and if detection fails and nothing was specified to
> error out. Should make a good balance of reasonable automatic behavior
> in a lot of cases, while still allowing the admin to adjust.  There's a
> completely valid use-case for that even not considering detection
> failures, where you're building a config file on a system other than the
> one it's being deployed onto.

Let me know if you need help to write whatever's needed to detect that
stuff on Windows. It shouldn't be too hard, as long as you can live with
a couple of lines of win32-specific code.. :-)
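
In case it's useful, here's roughly what I mean, as an untested sketch
using ctypes (shipped with Python 2.5) against the Win32
GlobalMemoryStatusEx call:

import ctypes

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [("dwLength", ctypes.c_ulong),
                ("dwMemoryLoad", ctypes.c_ulong),
                ("ullTotalPhys", ctypes.c_ulonglong),
                ("ullAvailPhys", ctypes.c_ulonglong),
                ("ullTotalPageFile", ctypes.c_ulonglong),
                ("ullAvailPageFile", ctypes.c_ulonglong),
                ("ullTotalVirtual", ctypes.c_ulonglong),
                ("ullAvailVirtual", ctypes.c_ulonglong),
                ("ullAvailExtendedVirtual", ctypes.c_ulonglong)]

def windows_total_memory():
    """Total physical RAM in bytes, Windows only."""
    stat = MEMORYSTATUSEX()
    stat.dwLength = ctypes.sizeof(stat)
    ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(stat))
    return stat.ullTotalPhys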

//Magnus



Re: Simple postgresql.conf wizard

From
Magnus Hagander
Date:
Hannu Krosing wrote:
>> If you do a python version, others will write versions in other 
>> languages.  
> 
> Yeah, if python is not accepted as contrib, then it can probably be
> rewritten in C once it has stabilized enough.

It could. The question is if it makes sense to write something like this
in C, really ;) It might get slightly more portable, at the expense of a
lot more work.

I see no reason why we should on principle reject a python based program
from contrib. We already have stuff there in shellscript which is
actually *less* portable... As long as it's not a core utility needed to
get postgresql working, I think it's fine.


>> I personally don't really care; Perl's main advantage is 
>> that it's pre-installed on more OSes than Python is.
> 
> I think most (if not all) modern OS's standard setup includes both perl
> and python. Except of course windows which probably includes neither.

Windows ships with neither of the two languages (and you *really* don't
want to write it in vbscript or jscript which is what it does ship with
- other than .BAT). But they both have easy installers you can use to
get it in there - I don't see this as any difference between the two.

And I'll second the comment that I think most reasonably modern
platforms will ship with both of them. AFAIK, many of the newer linuxen
use python based stuff as part of the core installer functionality, for
example.

//Magnus


Re: Simple postgresql.conf wizard

From
Grzegorz Jaskiewicz
Date:
On 2008-11-02, at 02:02, Greg Smith wrote:

>
> Possible feedback topics:
>
> -Setting the next round of values requires asking the user for some  
> input before making recommendations.  Is it worth building a curses- 
> based interface to updating the values?  That would be really  
> helpful for people with only ssh access to their server, but barring  
> something like WCurses I'm not sure it would help on Windows.
>
> -How about a GUI one with Python's Tkinter interface?  Now Windows  
> isn't a problem, but people using ssh aren't going to be as happy.
Create an interface and a module/library that doesn't reflect the GUI 1:1,
and you'll be able to create an interface in any language/toolkit.
>
>
The idea's great, I certainly know many people would love to see something
like that, because so far the whole 'adjust settings in postgresql.conf' thing
is guesswork. It would be even better to put some work into
auto-tuning of pgsql itself, with manual override in
postgresql.conf :) (kinda like MSSQL).



Re: Simple postgresql.conf wizard

From
"Dickson S. Guedes"
Date:
Hannu Krosing wrote:
> On Sat, 2008-11-01 at 22:02 -0400, Greg Smith wrote:
>   
>> Possible feedback topics:
>>
>> -Setting the next round of values requires asking the user for some input 
>> before making recommendations.  Is it worth building a curses-based 
>> interface to updating the values?  That would be really helpful for people 
>> with only ssh access to their server, but barring something like WCurses 
>> I'm not sure it would help on Windows.
>>
>> -How about a GUI one with Python's Tkinter interface?  Now Windows isn't a 
>> problem, but people using ssh aren't going to be as happy.
>>     
>
> You could have it in two parts - an agent running on the server, started
> over ssh and GUI interface running on users workstation which talk to
> said agent.
>
> And I'd suggest you use wxPython for GUI part if you want a relatively
> modern look.
>   

Or the GUI could be a part of pgAdmin.

-- 
[]s
Dickson S. Guedes
Database Administrator
Projeto Colmeia -  Florianópolis, SC
(48) 3322-1185, ext. 26
http://planeta.postgresql.org.br
http://makeall.wordpress.com


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Mon, 3 Nov 2008, Hannu Krosing wrote:

> And I'd suggest you use wxPython for GUI part if you want a relatively
> modern look.

wxPython is GPL licensed and not popular enough to be available on a lot 
of systems already.  I've spent enough time fighting with installing 
wxWidgets (for pgAdmin) to know that I really don't want to push that 
install onto end-users, and due to the license it's impractical to just 
bundle in some situations.

There's no need for a fancy look here anyway, the only one I ever 
envisioned was a simple grid showing the recommendations the user could 
update before accepting.  In general here, if it doesn't ship with the 
stock Python, there would have to be a really, really compelling reason to 
use any external library that adds more dependencies.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Hannu Krosing
Date:
On Mon, 2008-11-03 at 12:42 -0500, Greg Smith wrote:
> On Mon, 3 Nov 2008, Hannu Krosing wrote:
> 
> > And I'd suggest you use wxPython for GUI part if you want a relatively
> > modern look.
> 
> wxPython is GPL licensed and not popular enough to be available on a lot 
> of systems already.

Wikipedia says thus about wxPython
----------------------------------

Being a wrapper, wxPython uses the same free software licence used by
wxWidgets (LGPL) —which is approved by Free Software Foundation and Open
Source Initiative.


and wxWindows itself seems to be even more relaxed
--------------------------------------------------

http://www.wxwidgets.org/about/newlicen.htm

wxWidgets is currently licensed under the "wxWindows Licence" pending
approval of the "wxWidgets Licence" which will be identical apart from
the name.

The wxWindows Licence is essentially the L-GPL (Library General Public
Licence), with an exception stating that derived works in binary form
may be distributed on the user's own terms. This is a solution that
satisfies those who wish to produce GPL'ed software using wxWidgets, and
also those producing proprietary software.

>   I've spent enough time fighting with installing 
> wxWidgets (for pgAdmin) to know that I really don't want to push that 
> install onto end-users, and due to the license it's impractical to just 
> bundle in some situations.

I don't think that GPL forbids bundling.

> There's no need for a fancy look here anyway, the only one I ever 
> envisioned was a simple grid showing the recommendations the user could 
> update before accepting.  In general here, if it doesn't ship with the 
> stock Python, there would have to be a really, really compelling reason to 
> use any external library that adds more dependencies.

I think that though Tkinter is kind-of included with the standard Python
distribution, it is not always installed by default, or more exactly,
tcl/tk is often not installed.

In that regard I think that using curses is the safest bet.

> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> 
-- 
------------------------------------------
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability   Services, Consulting and Training



Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Sun, 2 Nov 2008, Josh Berkus wrote:

> I'd start with command-line switches, e.g.
> config --memory=32GB --type=DW --size=500GB --connections=20

Attached version takes all its input via command line switches.  If you 
don't specify an explicit number of connections, it also implements setting 
max_connections via some of the logic from your calcfactors spreadsheet. 
I think using more of the ideas from there will drive me crazy unless I 
put that into a text file it pulls in instead of a bunch of if statements. 
I'm not sure if that much of the complexity of that larger model is 
warranted at this point; could use a second opinion on that.

Here's the syntax it accepts now:

$ ./pg-generate-conf --help
Usage: pg-generate-conf [options]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -i INPUTCONFIG, --input-config=INPUTCONFIG
                        Input configuration file
  -o OUTPUTCONFIG, --output-config=OUTPUTCONFIG
                        Output configuration file, defaults to standard output
  -M TOTALMEMORY, --memory=TOTALMEMORY
                        Total system memory, will attempt to detect if
                        unspecified
  -T DBTYPE, --type=DBTYPE
                        Database type, defaults to Mixed, valid options are
                        DSS, OLTP, Web, Mixed, Desktop
  -c CONNECTIONS, --connections=CONNECTIONS
                        Maximum number of expected connections, default
                        depends on database type
  -D, --debug           Enable debugging mode

I've realized I need to duplicate all the memory setting parsing stuff 
from the GUC code (as well as the code in SHOW that picks a reasonable 
output size) as my next step here; that will be in my next update to this 
program.
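
For reference, a sketch of the sort of parsing I mean (not the eventual
code; bare numbers are left alone here since their unit depends on the
particular GUC):

def parse_memory_setting(value):
    """Turn a GUC-style string like '32GB' or '512MB' into bytes."""
    for suffix, factor in (('kB', 1024), ('MB', 1024 ** 2), ('GB', 1024 ** 3)):
        if value.endswith(suffix):
            return int(value[:-len(suffix)]) * factor
    return int(value)  # bare number: unit depends on the setting, handled elsewhere

# parse_memory_setting('32GB') -> 34359738368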

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Greg,

> Attached version takes all its input via command line switches.  If you 
> don't specify an explicit number of connections, it also implements 
> setting max_connections via some of the logic from your calcfactors 
> spreadsheet.

OK, I'll review.  What follows is a review of the *previous* version, 
because I'm currently on the road and didn't see your message to 
-hackers.  Some of the information in the review will still be relevant; 
for one thing, I've simplified the "what color is your application" 
logic to a few calculations.

----------------------------

Review of simple_config.py:

1) don't bother with os.sysconf, or make it optional and error-trap it.
Instead, solicit the following information from the user:
-- Available RAM
-- Expected Database Size (to nearest 10x)
-- Type of Application
   -- Web
   -- Data Warehouse
   -- Mixed
   -- Desktop
-- Operating System [Linux/Windows/OSX/Solaris/FreeBSD/other]
From the above, you can derive all necessary calculations for the 
basics.  In the advanced version, we'll also want to ask:
-- Memory used by other applications on the system?
-- Analyze queries for performance?
-- SSL?
-- Production vs. Development status?
-- How many connections?
-- Logging setup:
   -- Syslog
   -- Analyze Performance
   -- Private log with weekly rotation

2) It's completely unnecessary to account for OS overhead.  This can and 
should be taken into account as part of the calculations for other 
figures.  For example, my 1/4 and 3/4 calculations ignore OS overhead.
You only need to reduce Available RAM when the server will be running
something else, like a J2EE server or multiple databases.

3) You need to provide a whole bunch more values.  shared_buffers and 
effective_cache_size aren't nearly enough.  We should also provide the 
following, based on these calculations and broken down by database type.

(I'm happy to argue out the figures below.  They are simply based on my 
direct tuning experience with a variety of databases and could probably 
use more tweaking for the general case.)

Web / OLTP:
    listen_addresses = '*'
    max_connections = 200
    shared_buffers = 1/4 AvRAM
    effective_cache_size = 3/4 AvRAM
    work_mem = AvRAM / max_connections, round down
    maint_work_mem = AvRAM / 16, round up
    wal_buffers = 8MB
    autovacuum = on
    max_fsm_pages = DBsize / PageSize / 8
    checkpoint_segments = 8
    default_statistics_target = 10
    constraint_exclusion = off

DW:
    listen_addresses = '*'
    max_connections = 20
    shared_buffers = 1/4 AvRAM
    effective_cache_size = 3/4 AvRAM
    work_mem = AvRAM / max_connections / 2, round down
    maint_work_mem = AvRAM / 8, round up
    wal_buffers = 32MB
    autovacuum = off
    max_fsm_pages = DBsize / PageSize / 32  (unless LoadSize is known)
    checkpoint_segments = 64
    default_statistics_target = 400
    constraint_exclusion = on

Mixed:
    listen_addresses = '*'
    max_connections = 80
    shared_buffers = 1/4 AvRAM
    effective_cache_size = 3/4 AvRAM
    work_mem = AvRAM / max_connections / 2, round down
    maint_work_mem = AvRAM / 16, round up
    wal_buffers = 8MB
    autovacuum = on
    max_fsm_pages = DBsize / PageSize / 8
    checkpoint_segments = 16
    default_statistics_target = 100
    constraint_exclusion = on

Desktop:
    listen_addresses = 'localhost'
    max_connections = 5
    shared_buffers = 1/16 AvRAM
    effective_cache_size = 1/4 AvRAM
    work_mem = AvRAM / 32, round down
    maint_work_mem = AvRAM / 16, round up
    wal_buffers = 1MB
    autovacuum = on
    max_fsm_pages = DBsize / PageSize / 8
    checkpoint_segments = 3
    default_statistics_target = 10
    constraint_exclusion = off
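
(One way to carry tables like the above around in a script, rather than as
a pile of if statements, is a dictionary keyed by database type; a sketch,
with the numbers copied from the lists above and the RAM fractions left to
the caller:)

SETTINGS_BY_TYPE = {
    'Web':     {'max_connections': 200, 'wal_buffers': '8MB',
                'checkpoint_segments': 8,  'default_statistics_target': 10,
                'constraint_exclusion': 'off'},
    'DW':      {'max_connections': 20,  'wal_buffers': '32MB',
                'checkpoint_segments': 64, 'default_statistics_target': 400,
                'constraint_exclusion': 'on'},
    'Mixed':   {'max_connections': 80,  'wal_buffers': '8MB',
                'checkpoint_segments': 16, 'default_statistics_target': 100,
                'constraint_exclusion': 'on'},
    'Desktop': {'max_connections': 5,   'wal_buffers': '1MB',
                'checkpoint_segments': 3,  'default_statistics_target': 10,
                'constraint_exclusion': 'off'},
}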
 

4)  Because this comes up so often, we should output to a separate file 
a set of sysctl.conf lines to support SysV memory, depending on OS.


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> DW:
>     default_statistics_target = 400
> Mixed:
>     default_statistics_target = 100

You, my friend, are certifiably insane.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Gregory Stark wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> 
>> DW:
>>     default_statistics_target = 400
>> Mixed:
>>     default_statistics_target = 100
> 
> You, my friend, are certifiably insane.

Hmmm?  Why?  I've used those settings in the field, fairly frequently. 
I was actually wondering if we should raise the default for web as well, 
but decided to let it alone.

Actually, I think a DW should begin at 400; often it needs to go up to 
1000, but I don't think a script should do that.

--Josh



Re: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Greg,

BTW, I think this is still in enough flux that we really ought to make 
it a pgfoundry project.  I don't think we'll have anything ready for 8.4 
contrib.

--Josh


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Gregory Stark wrote:
>> Josh Berkus <josh@agliodbs.com> writes:
>>> DW:
>>> default_statistics_target = 400
>>> Mixed:
>>> default_statistics_target = 100
>> 
>> You, my friend, are certifiably insane.

> Hmmm?  Why?  I've used those settings in the field, fairly frequently. 

Even though we all agree default_statistics_target = 10 is too low,
proposing a 40X increase in the default value requires more evidence
than this.  In particular, the prospect of a 1600-fold increase in
the typical cost of eqjoinsel() is a mite scary.
        regards, tom lane


Re: Simple postgresql.conf wizard

From
Greg Stark
Date:
A statistic target of 400 for a specific column may make sense, but  
even then I would recommend monitoring performance to ensure it  
doesn't cause problems. As a global setting it's, IMHO, ridiculous.

Even for the smaller data types (except boolean and "char") an array  
of 400 entries will be large enough to be toasted. Planning queries will  
involve many more disk I/Os than some of those queries end up taking  
themselves. Even for stats which are already cached there are some  
algorithms in the planner known to be inefficient for large arrays.

It may make sense for specific skewed columns with indexes on them,  
but keep in mind postgres needs to consult the statistics on any  
column referenced in a qual even if there are no indexes and for most  
data distributions do fine with a target of 10.

I think we all agree the default may need to be raised but until there  
is some data we have little basis to recommend anything specific.

I would suggest starting from the basis  that "mixed" (with a  
conservative memory setting) is the same as "Postgres default".  
Perhaps (probably) the defaults should be changed but we shouldn't  
have two different tools with different (drastically different!) ideas  
for the same situation.

greg

On 13 Nov 2008, at 07:46 PM, Josh Berkus <josh@agliodbs.com> wrote:

> Gregory Stark wrote:
>> Josh Berkus <josh@agliodbs.com> writes:
>>> DW:
>>>    default_statistics_target = 400
>>> Mixed:
>>>    default_statistics_target = 100
>> You, my friend, are certifiably insane.
>
> Hmmm?  Why?  I've used those settings in the field, fairly  
> frequently. I was actually wondering if we should raise the default  
> for web as well, but decided to let it alone.
>
> Actually, I think a DW should begin at 400; often it needs to go up  
> to 1000, but I don't think a script should do that.
>
> --Josh
>


Re: Simple postgresql.conf wizard

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> 
>> DW:
>>     default_statistics_target = 400
>> Mixed:
>>     default_statistics_target = 100
> 
> You, my friend, are certifiably insane.

I almost fell off the chair because of that comment, but after I stopped 
laughing and actually looked at those values, it doesn't seem that 
unreasonable. Arbitrary, sure, but not insane. Or do I need stronger 
glasses?

A lot of people have suggested raising our default_statistics_target, 
and it has been rejected because there's some O(n^2) behavior in the 
planner, and it makes ANALYZE slower, but it's not that crazy.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> A lot of people have suggested raising our default_statistics target, 
> and it has been rejected because there's some O(n^2) behavior in the 
> planner, and it makes ANALYZE slower, but it's not that crazy.

I think everyone agrees it ought to be raised.  Where the rubber meets
the road is deciding just *what* to raise it to.  We've got no
convincing evidence in favor of any particular value.

If someone actually wanted to put some effort into this, I'd suggest
taking some reasonably complex benchmark (maybe TPCH or one of the DBT
series) and plotting planner runtime for each query as a function of
statistics_target, taking care to mark the breakpoints where it shifted
to a better (or worse?) plan due to having better stats.
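
(A rough sketch of such a measurement loop, with psycopg2 assumed as the
driver; the catalog join below is just a stand-in for the real benchmark
queries, and client-side timing of EXPLAIN only approximates planner
runtime:)

import time
import psycopg2  # assumed driver, not part of anything proposed here

# Stand-in query; the interesting numbers would come from TPCH/DBT queries.
QUERIES = ["SELECT * FROM pg_class c JOIN pg_attribute a ON a.attrelid = c.oid"]

conn = psycopg2.connect("dbname=bench")
cur = conn.cursor()
for target in (10, 25, 50, 100, 200, 400):
    cur.execute("SET default_statistics_target = %d" % target)
    cur.execute("ANALYZE")
    for i, query in enumerate(QUERIES):
        start = time.time()
        cur.execute("EXPLAIN " + query)   # time spent here is mostly planning
        cur.fetchall()
        print("target=%d query=%d plan_time=%.1f ms"
              % (target, i, (time.time() - start) * 1000))
conn.close()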
        regards, tom lane


Re: Simple postgresql.conf wizard

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> A lot of people have suggested raising our default_statistics target, 
>> and it has been rejected because there's some O(n^2) behavior in the 
>> planner, and it makes ANALYZE slower, but it's not that crazy.
> 
> I think everyone agrees it ought to be raised.  Where the rubber meets
> the road is deciding just *what* to raise it to.  We've got no
> convincing evidence in favor of any particular value.
> 
> If someone actually wanted to put some effort into this, I'd suggest
> taking some reasonably complex benchmark (maybe TPCH or one of the DBT
> series) and plotting planner runtime for each query as a function of
> statistics_target, taking care to mark the breakpoints where it shifted
> to a better (or worse?) plan due to having better stats.

Yeah, that would be a good starting point. After we have some data to 
work with, we could also look into making the planner faster with large 
samples.

Another idea would be to take a large sample in ANALYZE, but if the 
distribution looks "regular enough", store fewer samples in the 
Most-Common-Values list and fewer histograms, to make the planning faster.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Another idea would be to take a large sample in ANALYZE, but if the 
> distribution looks "regular enough", store less samples in the 
> Most-Common-Values list and fewer histograms, to make the planning faster.

Yeah, some flexibility at that end might not be a bad idea either.
        regards, tom lane


Re: Simple postgresql.conf wizard

From
Grzegorz Jaskiewicz
Date:
On 2008-11-13, at 19:33, Greg Stark wrote:

> A statistic target of 400 fir a specific column may make sense but  
> even then I would recommend monitoring performance to ensure it  
> doesn't cause problems. As a global setting it's, IMHO, ridiculous.
>

If that's the situation, methinks you guys have to start thinking  
about some sort of automated way to increase this param per column as  
needed.
Is there any way the planner could actually tell that it would do a  
better job with more stats for a certain column?



Re: Simple postgresql.conf wizard

From
"Jonah H. Harris"
Date:
On Thu, Nov 13, 2008 at 3:20 PM, Grzegorz Jaskiewicz
<gj@pointblue.com.pl> wrote:
> If that's the situation, me thinks you guys have to start thinking about
> some sort of automated way to increase this param per column as needed.
> Is there any way planner could actually tell, that it would do better job
> with more stats for certain column ?

Other systems do it.  For example, Oracle tracks column usage and
attempts to determine the optimal statistics for that column (based on
the queries that used it) on an iterative basis.  We don't track
column usage at all, so that option wouldn't be quite that easy to
implement.  Though, there are certain things ANALYZE would be able to
determine with a little help, such as knowing to collect more samples
for columns it finds extremely skewed data in.

There are other things that could be done as well... so the answer is, yes.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Thu, 13 Nov 2008, Josh Berkus wrote:

> don't bother with os.sysconf, or make it optional and error-trap it.

Right, I've moved in that direction in the updated rev I already 
sent--memory is an input value, but if you leave it out it tries to guess. 
Just need a bit more error trapping around that I think, then I can add 
Windows support too.  A goal I don't think is unreachable here is to give 
a useful middle of the road tuning on most platforms if you just run it 
but don't tell it anything.

> In the advanced version, we'll also want to ask...

I'm pretty focused right now only on getting a simple version solidified, 
to keep the scope under control.  Thanks for the more focused parameter 
suggestions.  The spreadsheet version I had from you was a bit too 
complicated to work with, this reformulation is much easier to automate.

> Because this comes up so often, we should output to a seperate file a 
> set of sysctl.conf lines to support SysV memory, depending on OS.

Agreed, will add that to the required feature list.
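
Something like this sketch is the shape I have in mind for the sysctl part
(Linux parameter names only; the 20% headroom over shared_buffers is just
an assumption for illustration, not a number from anywhere authoritative):

def linux_sysctl_lines(shared_buffers_bytes, page_size=4096):
    """Sketch: SysV shared memory sysctls sized for the chosen shared_buffers."""
    shmmax = int(shared_buffers_bytes * 1.2)  # assumed 20% headroom for other segments
    shmall = shmmax // page_size              # shmall is counted in pages
    return ["kernel.shmmax = %d" % shmmax,
            "kernel.shmall = %d" % shmall]

# e.g. for 2GB of shared_buffers:
#   for line in linux_sysctl_lines(2 * 1024 ** 3): print(line)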

> BTW, I think this is still in enough flux that we really ought to make 
> it a pgfoundry project.  I don't think we'll have anything ready for 8.4 
> contrib.

I find your lack of faith disturbing.  I'll have another rev that 
incorporates all your feedback done within a week.  There are some pretty 
hairy patches still active in this final 'Fest.  I think I'll have the 
simple version feature complete, documented, and have already done a round 
of polishing at least a week or two before all that work wraps up.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Simon Riggs
Date:
On Thu, 2008-11-13 at 20:33 +0100, Greg Stark wrote:

> A statistic target of 400 fir a specific column may make sense but  
> even then I would recommend monitoring performance to ensure it  
> doesn't cause problems. As a global setting it's, IMHO, ridiculous.
> 
> Even for the smaller data types (except boolean and "char") and array  
> of 400 will be large enough to be toasted. Planning queries will  
> involve many more disk I/Os than some of those queries end up taking  
> themselves. Even for stats which are already cached there are some  
> algorithms in the planner known to be inefficient for large arrays.
> 
> It may make sense for specific skewed columns with indexes on them,  
> but keep in mind postgres needs to consult the statistics on any  
> column referenced in a qual even if there are no indexes and for most  
> data distributions do fine with a target of 10.

Your factual comments are accurate, but for Josh's stated target of Data
Warehousing, a stats target of 400 is not unreasonable in some cases.
What you forget to mention is that sample size is also determined by
stats target and for large databases this can be a more important
consideration than the points you mention.

In any case, saying that somebody is certifiably insane in a public
forum is at best questionable. I would like to see the comment
withdrawn.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Tom,

> Even though we all agree default_statistics_target = 10 is too low,
> proposing a 40X increase in the default value requires more evidence
> than this.  In particular, the prospect of a 1600-fold increase in
> the typical cost of eqjoinsel() is a mite scary.

It's a *completely* acceptable tradeoff for a *data warehousing* 
database, where queries take multiple seconds to execute even under the 
best plans ... and minutes or hours for the worst.  And that's what I'm 
proposing a value of 400 for, if you read my posting rather than just 
Greg's outraged response.

(and yes, I've done this for multiple *production* data warehouses and 
the results have been good)

For a web database, keep it at 10.  It might turn out that an increase 
to 25 or 50 would benefit even web applications, but we don't yet have 
the testing resources to determine that.

Of *course* it would be better for the DBA to go through and set 
statistics column-by-column.  But few will.

--Josh



Re: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Simon,

> In any case, saying that somebody is certifiably insane in a public
> forum is at best questionable. I would like to see the comment
> withdrawn.

Thanks for defending me.  I think Greg threw that at me because he knows 
I'm very difficult to offend, though. I assume that Greg wouldn't make a 
post like that to other members of the community.

--Josh



Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Thu, 13 Nov 2008, Josh Berkus wrote:

>> Even though we all agree default_statistics_target = 10 is too low,
>> proposing a 40X increase in the default value requires more evidence
>> than this.  In particular, the prospect of a 1600-fold increase in
>> the typical cost of eqjoinsel() is a mite scary.
>
> It's a *completely* acceptable tradeoff for a *data warehousing* database, 
> where queries take multiple seconds to execute even under the best plans ... 
> and minutes or hours for the worst.  And that's what I'm proposing a value of 
> 400 for

The idea that planning time is trivial compared with query runtime in a 
data warehouse application is certainly true.  I remain a bit concerned 
about making the target so large for everyone just because they picked 
that option though.  I'd hate to see somebody who doesn't quite understand 
what that term implies get their plan times exploding.

Since Josh's latest parameter model takes a database size as an input, 
perhaps a reasonable way to proceed here is to put the DW model into size 
tiers.  Something like this:

DW default_statistics_target:

db size       setting
<1GB          10
1GB-10GB      50
10GB-100GB    100
100GB-1TB     200
>1TB          400

Going along with my idea that this tool should produce a reasonable result 
with minimal data, I was thinking of making the database size default to 
10GB if there isn't any input given there.  That would give someone who 
specified DW but nothing else a result of 100, which seems a less 
controversial setting.
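
In code that tiering could be as simple as this sketch (cutoffs from the
table above; db_size_gb stands for whatever the size input parses to,
defaulting to 10):

def dw_stats_target(db_size_gb=10):
    """default_statistics_target for the DW profile, by database size in GB."""
    for upper_gb, target in ((1, 10), (10, 50), (100, 100), (1024, 200)):
        if db_size_gb < upper_gb:
            return target
    return 400  # >1TB

# dw_stats_target() -> 100, matching the 10GB default mentioned above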

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
>        listen_addresses = '*'

This doesn't seem like a good thing to autogenerate from a security
perspective.  I think we should not attempt to guess the user's
requirements in this area.

>        max_fsm_pages = DBsize / PageSize / 8

Isn't this moot for 8.4?

...Robert


Re: Simple postgresql.conf wizard

From
Alvaro Herrera
Date:
Josh Berkus wrote:

> Thanks for defending me.  I think Greg threw that at me because he knows  
> I'm very difficult to offend, though. I assume that Greg wouldn't make a  
> post like that to other members of the community.

I would print it and frame it to hang somewhere in the office ... or
maybe get some business cards with it.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Thu, 2008-11-13 at 17:21 -0500, Greg Smith wrote:
> On Thu, 13 Nov 2008, Josh Berkus wrote:

> > BTW, I think this is still in enough flux that we really ought to make 
> > it a pgfoundry project.  I don't think we'll have anything ready for 8.4 
> > contrib.
> 
> I find your lack of faith disturbing.  I'll have another rev that 
> incorporates all your feedback done within a week.  There are some pretty 
> hairy patches still active in this final 'Fest.  I think I'll have the 
> simple version feature complete, documented, and have already done a round 
> of polishing at least a week or two before all that work wraps up.
> 

I agree; if we stick to the actual point (a simple postgresql.conf wizard)
then we shouldn't have any worry about getting it in.

Joshua D. Drake

> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> 
-- 



Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Thu, 2008-11-13 at 18:07 -0500, Greg Smith wrote:
> On Thu, 13 Nov 2008, Josh Berkus wrote:

> Since Josh's latest parameter model takes a database size as an input, 
> perhaps a reasonable way to proceed here is to put the DW model into size 
> tiers.  Something like this:
> 
> DW default_statistics_target:
> 
> db size        setting
> <1GB        10
> 1GB-10GB    50
> 10GB-100GB    100
> 100GB-1TB    200
> >1TB        400
> 
> Going along with my idea that this tool should produce a reasonable result 
> with minimal data, I was thinking of making the database size default to 
> 10GB if there isn't any input given there.  That would give someone who 
> specified DW but nothing else a result of 100, which seems a less 
> controversial setting.
> 

Why are we building wizards for settings that will be configured by
experts? I thought the idea here was:

Simple postgresql.conf wizard

If you are running a DW you are beyond the point of this tool, are you
not?

Joshua D. Drake




> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> 
-- 



Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Thu, 13 Nov 2008, Robert Haas wrote:

>>        listen_addresses = '*'
>
> This doesn't seem like a good thing to autogenerate from a security
> perspective.  I think we should not attempt to guess the user's
> requirements in this area.

Yeah, I don't want to be the guy who flips the switch for being less 
secure by default.  Particularly because it's unlikely to do anything by 
itself--need some changes to pg_hba.conf in most cases.  However, not 
setting listen_addresses to something useful is a common newbie problem. 
I was thinking of producing a warning to standard error with some 
suggestions if listen_addresses isn't set to the usual '*', but not 
actually changing the setting.
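
Something as small as this sketch would do, where current_settings stands 
in for whatever structure the parsed input config ends up in:

import sys

current_settings = {'listen_addresses': 'localhost'}  # placeholder for the parsed input config

if current_settings.get('listen_addresses', 'localhost') != '*':
    sys.stderr.write(
        "Note: listen_addresses only allows local connections; consider\n"
        "listen_addresses = '*' plus matching pg_hba.conf entries if remote\n"
        "clients need to connect.\n")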

>>        max_fsm_pages = DBsize / PageSize / 8
>
> Isn't this moot for 8.4?

At some point this is going to target earlier versions as well so it's 
good to have that intelligence in the app, even if it ends up not being a 
setting that is altered.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Simon Riggs <simon@2ndQuadrant.com> writes:

> Your factual comments are accurate, but for Josh's stated target of Data
> Warehousing, a stats target of 400 is not unreasonable in some cases.
> What you forget to mention is that sample size is also determined by
> stats target and for large databases this can be a more important
> consideration than the points you mention.

Even for data warehousing I would not recommend setting it as a *default*
statistics target, at least not without verifying that it doesn't cause any
problems.

I would certainly consider 400 reasonable for specific columns. But the
default statistics target controls how large a histogram to store for *every*
column. Even columns never used by any clauses or used by clauses which do not
have any indexes on them.

Actually a plausible argument could be made that for data warehousing
databases in particular large values of default_statistics_target are
especially damaging. Queries on these databases are likely to have a large
number of clauses which are not indexed and a large number of joins with
complex join clauses.

Not every data warehouse query runs for hours; what I'm afraid of is
potentially the first time someone pops up complaining how Postgres sucks
because it randomly takes minutes to plan their queries, only to find it's
retrieving kilobytes of data from toasted statistics arrays and performing n^2
comparisons on that data.

> In any case, saying that somebody is certifiably insane in a public
> forum is at best questionable. I would like to see the comment
> withdrawn.

I'm not too nervous that Josh might have actually thought I thought he was
really insane. (Or for that matter that anyone else reading it might have
thought so.)

On the other hand what does occur to me in retrospect is that I regret that I
didn't think about how I was disparaging the importance of mental illness and
hope nobody took offense for that reason.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
>> In any case, saying that somebody is certifiably insane in a public
>> forum is at best questionable. I would like to see the comment
>> withdrawn.
>
> I'm not too nervous that Josh might have actually thought I thought he was
> really insane. (Or for that matter that anyone else reading it might have
> thought so.)
>
> On the other hand what does occur to me in retrospect is that I regret that I
> didn't think about how I was disparaging the importance of mental illness and
> hope nobody took offense for that reason.

I hope so too, but I think we're taking this way too seriously.

...Robert


Re: Simple postgresql.conf wizard

From
Heikki Linnakangas
Date:
Jonah H. Harris wrote:
> On Thu, Nov 13, 2008 at 3:20 PM, Grzegorz Jaskiewicz
> <gj@pointblue.com.pl> wrote:
>> If that's the situation, me thinks you guys have to start thinking about
>> some sort of automated way to increase this param per column as needed.
>> Is there any way planner could actually tell, that it would do better job
>> with more stats for certain column ?
> 
> Other systems do it.  For example, Oracle tracks column usage and
> attempts to determine the optimal statistics for that column (based on
> the queries that used it) on an iterative basis.  We don't track
> column usage at all, so that option wouldn't be quite that easy to
> implement.  Though, there are certain things ANALYZE would be able to
> determine with a little help, such as knowing to collect more samples
> for columns it finds extremely skewed data in.

That kind of feedback loop is a bit dangerous. For starters, it would 
mean that your test system would behave differently than your production 
system, just because you run different queries on it. There's also all 
kinds of weird dynamic behaviors that could kick in. For example, a 
query could run fine for the first few hundred times, but then the 
analyzer notices that a certain column is being accessed frequently and 
decides to increase the stats target for it, which changes the plan, for 
worse. Usually the new plan would be better, but the planner isn't perfect.

> There are other things that could be done as well... so the answer is, yes.

Yes, just have to be careful..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Simple postgresql.conf wizard

From
Simon Riggs
Date:
On Fri, 2008-11-14 at 02:21 +0000, Gregory Stark wrote:

> On the other hand what does occur to me in retrospect is that I regret
> that I didn't think about how I was disparaging the importance of
> mental illness and hope nobody took offense for that reason.

Your comments surprise me because you mentioned to me privately that you
disliked on-list bullies.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Simple postgresql.conf wizard

From
"Dave Page"
Date:
On Fri, Nov 14, 2008 at 8:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Fri, 2008-11-14 at 02:21 +0000, Gregory Stark wrote:
>
>> On the other hand what does occur to me in retrospect is that I regret
>> that I didn't think about how I was disparaging the importance of
>> mental illness and hope nobody took offense for that reason.
>
> Your comments surprise me because you mentioned to me privately that you
> disliked on-list bullies.

It hardly seems like bullying to me - a tongue-in-cheek humorous
remark to someone many of us, including Greg, have known and worked
with for years. If it were made to a newbie to the community, that
would be another matter.

-- 
Dave Page
EnterpriseDB UK:   http://www.enterprisedb.com


Re: Simple postgresql.conf wizard

From
Simon Riggs
Date:
On Fri, 2008-11-14 at 08:57 +0000, Dave Page wrote:
> On Fri, Nov 14, 2008 at 8:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> > On Fri, 2008-11-14 at 02:21 +0000, Gregory Stark wrote:
> >
> >> On the other hand what does occur to me in retrospect is that I regret
> >> that I didn't think about how I was disparaging the importance of
> >> mental illness and hope nobody took offense for that reason.
> >
> > Your comments surprise me because you mentioned to me privately that you
> > disliked on-list bullies.
> 
> It hardly seems like bullying to me - a tongue-in-cheek humorous
> remark to someone many of us, including Greg, have known and worked
> with for years. If it were made to a newbie to the community, that
> would be another matter.

I was not the one complaining about bullying comments.

It seems a strange moderation policy that allows bald insults from some
people but not others. And stranger still that you think you should leap
to the defence of any person making them. If the comments were meant so
lightly, it would seem easy to withdraw them also, with a smile.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Simple postgresql.conf wizard

From
"Dave Page"
Date:
On Fri, Nov 14, 2008 at 9:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Fri, 2008-11-14 at 08:57 +0000, Dave Page wrote:
>> On Fri, Nov 14, 2008 at 8:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> >
>> > On Fri, 2008-11-14 at 02:21 +0000, Gregory Stark wrote:
>> >
>> >> On the other hand what does occur to me in retrospect is that I regret
>> >> that I didn't think about how I was disparaging the importance of
>> >> mental illness and hope nobody took offense for that reason.
>> >
>> > Your comments surprise me because you mentioned to me privately that you
>> > disliked on-list bullies.
>>
>> It hardly seems like bullying to me - a tongue-in-cheek humorous
>> remark to someone many of us, including Greg, have known and worked
>> with for years. If it were made to a newbie to the community, that
>> would be another matter.
>
> I was not the one complaining about bullying comments.

No, but you drew the comparison.

> It seems a strange moderation policy that allows bald insults from some
> people but not others. And stranger still that you think you should leap
> to the defence of any person making them. If the comments were meant so
> lightly, it would seem easy to withdraw them also, with a smile.

No stranger than you leaping to demand an apology. My point was that
I didn't see a deliberate insult, and by the sound of it neither did
Josh. I saw a humorous response that was unfortunately missing a
smiley to make the intent clear for those that haven't been in these
parts for years. I would have thought that you too would know Greg
well enough to know that he wasn't trying to be rude but was making a
joke.

Anyhoo, I don't think this is worth wasting any more bandwidth so I'll
shut up now.

-- 
Dave Page
EnterpriseDB UK:   http://www.enterprisedb.com


Re: Simple postgresql.conf wizard

From
Richard Huxton
Date:
Josh Berkus wrote:
> Greg,
> 
> BTW, I think this is still in enough flux that we really ought to make
> it a pgfoundry project.  I don't think we'll have anything ready for 8.4
> contrib.

[Been trying to find the right point to post this reply.]

Is it only me that thinks this should be a service on the website too
(or even first)? Fill in web form, click button, get sample
postgresql.conf (with comments) back.

Add a tick-box asking if we can keep a copy of their answers and you
might get some useful usage info too.

--
  Richard Huxton
  Archonet Ltd


Re: Simple postgresql.conf wizard

From
Simon Riggs
Date:
On Fri, 2008-11-14 at 02:21 +0000, Gregory Stark wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> 
> > Your factual comments are accurate, but for Josh's stated target of Data
> > Warehousing, a stats target of 400 is not unreasonable in some cases.
> > What you forget to mention is that sample size is also determined by
> > stats target and for large databases this can be a more important
> > consideration than the points you mention.
> 
> Even for data warehousing I would not recommend setting it as a *default*
> statistics target, at least not without verifying that it doesn't cause any
> problems.

I agree with you that it would be likely misused by many people, simply
because the term "data warehouse" is a very loosely defined concept.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Simple postgresql.conf wizard

From
"Jonah H. Harris"
Date:
On Fri, Nov 14, 2008 at 2:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> Other systems do it.  For example, Oracle tracks column usage and
>> attempts to determine the optimal statistics for that column (based on
>> the queries that used it) on an iterative basis.  We don't track
>> column usage at all, so that option wouldn't be quite that easy to
>> implement.  Though, there are certain things ANALYZE would be able to
>> determine with a little help, such as knowing to collect more samples
>> for columns it finds extremely skewed data in.
>
> That kind of feedback loops are a bit dangerous. For starters, it would mean
> that your test system would behave differently than your production system,
> just because you run different queries on it. There's also all kinds of
> weird dynamic behaviors that could kick in. For example, a query could run
> fine for the first few hundred times, but then the analyzer notices that a
> certain column is being accessed frequently and decides to increase the
> stats target for it, which changes the plan, for worse. Usually the new plan
> would be better, but the planner isn't perfect.

Oracle already thought of that a long time ago, which is why the plan
has to come out better for it to take effect.  As for bad plans, you
obviously haven't used Postgres in production enough to deal with it
continually changing plans for the worse due to index bloat, data
skew, phase of the moon, etc. :)

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Simple postgresql.conf wizard

From
Heikki Linnakangas
Date:
Jonah H. Harris wrote:
> Oracle already thought of that a long time ago, which is why the plan
> has to come out better for it to take effect.

Huh? We would never willingly choose a worse plan, of course, but the 
point is that what looks like a better plan, with a smaller cost 
estimate, is sometimes actually worse.

>  As for bad plans, you
> obviously haven't used Postgres in production enough to deal with it
> continually changing plans for the worse due to index bloat, data
> skew, phase of the moon, etc. :)

You're right, I haven't, but yes I know that's a problem. We've chatted 
about that with Greg sometimes. It would be nice to have more stable 
plans. My favorite idea is to stop using the current relation size in 
the planner, and use the value snapshotted at ANALYZE instead. That way, 
the planner would be completely deterministic, based on the statistics. 
Then, we could have tools to snapshot the statistics, move them to a 
test system, store them, revert back to old statistics etc.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Simple postgresql.conf wizard

From
"Jonah H. Harris"
Date:
On Fri, Nov 14, 2008 at 10:50 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>> Oracle already thought of that a long time ago, which is why the plan
>> has to come out better for it to take effect.
>
> Huh? We would never willingly choose a worse plan, of course, but the point
> is that what looks like a better plan, with a smaller cost estimate, is
> sometimes actually worse.

Oracle bases it on cost and elapsed execution time.

>>  As for bad plans, you
>> obviously haven't used Postgres in production enough to deal with it
>> continually changing plans for the worse due to index bloat, data
>> skew, phase of the moon, etc. :)
>
> You're right, I haven't, but yes I know that's a problem. We've chatted
> about that with Greg sometimes. It would be nice to have more stable plans.
> My favorite idea is to stop using the current relation size in the planner,
> and use the value snapshotted at ANALYZE instead. That way, the planner
> would be completely deterministic, based on the statistics. Then, we could
> have tools to snapshot the statistics, move them to a test system, store
> them, revert back to old statistics etc.

Yes, plan stability would be a Good Thing(tm) IMO.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> You're right, I haven't, but yes I know that's a problem. We've chatted 
> about that with Greg sometimes. It would be nice to have more stable 
> plans. My favorite idea is to stop using the current relation size in 
> the planner, and use the value snapshotted at ANALYZE instead.

Sorry, that's a nonstarter.  We used to act that way and it was
completely fatal to plan quality in very many typical scenarios.

IMHO, the only thing worse than an unstable plan is a stable one.
        regards, tom lane


Re: Simple postgresql.conf wizard

From
"Jonah H. Harris"
Date:
On Fri, Nov 14, 2008 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> IMHO, the only thing worse than an unstable plan is a stable one.

Your opinion contradicts the majority of the industry then, I'm
afraid.  Like query hints, people are sometimes smarter than the
optimizer.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> On Fri, Nov 14, 2008 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> IMHO, the only thing worse than an unstable plan is a stable one.

> Your opinion contradicts the majority of the industry then, I'm
> afraid.  Like query hints, people are sometimes smarter than the
> optimizer.

And, very often, they're not --- or more likely, they were smarter than
the optimizer last year, but now conditions have changed.  Failing to
adapt to new conditions is exactly the problem with query hints, and
in general with any insistence that plans should be "stable".
        regards, tom lane


Re: Simple postgresql.conf wizard

From
"Jonah H. Harris"
Date:
On Fri, Nov 14, 2008 at 12:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> And, very often, they're not --- or more likely, they were smarter than
> the optimizer last year, but now conditions have changed.  Failing to
> adapt to new conditions is exactly the problem with query hints, and
> in general with any insistence that plans should be "stable".

Well, at least they didn't have to wait a year to fix the problem.

Similarly, whether or not the plan changed due to bad hints or bad
plans, detecting the change is relatively easy, so I don't really see
an argument based on *why* the plan failed.  In my, and many others',
opinion, if you decide to take your query plan into your own hands,
it's your problem if it fails.  I do agree that hints are a little too
nice and simple, and generally get people into trouble because they're
hard-coded in an app, tend to cause issues later, and are then
difficult to track down.  Oracle solved that years ago as well, which
is why they support more advanced plan stability features than just
hints.

However, given the number of large-scale OLTP sites I've been to, I
can tell you from experience that post-ANALYZE plan changes wreak
complete havoc on a system and in many cases, bring it to its knees.
In those cases, the proper query plan is well-known, and a hint (or
some other form of plan stability) is all that would be required to
prevent it from happening.

This is pretty off-topic for this thread, so I'll postpone the
discussion for 8.5.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Simple postgresql.conf wizard

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> "Jonah H. Harris" <jonah.harris@gmail.com> writes:
>> On Fri, Nov 14, 2008 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> IMHO, the only thing worse than an unstable plan is a stable one.
> 
>> Your opinion contradicts the majority of the industry then, I'm
>> afraid.  Like query hints, people are sometimes smarter than the
>> optimizer.
> 
> And, very often, they're not --- or more likely, they were smarter than
> the optimizer last year, but now conditions have changed.  Failing to
> adapt to new conditions is exactly the problem with query hints, and
> in general with any insistence that plans should be "stable".

Those are both very simplistic views. Yes, the planner often knows 
better. Yes, the DBA also often knows better.

Stable plans are by no means a silver bullet. If a table suddenly 
changes dramatically in size, you certainly do want to replan your 
queries. The typical scenario that people want to avoid with stable 
plans is where a table grows slowly, until it reaches a critical point, 
and the planner chooses a different plan, and the new plan happens to be 
very bad. The scary thing about that is that it can happen at an 
unpredictable time, and it can be really hard to convince the planner to 
choose the old plan again.

It's a tradeoff, for sure. Ideally the planner would be infallible. 
Failing that, it would be nice to provide the DBA some tools to control 
when new plans are taken into use.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Simple postgresql.conf wizard

From
"Mark Wong"
Date:
On Thu, Nov 13, 2008 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> A lot of people have suggested raising our default_statistics target,
>> and it has been rejected because there's some O(n^2) behavior in the
>> planner, and it makes ANALYZE slower, but it's not that crazy.
>
> I think everyone agrees it ought to be raised.  Where the rubber meets
> the road is deciding just *what* to raise it to.  We've got no
> convincing evidence in favor of any particular value.
>
> If someone actually wanted to put some effort into this, I'd suggest
> taking some reasonably complex benchmark (maybe TPCH or one of the DBT
> series) and plotting planner runtime for each query as a function of
> statistics_target, taking care to mark the breakpoints where it shifted
> to a better (or worse?) plan due to having better stats.

Almost there...  I have an MSA70 plugged into the DL380 I have from HP
and I'm trying to find time to get my scripts updated to deal with how
tools have changed over the years...  I'm updating the DBT-2 (tpc-c
kit) I have first.

Regards,
Mark


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Fri, 14 Nov 2008, Richard Huxton wrote:

> Is it only me that thinks this should be a service on the website too
> (or even first)? Fill in web form, click button, get sample
> postgresql.conf (with comments) back.

Sounds familiar...ah, here it is: 
http://archives.postgresql.org/pgsql-performance/2007-06/msg00377.php

I'm ignoring the temptation of working on the UI first and instead staying 
focused on getting the parameter model slimmed down to a readable set of 
code.  Ports to other interfaces should follow--I've already got two 
people who want to build alternate ones lined up.

To give you an idea how overdiscussed this general topic is, I just sent a 
message to Josh suggesting we might put database size into tiers and set 
some parameters based on that.  Guess what?  That was his idea the last 
time around, I subconsciously regurgitated it: 
http://archives.postgresql.org/pgsql-performance/2007-06/msg00602.php

> Add a tick-box asking if we can keep a copy of their answers and you
> might get some useful usage info too.

It's rare I work with a company that wants their internal use of 
PostgreSQL to be public knowledge.  Putting a collection flag on the page 
is just going to spook such commercial users, even if it defaults to off. 
The privacy issues are one reason I put the web-based port far down 
relative to my priorities; another is that I don't do much web application 
development.  But if somebody else wants to run with that, fine by me.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Chris Browne
Date:
dpage@pgadmin.org ("Dave Page") writes:
> On Fri, Nov 14, 2008 at 8:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On Fri, 2008-11-14 at 02:21 +0000, Gregory Stark wrote:
>>
>>> On the other hand what does occur to me in retrospect is that I regret
>>> that I didn't think about how I was disparaging the importance of
>>> mental illness and hope nobody took offense for that reason.
>>
>> Your comments surprise me because you mentioned to me privately that you
>> disliked on-list bullies.
>
> It hardly seems like bullying to me - a tongue-in-cheek humorous
> remark to someone many of us, including Greg, have known and worked
> with for years. If it were made to a newbie to the community, that
> would be another matter.

It's pretty much like the "you'd have to be on crack to do that!",
erm, crack.
-- 
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://linuxfinances.info/info/rdbms.html
Zaphod's just zis guy, you know?


Re: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Greg,

> To give you an idea how overdiscussed this general topic is, I just sent 
> a message to Josh suggesting we might put database size into tiers and 
> set some parameters based on that.  Guess what?  That was his idea the 
> last time around, I subconsciously regurgitated it: 
> http://archives.postgresql.org/pgsql-performance/2007-06/msg00602.php

Heh, no wonder it sounded good.

However, after a year more of experience, I'd suggest that we solicit the 
specific type from the user rather than determining it strictly from 
database size.  The specific elements of a "DW" use-case aren't 
necessarily tied to size.  They are:
* data comes in in large batches rather than individual rows
* small numbers of users
* large complex queries

For example, right now, I'm refactoring a database which is only 15GB, 
but is definitely DW behavior, so we want to keep max_connections to < 
20 and turn autovacuum off.

So I think we should ask the user what kind of DB they have (*with* docs 
which explain what the types mean) and fall back to testing by size if 
the info is not supplied.

Regarding the level of default_stats_target, it sounds like people agree 
that it ought to be raised for the DW use-case, but disagree how much. 
If that's the case, what if we compromise at 50 for "mixed" and 100 for 
DW?  That should allay people's fears, and we can tinker with it when we 
have more data.

--Josh


Re: On-list behavior WAS: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Simon, Stark, others:

> It seems a strange moderation policy that allows bald insults from some
> people but not others. And stranger still that you think you should leap
> to the defence of any person making them. If the comments were meant so
> lightly, it would seem easy to withdraw them also, with a smile.

Let's not get carried away with this.  It is important for us to speak 
up when someone uses forceful language (for the sake of the archives), 
and very important when it's used with someone who's not a native 
English speaker or is a newbie.

However, having made a comment to tone things down, it's best to drop 
it.  Especially when the person targeted is an American and a core team 
member who isn't actually going to be insulted.  An apology is only 
really necessary if someone is hurt.

--Josh



Re: Simple postgresql.conf wizard

From
Bruce Momjian
Date:
Mark Wong wrote:
> On Thu, Nov 13, 2008 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> >> A lot of people have suggested raising our default_statistics target,
> >> and it has been rejected because there's some O(n^2) behavior in the
> >> planner, and it makes ANALYZE slower, but it's not that crazy.
> >
> > I think everyone agrees it ought to be raised.  Where the rubber meets
> > the road is deciding just *what* to raise it to.  We've got no
> > convincing evidence in favor of any particular value.
> >
> > If someone actually wanted to put some effort into this, I'd suggest
> > taking some reasonably complex benchmark (maybe TPCH or one of the DBT
> > series) and plotting planner runtime for each query as a function of
> > statistics_target, taking care to mark the breakpoints where it shifted
> > to a better (or worse?) plan due to having better stats.
> 
> Almost there...  I have a MSA70 plugged into the DL380 I have from HP
> and I'm trying to find time to get my scripts updated to deal with how
> tools have changed over the years...  I'm updating the DBT-2 (tpc-c
> kit) I have first

Yes, please test something.  I am tired of us saying we need to increase
default_statistics_target, but because we don't know the magic number,
we do nothing release after release.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
  + If your life is a hard drive, Christ can be your backup. +


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> Yes, please test something.  I am tired if us saying we need to increase
> default_statistics_target, but because we don't know the magic number,
> we do nothing release after release.

I think the idea that there IS a magic number is the problem.

No amount of testing is ever going to refute the argument that, under
some other workload, a different value might be better.

But that doesn't amount to a reason to leave it the way it is.

...Robert


Re: Simple postgresql.conf wizard

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Robert Haas
> Sent: Wednesday, November 19, 2008 5:05 PM
> To: Bruce Momjian
> Cc: Mark Wong; Tom Lane; Heikki Linnakangas; Gregory Stark; Josh
> Berkus; Greg Smith; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Simple postgresql.conf wizard
>
> > Yes, please test something.  I am tired if us saying we need to
> increase
> > default_statistics_target, but because we don't know the magic
> number,
> > we do nothing release after release.
>
> I think the idea that there IS a magic number is the problem.
>
> No amount of testing is ever going to refute the argument that, under
> some other workload, a different value might better.
>
> But that doesn't amount to a reason to leave it the way it is.

Perhaps a table of experimental data could serve as a rough guideline.



Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
"Dann Corbit" <DCorbit@connx.com> writes:
>> I think the idea that there IS a magic number is the problem.
>> 
>> No amount of testing is ever going to refute the argument that, under
>> some other workload, a different value might better.
>> 
>> But that doesn't amount to a reason to leave it the way it is.

> Perhaps a table of experimental data could serve as a rough guideline.

The problem is not that anyone wants to leave it the way it is.
The problem is that no one has done even a lick of work to identify
a specific number that is demonstrably better than others -- on *any*
scale.  How about fewer complaints and more effort?
        regards, tom lane


Re: Simple postgresql.conf wizard

From
Decibel!
Date:
On Nov 19, 2008, at 11:51 PM, Tom Lane wrote:
> "Dann Corbit" <DCorbit@connx.com> writes:
>>> I think the idea that there IS a magic number is the problem.
>>>
>>> No amount of testing is ever going to refute the argument that,  
>>> under
>>> some other workload, a different value might better.
>>>
>>> But that doesn't amount to a reason to leave it the way it is.
>
>> Perhaps a table of experimental data could serve as a rough  
>> guideline.
>
> The problem is not that anyone wants to leave it the way it is.
> The problem is that no one has done even a lick of work to identify
> a specific number that is demonstrably better than others -- on *any*
> scale.  How about fewer complaints and more effort?

Is there even a good way to find out what planning time was? Is there  
a way to gather that stat for every query a session runs?

The thought occurs to me that we're looking at this from the wrong  
side of the coin. I've never, ever seen query plan time pose a  
problem with Postgres, even without using prepared statements. Anyone  
who actually cares that much about plan time is certainly going to  
use prepared statements, which makes the whole plan time argument  
moot (plan time, not parse time, but of course stats_target doesn't  
impact parsing at all).

What I *have* seen, on many different databases, was problems with  
bad plans due to default_stats_target being too low. Most of the time  
this was solved by simply setting them to 1000. The only case where I  
backed down from that and went with like 100 was a database that had  
150k tables.

We've been talking about changing default_stats_target for at least 2  
or 3 years now. We know that the current value is causing problems.  
Can we at least start increasing it? 30 is pretty much guaranteed to  
be better than 10, even if it's nowhere close to an ideal value. If  
we start slowly increasing it then at least we can start seeing where  
people start having issues with query plan time.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
Decibel! <decibel@decibel.org> writes:
> The thought occurs to me that we're looking at this from the wrong  
> side of the coin. I've never, ever seen query plan time pose a  
> problem with Postgres, even without using prepared statements.

That tells more about the type of queries you tend to run than about
whether there's an issue in general.

> Anyone  
> who actually cares that much about plan time is certainly going to  
> use prepared statements,

This is simply false.  There's a significant performance hit caused
by using prepared statements in many cases where the planner needs
to know the parameter values in order to make good decisions.
        regards, tom lane


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Decibel! <decibel@decibel.org> writes:

> Is there even a good way to find out what planning time was? Is there  a way to
> gather that stat for every query a session runs?

\timing
explain select ...

> The thought occurs to me that we're looking at this from the wrong  side of the
> coin. I've never, ever seen query plan time pose a  problem with Postgres, even
> without using prepared statements. 

I certainly have seen plan times be a problem. I wonder if you have too and
just didn't realize it. With a default_stats_target of 1000 you'll have
hundreds of kilobytes of data to slog through to plan a moderately complex
query with a few text columns. Forget about prepared queries, I've seen plan
times be unusable for ad-hoc interactive queries before.

> We've been talking about changing default_stats_target for at least 2  or 3
> years now. We know that the current value is causing problems.  Can we at least
> start increasing it? 30 is pretty much guaranteed to  be better than 10, even
> if it's nowhere close to an ideal value. If  we start slowly increasing it then
> at least we can start seeing where  people start having issues with query plan
> time.

How would you see anything from doing that? We only hear from people who have
problems so we only see half the picture. You would have no way of knowing
whether your change has helped or hurt anyone.

In any case I don't see "we know that the current value is causing problems"
as a reasonable statement. It's the *default* stats target. There's a reason
there's a facility to raise the stats target for individual columns.

As Dann said, "the idea that there IS a magic number is the problem". *Any*
value of default_stats_target will "cause" problems. Some columns will always
have skewed data sets which require unusually large samples, but most won't
and the system will run faster with a normal sample size for that majority.

The question is what value represents a good trade-off between the costs of
having large stats targets -- longer analyze, more data stored in
pg_statistics, more vacuuming of pg_statistics needed, longer plan times --
and the benefits of having larger stats targets -- fewer columns which need
raised stats targets.

--  Gregory Stark   EnterpriseDB   http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


Re: Simple postgresql.conf wizard

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Greg Stark [mailto:greg.stark@enterprisedb.com] On Behalf Of
> Gregory Stark
> Sent: Tuesday, November 25, 2008 5:06 PM
> To: Decibel!
> Cc: Tom Lane; Dann Corbit; Robert Haas; Bruce Momjian; Mark Wong;
> Heikki Linnakangas; Josh Berkus; Greg Smith; pgsql-
> hackers@postgresql.org
> Subject: Re: [HACKERS] Simple postgresql.conf wizard


[snip]

> As Dann said, "the idea that there IS a magic number is the problem".
> *Any*
> value of default_stats_target will "cause" problems. Some columns will
> always
> have skewed data sets which require unusually large samples, but most
> won't
> and the system will run faster with a normal sample size for that
> majority.

No, it was somebody smarter than me who said that.

My idea was to create some kind of table which shows curves for
different values and then users will have some sort of basis for
choosing.
Of course, the guy who has 40 tables in his join with an average of 7
indexes on each table (each table containing millions of rows) and a
convoluted WHERE clause will have different needs than someone who has
simple queries and small data loads.  The quality of the current
statistical measures stored will also affect the intelligence of the
query preparation process, I am sure.  I do have a guess that larger and
more expensive queries can probably benefit more from larger samples
(this principle is used in sorting, for instance, where the sample I
collect to estimate the median might grow as {for instance} the log of
the data set size).

P.S.
I also do not believe that there is any value that will be the right
answer.  But a table of data might be useful both for people who want to
toy with altering the values and also for those who want to set the
defaults.  I guess that at one time such a table was generated to
produce the initial estimates for default values.
[snip]



Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
"Dann Corbit" <DCorbit@connx.com> writes:
> I also do not believe that there is any value that will be the right
> answer.  But a table of data might be useful both for people who want to
> toy with altering the values and also for those who want to set the
> defaults.  I guess that at one time such a table was generated to
> produce the initial estimates for default values.

Sir, you credit us too much :-(.  The actual story is that the current
default of 10 was put in when we first implemented stats histograms,
replacing code that kept track of only a *single* most common value
(and not very well, at that).  So it was already a factor of 10 more
stats than we had experience with keeping, and accordingly conservatism
suggested not boosting the default much past that.

So we really don't have any methodically-gathered evidence about the
effects of different stats settings.  It wouldn't take a lot to convince
us to switch to a different default, I think, but it would be nice to
have more than none.
        regards, tom lane


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Tue, 2008-11-25 at 20:33 -0500, Tom Lane wrote:
> "Dann Corbit" <DCorbit@connx.com> writes:
> > I also do not believe that there is any value that will be the right
> > answer.  But a table of data might be useful both for people who want to
> > toy with altering the values and also for those who want to set the
> > defaults.  I guess that at one time such a table was generated to
> > produce the initial estimates for default values.
> 
> Sir, you credit us too much :-(.  

Better than not enough :)


> So we really don't have any methodically-gathered evidence about the
> effects of different stats settings.  It wouldn't take a lot to convince
> us to switch to a different default, I think, but it would be nice to
> have more than none.

I know this is not empirical, but really, 150 is very reasonable. Let's
just set it to that by default and be done with it. It won't hurt
anything and if they need more than that, they are already investigating
either via the lists or via a vendor anyway.

Joshua D. Drake


> 
>             regards, tom lane
> 
-- 
PostgreSQL Consulting, Development, Support, Training  503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997



Re: Simple postgresql.conf wizard

From
"Jonah H. Harris"
Date:
On Tue, Nov 25, 2008 at 8:38 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> I don't this is not empirical but really, 150 is very reasonable. Let's
> just set it to that by default and be done with it. It won't hurt
> anything and if they need more than that, they are already investigating
> either via the lists or via a vendor anyway.

Agreed.

-- 
Jonah H. Harris, Senior DBA
myYearbook.com


Re: Simple postgresql.conf wizard -- Statistics idea...

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Tuesday, November 25, 2008 5:33 PM
> To: Dann Corbit
> Cc: Gregory Stark; Decibel!; Robert Haas; Bruce Momjian; Mark Wong;
> Heikki Linnakangas; Josh Berkus; Greg Smith; pgsql-
> hackers@postgresql.org
> Subject: Re: [HACKERS] Simple postgresql.conf wizard
>
> "Dann Corbit" <DCorbit@connx.com> writes:
> > I also do not believe that there is any value that will be the right
> > answer.  But a table of data might be useful both for people who want to
> > toy with altering the values and also for those who want to set the
> > defaults.  I guess that at one time such a table was generated to
> > produce the initial estimates for default values.
>
> Sir, you credit us too much :-(.  The actual story is that the current
> default of 10 was put in when we first implemented stats histograms,
> replacing code that kept track of only a *single* most common value
> (and not very well, at that).  So it was already a factor of 10 more
> stats than we had experience with keeping, and accordingly
conservatism
> suggested not boosting the default much past that.
>
> So we really don't have any methodically-gathered evidence about the
> effects of different stats settings.  It wouldn't take a lot to
> convince
> us to switch to a different default, I think, but it would be nice to
> have more than none.

I do have a statistics idea/suggestion (possibly useful with some future
PostgreSQL 9.x or something):
It is a simple matter to calculate lots of interesting univariate summary
statistics with a single pass over the data (perhaps during a vacuum
full).
For instance with numerical columns, you can calculate mean, min, max,
standard deviation, skew, kurtosis and things like that with a single
pass over the data.
Here is a C++ template I wrote to do that:
http://cap.connx.com/tournament_software/STATS.HPP
It also uses this template:
http://cap.connx.com/tournament_software/Kahan.Hpp
which is a high-accuracy adder.  These things could easily be rewritten
in C instead of C++.
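
For illustration, here is a rough Python sketch of the same single-pass
idea (this is not a port of the C++ template above, just the standard
online-moment update written out; the Kahan compensation is omitted for
brevity):

# Single-pass accumulator for min/max/mean/stddev/skew using the usual
# online (Welford-style) moment updates; only O(1) state per column.
class MomentAccumulator:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the mean
        self.m3 = 0.0
        self.min = None
        self.max = None

    def add(self, x):
        x = float(x)
        self.min = x if self.min is None else min(self.min, x)
        self.max = x if self.max is None else max(self.max, x)
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        self.m3 += term1 * delta_n * (self.n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def stddev(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

    def skewness(self):
        return (self.n ** 0.5) * self.m3 / self.m2 ** 1.5 if self.m2 > 0 else 0.0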

Now, if you store a few numbers calculated in this way, it can be used
to augment your histogram data when you want to estimate the volume of a
request. So (for instance) if someone asks for a scalar that is ">
value" you can look to see what percentage of the tail will hang out in
that neck of the woods using standard deviation and the mean.

I have another, similar idea (possibly useful someday far off in the
future) that I think may have some merit.  The idea is to create a
"statistical index".  This index is updated whenever data values are
modified in any way.  For scalar/ordinal values such as float, integer,
numeric it would simply store and update a statistics accumulator (a
small vector of a few items holding statistical moments, counts and
sums) for the column of interest.   These indexes would be very small
and inexpensive to {for instance} memory map.

For categorical values (things like 'color' or 'department') we might
store the count for the number of items that correspond to a hash in our
statistical index.  It would give you a crude "count distinct" for any
item -- the only caveat being that more than one item could possibly
have the same hash code (we would also keep a count of null items).  If
the count needed to be exact, we could generate a perfect hash for the
data or store each distinct column value in the categorical index with
its hash.  The size of such an index would depend on the data so that
bit or char='m'/'f' for male/female or 'y'/'n' for yes/no indexes would
contain just two counts and a column that is unique would have one hash
paired with the number one for each row of the table (a regular unique
index would clearly be better in such a case, but such distributions
could easily arise).  The value of an index like this is that it is good
where the normal index types are bad (e.g. it is useless to create a
btree index on a bit column or male/female, yes/no -- things of that
nature but a statistics counter would work nicely -- to give you whole
table measures only of course).  The thing that is odd about these
statistics indexes is that they do not even bother to point back to the
data that they represent -- they are just abstract measures of the
general contents.
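
To make the categorical case concrete, a toy sketch of such a counting
structure might look like this in Python (purely illustrative; as noted
above, hash collisions make the counts approximate):

# Toy sketch of the "categorical statistics index" idea: per-column counts
# keyed by a hash of the value, plus a null counter.  Nothing here points
# back at the rows themselves; it is only an abstract summary.
from collections import defaultdict

class CategoricalStatsIndex:
    def __init__(self):
        self.counts = defaultdict(int)
        self.nulls = 0
        self.total = 0

    def add(self, value):
        self.total += 1
        if value is None:
            self.nulls += 1
        else:
            self.counts[hash(value)] += 1

    def remove(self, value):
        self.total -= 1
        if value is None:
            self.nulls -= 1
        else:
            self.counts[hash(value)] -= 1

    def approx_distinct(self):
        # distinct hash codes seen; collisions can make this an undercount
        return sum(1 for c in self.counts.itervalues() if c > 0)

    def fraction_equal(self, value):
        # crude selectivity guess for "column = value"
        return self.counts[hash(value)] / float(self.total) if self.total else 0.0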

It seems to me that this kind of information could be used to improve
the query plans for both categorical values and scalars and also be used
to generate instant answers for some kinds of statistical queries.  If
used only for query planning, the values would not need to be exact, and
could be updated only at vacuum full time or some other convenient time.
The notion behind creation of a stats index would be that we may not
need to maintain this detailed information for every column, but we can
maintain such data for columns that we often filter with in our queries
to get an idea of cardinality for a subset.


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:
> On Tue, 2008-11-25 at 20:33 -0500, Tom Lane wrote:
>> So we really don't have any methodically-gathered evidence about the
>> effects of different stats settings.  It wouldn't take a lot to convince
>> us to switch to a different default, I think, but it would be nice to
>> have more than none.

> I don't this is not empirical but really, 150 is very reasonable. Let's
> just set it to that by default and be done with it.

What happened to the "more than zero evidence" part of the discussion?
        regards, tom lane


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Dann Corbit" <DCorbit@connx.com> writes:
>> I also do not believe that there is any value that will be the right
>> answer.  But a table of data might be useful both for people who want to
>> toy with altering the values and also for those who want to set the
>> defaults.  I guess that at one time such a table was generated to
>> produce the initial estimates for default values.
>
> Sir, you credit us too much :-(.  The actual story is that the current
> default of 10 was put in when we first implemented stats histograms,
> replacing code that kept track of only a *single* most common value
> (and not very well, at that).  So it was already a factor of 10 more
> stats than we had experience with keeping, and accordingly conservatism
> suggested not boosting the default much past that.

I think that's actually too little credit. The sample size is chosen quite
carefully based on solid mathematics to provide a specific confidence interval
estimate for queries covering ranges the size of a whole bucket.

The actual number of buckets is more of an arbitrary choice. It depends entirely
on how your data is distributed and how large a range your queries are
covering. A uniformly distributed data set should only need a single bucket to
generate good estimates. Less evenly distributed data sets need more.

I wonder actually if there are algorithms for estimating the number of buckets
needed for a histogram to achieve some measurable goal. That would close the
loop. It would be much more reassuring to base the size of the sample on solid
statistics than on hunches.

> So we really don't have any methodically-gathered evidence about the
> effects of different stats settings.  It wouldn't take a lot to convince
> us to switch to a different default, I think, but it would be nice to
> have more than none.

I think the difficulty (aside from testing being laborious at the best of
times) is that it's heavily dependent on data sets which are hard to generate
good examples for. Offhand I would think the census data might make a good
starting point -- it should have columns which range from perfectly uniform to
highly skewed.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: Simple postgresql.conf wizard -- Statistics idea...

From
Joshua Tolley
Date:
On Tue, Nov 25, 2008 at 06:59:25PM -0800, Dann Corbit wrote:
> I do have a statistics idea/suggestion (possibly useful with some future
> PostgreSQL 9.x or something):
> It is a simple matter to calculate lots of interesting univarate summary
> statistics with a single pass over the data (perhaps during a vacuum
> full).
> For instance with numerical columns, you can calculate mean, min, max,
> standard deviation, skew, kurtosis and things like that with a single
> pass over the data.

Calculating "interesting univariate summary statistics" and having
something useful to do with them are two different things entirely. Note
also that whereas this is simple for numeric columns, it's a very
different story for non-numeric data types that don't come from a
metric space. That said, the idea of a probability metric space is well
explored in the literature, and may have valuable application. The
current histogram implementation is effectively a description of the
probability metric space the column data live in.

> Now, if you store a few numbers calculated in this way, it can be used
> to augment your histogram data when you want to estimate the volume of a
> request. So (for instance) if someone asks for a scalar that is ">
> value" you can look to see what percentage of the tail will hang out in
> that neck of the woods using standard deviation and the mean.

Only if you know that the data follow a distribution that can be
described accurately with a standard deviation and a mean. If your data
don't follow a Gaussian distribution, this will give you bad estimates.

- Josh / eggyknap

Re: Simple postgresql.conf wizard -- Statistics idea...

From
Decibel!
Date:
On Nov 25, 2008, at 8:59 PM, Dann Corbit wrote:
> It is a simple matter to calculate lots of interesting univarate  
> summary
> statistics with a single pass over the data (perhaps during a vacuum
> full).


I don't think that the problem we have is how to collect statistics  
(well, except for cross-field stuff); the problem is what to actually  
do with them. What we need people to look at is how we can improve  
query plan estimates across the board. Row count estimates, page  
access estimates, the cost estimates for accessing those pages, etc.  
This isn't a coding problem, it's an algorithm problem. It needs  
someone with an advanced (if not expert) grasp of statistics who can  
come up with better ways of estimating these things.

So, if you have a statistics hammer to wield, I think you'll find a  
lot of nails sticking up in the planner code. Hammer on those before  
worrying about additional stats to collect. :)
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




Re: Simple postgresql.conf wizard

From
Decibel!
Date:
On Nov 25, 2008, at 7:06 PM, Gregory Stark wrote:
>> The thought occurs to me that we're looking at this from the  
>> wrong  side of the
>> coin. I've never, ever seen query plan time pose a  problem with  
>> Postgres, even
>> without using prepared statements.
>
> I certainly have seen plan times be a problem. I wonder if you have  
> too and
> just didn't realize it. With a default_stats_target of 1000 you'll  
> have
> hundreds of kilobytes of data to slog through to plan a moderately  
> complex
> query with a few text columns. Forget about prepared queries, I've  
> seen plan
> times be unusable for ad-hoc interactive queries before.


Can you provide any examples?

And no, I've never seen a system where a few milliseconds of plan  
time difference would pose a problem. I'm not saying they don't  
exist, only that I haven't seen them (including 2 years working as a  
consultant).

I'll also make the argument that anyone with a system that does have  
those kind of requirements will have also needed to actually tune  
their config, and tune it well. I can't see them being bothered by  
having to set one more parameter. There are a lot of systems that are  
being impacted by our ultra-low stats target, and a lot of those  
don't necessarily need a lot of hand tuning beyond the stats target.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828




Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Decibel! <decibel@decibel.org> writes:

> On Nov 25, 2008, at 7:06 PM, Gregory Stark wrote:
>>> The thought occurs to me that we're looking at this from the  wrong  side of
>>> the
>>> coin. I've never, ever seen query plan time pose a  problem with  Postgres,
>>> even
>>> without using prepared statements.
>>
>> I certainly have seen plan times be a problem. I wonder if you have  too and
>> just didn't realize it. With a default_stats_target of 1000 you'll  have
>> hundreds of kilobytes of data to slog through to plan a moderately  complex
>> query with a few text columns. Forget about prepared queries, I've  seen plan
>> times be unusable for ad-hoc interactive queries before.
>
> Can you provide any examples?

At the time I couldn't understand what the problem was. In retrospect I'm
certain this was the problem. I had a situation where just running EXPLAIN
took 5-10 seconds. I suspect I had some very large toasted arrays which were
having to be detoasted each time. IIRC I actually reloaded the database with
pg_dump and the problem went away.

> And no, I've never seen a system where a few milliseconds of plan  time
> difference would pose a problem. I'm not saying they don't  exist, only that I
> haven't seen them (including 2 years working as a  consultant).

How many milliseconds does it take to read a few hundred kilobytes of toasted,
compressed data? These can easily be more data than the actual query is going
to read.

Now ideally this will all be cached but the larger the data set the less
likely it will be.

--  Gregory Stark   EnterpriseDB   http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> Decibel! <decibel@decibel.org> wrote: 
> On Nov 25, 2008, at 7:06 PM, Gregory Stark wrote:
>>> The thought occurs to me that we're looking at this from the  
>>> wrong  side of the
>>> coin. I've never, ever seen query plan time pose a  problem with  
>>> Postgres, even
>>> without using prepared statements.
>>
>> I certainly have seen plan times be a problem. I wonder if you have too and
>> just didn't realize it. With a default_stats_target of 1000 you'll have
>> hundreds of kilobytes of data to slog through to plan a moderately complex
>> query with a few text columns. Forget about prepared queries, I've seen plan
>> times be unusable for ad-hoc interactive queries before.
> 
> Can you provide any examples?
> 
> And no, I've never seen a system where a few milliseconds of plan  
> time difference would pose a problem.

When we first brought the statewide circuit court data onto
PostgreSQL, on some early version of 8.1, we tried boosting the
statistics targets for a few dozen important columns, and had to back
off because of plan times up in the 20 to 30 second range.  I hadn't
tried it lately, so I just gave it a go with switching from a default
statistics target of 10 with no overrides to 1000.

The plan time for a fairly complex query which is run over 300,000
times per day went from 55 ms to 315 ms; however, with the particular
search criteria I used (which I knew to be "challenging") the run time
went from something which exceeded my patience tolerance for the test
(over two minutes) to two seconds, so a better plan was definitely
found.

I'm not sure what this suggests in terms of a good default value, but
just to put some numbers out there from a real-world application....

-Kevin


Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: 
> I hadn't
> tried it lately, so I just gave it a go with switching from a default
> statistics target of 10 with no overrides to 1000.

Oh, this was on 8.2.7, Linux, pretty beefy machine.  Do you want the
whole set of config info and the hardware specs, or would that just be
clutter?

-Kevin


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> Even though we all agree default_statistics_target = 10 is too low,
> proposing a 40X increase in the default value requires more evidence
> than this.  In particular, the prospect of a 1600-fold increase in
> the typical cost of eqjoinsel() is a mite scary.

I just did some very quick testing of a couple values for
default_statistics_target on a database for work whose contents I am
not at liberty to share.  Suffice it to say it is a database with
small tables, but a lot of the queries involve many joins.

ANALYZE with default_statistics_target set to 10 takes 13 s.  With
100, 92 s.  With 1000, 289 s.

I tried several representative queries and there was no measurable
change in run-time.  I compared a couple of the plans and the plans
didn't change either.

So then I said - gee, how much can I reduce this?  I looked at two of
the queries in more detail.  The lowest default_statistics_target that
produced the same plan for both queries was 5.  Reducing the
default_statistics_target all the way down to 1 changed the plans, but
the new plans were just as fast as the old plans.

Given the amount of clamor for a higher value for
default_statistics_target, I'm a little surprised by these results.
It may be that the queries I'm running are not ones for which more
statistics generate better plans, but I think they really are
representative of what we run.  Maybe someone could suggest some types
of query that would be likely to helped by better statistics?

A random thought: maybe the reason I'm not seeing any benefit is
because my tables are just too small - most contain at most a few
thousand rows, and some are much smaller.  Maybe
default_statistics_target should vary with the table size?  Something
like, 0.1% of the rows to a maximum of 100...  and then 0.01% of the
rows after that to some higher maximum.... but always a minimum of at
least 10.  I'm guessing that people with really big tables are less
likely to mind longer planning times and more likely to benefit from
finding better plans...
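
Just to make that concrete, here is one way the rule above could be
written down in Python; the percentages and caps are only the numbers
floated in this message, not a tested proposal:

# One possible reading of the scaling rule sketched above.
def suggested_stats_target(row_count, first_cap=100, hard_cap=1000):
    target = row_count * 0.001                    # 0.1% of the rows...
    if target > first_cap:                        # ...up to the first cap,
        extra_rows = row_count - first_cap / 0.001
        target = first_cap + extra_rows * 0.0001  # then only 0.01% after that
    return int(max(10, min(hard_cap, target)))    # never below 10

# e.g. suggested_stats_target(5000) -> 10, suggested_stats_target(1000000) -> 190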

...Robert


Re: Simple postgresql.conf wizard

From
Joshua Tolley
Date:
On Thu, Nov 27, 2008 at 05:15:04PM -0500, Robert Haas wrote:
> A random thought: maybe the reason I'm not seeing any benefit is
> because my tables are just too small - most contain at most a few
> thousand rows, and some are much smaller.  Maybe
> default_statistics_target should vary with the table size?  Something
> like, 0.1% of the rows to a maximum of 100...  and then 0.01% of the
> rows after that to some higher maximum.... but always a minimum of at
> least 10.  I'm guessing that people with really big tables are less
> likely to mind longer planning times and more likely to benefit from
> finding better plans...

Something like this makes an awful lot of sense to me. A higher
statistics target (in theory) better captures the fine details of a
distribution of values; tables with fewer rows are unlikely to
demonstrate fine details (as are tables with categorical, as
opposed to continuous, values). Returning to the original topic of a
configuration creator, perhaps such a tool should ask the user how big
(s)he expects the tables to grow, or perhaps it should be able to
investigate a particular column and recommend a statistics target based
on its distribution, size, etc.

- Josh / eggyknap

Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
"Robert Haas" <robertmhaas@gmail.com> writes:

> ANALYZE with default_statistics_target set to 10 takes 13 s.  With
> 100, 92 s.  With 1000, 289 s.

That is interesting. It would also be interesting to total up the time it
takes to run EXPLAIN (without ANALYZE) for a large number of queries.
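
Something along these lines would do as a crude harness (assuming
psycopg2 and a file of representative queries, neither of which is part
of pgtune or DBT-3; client-side timing also includes the round trip, so
it is only a rough measure):

import time
import psycopg2

# Crude harness: total client-side time spent on EXPLAIN for a batch of
# queries.  Rerun after changing default_statistics_target and re-running
# ANALYZE to compare planning overhead at different targets.
def total_explain_time(dsn, query_file):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    total = 0.0
    for query in open(query_file):
        query = query.strip()
        if not query:
            continue
        start = time.time()
        cur.execute("EXPLAIN " + query)
        cur.fetchall()                 # plan text, discarded
        total += time.time() - start
    conn.close()
    return total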

I did start looking at running the experiment Tom was asking for with DBT3
data sets but, uh, the most recent dbt-3 data generation source I could find
was four years old. Is there an active source repository for this stuff?

> Given the amount of clamor for a higher value for
> default_statistics_target, I'm a little surprised by these results.
> It may be that the queries I'm running are not ones for which more
> statistics generate better plans, but I think they really are
> representative of what we run.  Maybe someone could suggest some types
> of query that would be likely to helped by better statistics?

I think there are a lot of variables here. One is the distribution of the
data. Your data sounds like it's reasonably evenly distributed.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Simple postgresql.conf wizard

From
tomas@tuxteam.de
Date:

On Thu, Nov 27, 2008 at 05:15:04PM -0500, Robert Haas wrote:
> [...]  Maybe
> default_statistics_target should vary with the table size?  Something
> like, 0.1% of the rows to a maximum of 100...  and then 0.01% of the
> rows after that to some higher maximum.... but always a minimum of at
> least 10.  I'm guessing that people with really big tables are less
> likely to mind longer planning times and more likely to benefit from
> finding better plans...

What was the running time of the planner? Quadratic in statistics_target?
Take the square root of table size. Exponential? The logarithm. Or
something like that.

My 1.75 cent
-- tomás


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Tue, 18 Nov 2008, Josh Berkus wrote:

> Regarding the level of default_stats_target, it sounds like people agree 
> that it ought to be raised for the DW use-case, but disagree how much. 
> If that's the case, what if we compromize at 50 for "mixed" and 100 for 
> DW?

That's what I ended up doing.  The attached version of this script and its 
data files (I dumped all the useful bits in the current HEAD pg_settings 
for it to use) now hits all of the initial goals I had for a useful 
working tool here.  Here's a sample run on a system with 8GB of RAM and 
the default "mixed" workload.  I told the tool absolutely nothing:

./pgtune -i ~/data/postgresql.conf

And it came up with a decent mixed starter configuration for this system 
appended to the input postgresql.conf:

default_statistics_target = 50 # pg_generate_conf wizard 2008-11-30
maintenance_work_mem = 480MB # pg_generate_conf wizard 2008-11-30
constraint_exclusion = on # pg_generate_conf wizard 2008-11-30
checkpoint_completion_target = 0.9 # pg_generate_conf wizard 2008-11-30
effective_cache_size = 5632MB # pg_generate_conf wizard 2008-11-30
work_mem = 48MB # pg_generate_conf wizard 2008-11-30
wal_buffers = 8MB # pg_generate_conf wizard 2008-11-30
checkpoint_segments = 16 # pg_generate_conf wizard 2008-11-30
shared_buffers = 1920MB # pg_generate_conf wizard 2008-11-30
max_connections = 80 # pg_generate_conf wizard 2008-11-30

I do plan to make that look a bit prettier.

Loose ends
----------

-Kernel shm suggestions.  The main hard part there is implementing the 
formula from the documentation to figure out the total expected shared 
block size that will be allocated.  That just needs another block of time 
to finish, will fit that in this week.

-Providing hints (to stderr) for things that fall somewhere between 
changes to be made automatically and just things to put into the docs.  I 
want to warn about the implications of the default listen_addresses for 
example.

-Documentation.  I've got the whole outline sketched out and some text 
written, just haven't had time yet to finish implementing that as Docbook.

-Tests on more platforms (this was developed on Linux).  Memory detection 
works on recent (>=2.5) version of Python for Windows now.  I want to do 
some more tests there, on Mac OS X, and on Solaris.  Some *BSD testing 
would also be in order once I'm outputting shm suggestions; I don't have 
any systems to test that platform myself.

Other than the docs and testing, the rest of these are really optional 
anyway.  What is completely done is the tuning model itself and the inputs 
it needs.  I made some small tweaks to what Josh suggested, documented 
below, and consider that frozen from my side except for review feedback. 
I'll get everything mentioned above done next weekend, and expect to 
submit something of commit candidate quality at that time.

Naming and promotion
--------------------

Now that the finish line is in sight, I thought a bit about what to name 
this thing.  The original idea Josh threw out was based on the idea that 
this would generate the postgresql.conf file from scratch, which may still 
happen eventually but is not a goal for this release.  This is really just a 
tuning tool.  When searching for possible names that might fit, one of the 
ones I thought of was already suggested quite some time ago, in the 
original discussion that led to the TODO item I'm trying to complete:

http://archives.postgresql.org/pgsql-advocacy/2003-02/msg00085.php

Since that was the only claim I could find on the name, I've changed the 
name on this new version to "pgtune" and would propose that as its final 
name in contrib.

I'd like to add some text mentioning the availability of this tool to the 
sample postgresql.conf, and I'd certainly be in favor of the suggestion 
from that old message that a note suggesting you run it should even show up 
in the output from initdb.

Tuning model
------------

The actual tuning part of the code is exactly 83 lines right now and 
pretty easy to read; the rest is mainly text and GUC-related baggage. 
For anyone who wants to dig into the model itself, here are the other 
changes I made from what Josh has suggested.

-Since I'm focused right now just on getting a good 8.4 version of this 
tool running, I dropped any attempt at tuning the FSM parameters since 
they're gone.  With that change, there's no need to even ask about 
database size anymore.  I'll put something related to that in a future 
release that also targets 8.3.

-The tool doesn't do anything memory-related if you have less than 256MB 
of RAM.  I wasn't sure how far down this model scaled and may revisit that 
cut-off.

-I was uncomfortable turning off auto-vacuum in the DW case, for fear it 
would cause trouble for a newbie admin who happens to be stuck with such a 
job.  Instead I'm putting a section in the documentation about the 
parameters that might be tuned by hand beyond what the tool does.  That 
can mention that further increases in default_stats_target and disabling 
auto-vacuum might make sense in the DW case.

-I will generate a HINT if listen_addresses isn't set to '*', but the 
tool itself never makes that change.  As I said before in this thread, I 
don't want to be responsible for making a change that might not be noticed 
that impacts system security.

-There was no difference in the Web vs. OLTP parameters.  I added one 
place they diverge:  the OLTP settings now have a higher 
checkpoint_segments parameter.  I felt that 8 is a bit low for that 
application type and increased it to 16.  This makes it possible to spread 
the checkpoint out much more usefully.

-I changed the model for wal_buffers to be 512KB * checkpoint_segments. 
That is the same value in most cases as what was originally suggested, and 
the delta in the ones that changed wasn't large.

-maint_work_mem is now capped at 1GB.  My own tests and others I've seen 
reported on the lists here have suggested its useful upper setting is 
around that range, and I'd hate to see someone run this on a DW system 
with, say, 64GB of RAM and set aside 8GB of RAM just for maint_work_mem.

-checkpoint_completion_target is raised to the maximum you can set it to 
while still having completion with at least one segment of slop for 
rounding.  That's the logic behind why it's only at 0.5 with the default 
of 3 segments.  The actual underlying formula I normally use is to be 
slightly more conservative than
(checkpoint_segments - 1)/checkpoint_segments, capped at 0.9 (see the
short sketch after this list).

-I ignored the "round up/down" suggestions and just round everything down 
using an algorithm that makes it more likely that a larger unit (MB 
instead of kB, GB instead of MB) can be used when displaying the value.
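
For the two formula-based settings above, the model reduces to something
like this (a sketch of the rules as described, not an excerpt of the
pgtune code; the "slightly more conservative" slop is deliberately left
out):

def wal_buffers_kb(checkpoint_segments):
    # 512KB per checkpoint segment
    return 512 * checkpoint_segments

def checkpoint_completion_target(checkpoint_segments):
    # a bit below (segments - 1) / segments, capped at 0.9
    return min(0.9, (checkpoint_segments - 1.0) / checkpoint_segments)

# e.g. with checkpoint_segments = 16 this gives wal_buffers = 8192kB (8MB)
# and a completion target capped at 0.9, matching the sample output above.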

And here are some platform improvements made since my last code drop here:

-Operating system is detected, as is RAM if on an OS that the detection 
works on.  The only parameter you really should need to specify if all 
that works is the application type.  I expect a future version of this 
might support overriding the OS, so that you can run the tuning tool on a 
system other than the server that configuration is going to be deployed 
on.

-Platform bit width is detected (Python looks at how wide a pointer is to 
figure that out), and that's used to figure out whether to load a 32-bit 
based set of information from pg_settings or a 64-bit one.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Sun, 30 Nov 2008, Greg Smith wrote:

> Memory detection works on recent (>=2.5) version of Python for Windows 
> now.

I just realized that the provided configuration is really not optimal for 
Windows users because of the known limitations that prevent larger 
shared_buffers settings from being effective on that platform.  I know 
there are some notes on that subject in the archives that I'll look 
through, but I'd appreciate a suggestion for what a good upper-limit for 
that setting is on Windows.  I also wonder whether any of the other 
parameters have similar restrictions on their useful range.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Martijn van Oosterhout
Date:
On Sun, Nov 30, 2008 at 09:17:37PM -0500, Greg Smith wrote:
> That's what I ended up doing.  The attached version of this script and its
> data files (I dumped all the useful bits in the current HEAD pg_settings
> for it to use) now hits all of the initial goals I had for a useful
> working tool here.  Here's a sample run on a system with 8GB of RAM and
> the default "mixed" workload.  I told the tool absolutely nothing:
>
> ./pgtune -i ~/data/postgresql.conf

Looks very nice.

> shared_buffers = 1920MB # pg_generate_conf wizard 2008-11-30

Do you have a check somewhere to see if this exceeds the total SYSV
memory allowed by the OS?  Otherwise you've just output an unstartable
config.  The output of /sbin/sysctl should tell you.
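
(On Linux that could be as simple as reading kernel.shmmax, something like
this rough sketch, with some headroom since the total request is a bit
larger than shared_buffers alone:)

def shmmax_bytes():
    return int(open("/proc/sys/kernel/shmmax").read())

def fits_in_shmmax(shared_buffers_mb):
    # the 1.1 headroom factor is a guess, not a precise overhead figure
    return shared_buffers_mb * 1024 * 1024 * 1.1 < shmmax_bytes()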

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Mon, 1 Dec 2008, Martijn van Oosterhout wrote:

> Do you have a check somewhere to see if this exceeds the total SYSV
> memory allowed by the OS. Otherwise you've just output an unstartable
> config. The output of /sbin/sysctl should tell you.

Something to address that is listed as the first thing in the "Loose Ends" 
section of the message.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Dave Page"
Date:
On Mon, Dec 1, 2008 at 3:21 AM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Sun, 30 Nov 2008, Greg Smith wrote:
>
>> Memory detection works on recent (>=2.5) version of Python for Windows
>> now.
>
> I just realized that the provided configuration is really not optimal for
> Windows users because of the known limitations that prevent larger
> shared_buffers settings from being effective on that platform.  I know there
> are some notes on that subject in the archives that I'll look though, but
> I'd appreciate a suggestion for what a good upper-limit for that setting is
> on Windows.  I also wonder whether any of the other parameters have similar
> restrictions on their useful range.

It's going to be of little use to 99% of Windows users anyway as it's
written in Python. What was wrong with C?

FWIW though, in some pgbench tests on XP Pro, on a 4GB machine, 512MB
seemed to be consistently the most effective size (out of tests on
32MB, 512MB and 1GB). There wasn't much between 32 and 512 though - my
suspicion is that 128 or 256 would be similarly effective. I didn't
have time to test that though.

-- 
Dave Page
EnterpriseDB UK:   http://www.enterprisedb.com


Re: Simple postgresql.conf wizard

From
Magnus Hagander
Date:
Dave Page wrote:
> On Mon, Dec 1, 2008 at 3:21 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>> On Sun, 30 Nov 2008, Greg Smith wrote:
>>
>>> Memory detection works on recent (>=2.5) version of Python for Windows
>>> now.
>> I just realized that the provided configuration is really not optimal for
>> Windows users because of the known limitations that prevent larger
>> shared_buffers settings from being effective on that platform.  I know there
>> are some notes on that subject in the archives that I'll look though, but
>> I'd appreciate a suggestion for what a good upper-limit for that setting is
>> on Windows.  I also wonder whether any of the other parameters have similar
>> restrictions on their useful range.
> 
> It's going to be of little use to 99% of Windows users anyway as it's
> written in Python. What was wrong with C?

It could be shipped with a bundled python, I guess.

Python is a lot more common in these scenarios than any of the other
"unixy languages" - much more common than perl for example - on Windows
in my experience. It could be because it's easier to bundle or something?


> FWIW though, in some pgbench tests on XP Pro, on a 4GB machine, 512MB
> seemed to be consistently the most effective size (out of tests on
> 32MB, 512MB and 1GB). There wasn't much between 32 and 512 though - my
> suspicion is that 128 or 256 would be similarly effective. I didn't
> have time to test that though.

That's about what I've heard around as well - I don't think I've ever
heard of a case where >512 has actually helped.

//Magnus


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
I just gave this a try and got:

$ ./pgtune
Traceback (most recent call last):
  File "./pgtune", line 20, in <module>
    from ctypes.wintypes import *
  File "/usr/lib/python2.5/ctypes/wintypes.py", line 21, in <module>
    class VARIANT_BOOL(_SimpleCData):
ValueError: _type_ 'v' not supported

This is FC7, installed RPM is python-2.5-15.fc7.

(This is also my #1 beef with Python: too many utterly impenetrable
error messages... or at least impenetrable to me.)

...Robert


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Mon, 1 Dec 2008, Dave Page wrote:

> It's going to be of little use to 99% of Windows users anyway as it's
> written in Python. What was wrong with C?

It's 471 lines of Python code that leans heavily on that language's 
dictionary type to organize everything.  Had I insisted on writing 
directly to C first, it would have been of no use to 100% of all users in 
the 8.4 timeframe because it wouldn't be even remotely close to finished 
by now.

I'd ultimately like to use the Python version as a spec to produce a C 
implementation, because that's the only path to get something like this 
integrated into initdb itself.  There were just too many things to get 
under control for that to be practical just yet.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> I'd ultimately like to use the Python version as a spec to produce a C 
> implementation, because that's the only path to get something like this 
> integrated into initdb itself.

It won't get integrated into initdb in any case: a standalone tool is
the correct thing.  Or do you think people should have to initdb in
order to get fresh advice after changes in the environment?
        regards, tom lane


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Mon, 1 Dec 2008, Robert Haas wrote:

> I just gave this a try and got:
>
> $ ./pgtune
> Traceback (most recent call last):
>  File "./pgtune", line 20, in <module>
>    from ctypes.wintypes import *
>  File "/usr/lib/python2.5/ctypes/wintypes.py", line 21, in <module>
>    class VARIANT_BOOL(_SimpleCData):
> ValueError: _type_ 'v' not supported

Looks like I need to add Python 2.5+Linux to my testing set.  I did not 
expect that the UNIX distributions of Python 2.5 would ship with 
wintypes.py at all.  I think I can fix this on the spot though.  On line 
40, you'll find this bit:

except ImportError:

Change that to the following:

except ImportError,ValueError:

And it should pass that point.  If it doesn't, you can try the completely 
general:

except:

And then it should be OK.  I should probably use that formulation in that 
particular section.

> (This is also my #1 beef with Python: too many utterly impenetrable
> error messages... or at least impenetrable to me.)

That one made sense to me, but that is one of the sharper parts of its 
learning curve.  I think part of the problem is that much of the tutorial 
Python code out there never gets into error handling.  One of the things 
Java does that I miss is that if you create a "try" block, it won't work 
until you have addressed every possible exception the code in there can 
generate.  It may be impractical for Python to know that much, but I 
wonder if a lint-ish tool for the language might be able to do it.

The above mumbling is certainly off-topic for this list though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Mon, 1 Dec 2008, Tom Lane wrote:

> Greg Smith <gsmith@gregsmith.com> writes:
>> I'd ultimately like to use the Python version as a spec to produce a C
>> implementation, because that's the only path to get something like this
>> integrated into initdb itself.
>
> It won't get integrated into initdb in any case: a standalone tool is
> the correct thing.  Or do you think people should have to initdb in
> order to get fresh advice after changes in the environment?

The long-term goal Josh suggested aiming for (that I abandoned for now) 
was replacing the part of initdb that creates the initial postgresql.conf 
via the sample with a call to a program that generates it instead.  Said 
program could also run standalone for later updates to the configuration.

I have no interest in arguing about whether that's a good or bad idea at 
this point, I've accepted that the only useful statements to be made in 
this area come in the form of a working implementation to review.  Until I 
have one of those, this really isn't worth getting into.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Mark Wong"
Date:
On Thu, Nov 13, 2008 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> A lot of people have suggested raising our default_statistics target,
>> and it has been rejected because there's some O(n^2) behavior in the
>> planner, and it makes ANALYZE slower, but it's not that crazy.
>
> I think everyone agrees it ought to be raised.  Where the rubber meets
> the road is deciding just *what* to raise it to.  We've got no
> convincing evidence in favor of any particular value.
>
> If someone actually wanted to put some effort into this, I'd suggest
> taking some reasonably complex benchmark (maybe TPCH or one of the DBT
> series) and plotting planner runtime for each query as a function of
> statistics_target, taking care to mark the breakpoints where it shifted
> to a better (or worse?) plan due to having better stats.

Hi all,

I have some data from a mostly un-tuned system.  I have a 10GB scale
factor using DBT-3 on a single 25 disk hardware RAID-0 lun.  Generally
speaking, it seems like on un-tuned systems increasing the
default_statistics_target for this workload doesn't have a clear
benefit.  Here is a histogram of the Power Test and Throughput Test
with default GUC values.  The Power and Throughput test results are
plotted against the default_statistics_target value.  For reminders,
the Power Test executes each query serially, while the Throughput Test
executes 8 streams of the 22 queries simultaneously.

Default values:
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/default.png

Raw data in http://207.173.203.223/~markwkm/community6/dbt3/,
directories 1 - 11.

I started increasing the shared_buffers and effective_cache_size to
see if there might be anything more interesting in the results, but
still don't think so.

shared_buffers=2048MB, effective_cache_size=1024MB
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/2048x1024.png

Raw data in http://207.173.203.223/~markwkm/community6/dbt3/,
directories 12 - 22.

-----

shared_buffers=4096MB, effective_cache_size=2048MB
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/4096x2048.png

Raw data in http://207.173.203.223/~markwkm/community6/dbt3/,
directories 23 - 33.

-----

shared_buffers=20480MB
effective_cache_size=10240MB
checkpoint_segments=3000

http://207.173.203.223/~markwkm/pgsql/default_statistics_target/20480x10240.png

Raw data in http://207.173.203.223/~markwkm/community6/dbt3/,
directories 34 - 44.

So then I attempted to see if there might have been a difference between
the execution time of each individual query with the above parameters.
The queries that don't seem to be affected are Q1, Q4, Q12, Q13, and
Q15.  Q17 suggests that anything higher than
default_statistics_target=10 is an improvement.  The rest of the
queries appear not to follow any particular trend with respect to
default_statistics_target.

Here's query 17 when default_statistics_target=10
EXPLAIN ANALYZE select sum(l_extendedprice) / 7.0 as avg_yearly from
lineitem, part where p_partkey = l_partkey and p_brand = 'Brand#52'
and p_container = 'SM PACK' and l_quantity < ( select 0.2 *
avg(l_quantity) from lineitem where l_partkey = p_partkey );

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15725486.01..15725486.02 rows=1 width=4) (actual time=31669.249..31669.249 rows=1 loops=1)
   ->  Hash Join  (cost=70662.08..15725435.96 rows=20019 width=4) (actual time=690.551..31666.031 rows=5413 loops=1)
         Hash Cond: (public.lineitem.l_partkey = part.p_partkey)
         Join Filter: (public.lineitem.l_quantity < (subplan))
         ->  Seq Scan on lineitem  (cost=0.00..1668441.22 rows=60052722 width=12) (actual time=0.007..11016.720 rows=60046144 loops=1)
         ->  Hash  (cost=70637.08..70637.08 rows=2000 width=4) (actual time=548.192..548.192 rows=1986 loops=1)
               ->  Seq Scan on part  (cost=0.00..70637.08 rows=2000 width=4) (actual time=0.066..547.106 rows=1986 loops=1)
                     Filter: ((p_brand = 'Brand#52'::bpchar) AND (p_container = 'SM PACK'::bpchar))
         SubPlan
           ->  Aggregate  (cost=227.86..227.87 rows=1 width=4) (actual time=0.141..0.141 rows=1 loops=59857)
                 ->  Index Scan using i_l_partkey on lineitem  (cost=0.00..227.72 rows=54 width=4) (actual time=0.071..0.132 rows=31 loops=59857)
                       Index Cond: (l_partkey = $0)
 Total runtime: 31669.340 ms
(13 rows)


Here's query 17 when default_statistics_target=1000
EXPLAIN ANALYZE select sum(l_extendedprice) / 7.0 as avg_yearly from
lineitem, part where p_partkey = l_partkey and p_brand = 'Brand#21'
and p_container = 'LG BAG' and l_quantity < ( select 0.2 *
avg(l_quantity) from lineitem where l_partkey = p_partkey );

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8660521.85..8660521.87 rows=1 width=4) (actual time=7035.069..7035.069 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..8660472.24 rows=19845 width=4) (actual time=0.183..7033.231 rows=5148 loops=1)
         Join Filter: (public.lineitem.l_quantity < (subplan))
         ->  Seq Scan on part  (cost=0.00..70639.00 rows=1983 width=4) (actual time=0.054..544.415 rows=1934 loops=1)
               Filter: ((p_brand = 'Brand#21'::bpchar) AND (p_container = 'LG BAG'::bpchar))
         ->  Index Scan using i_l_partkey on lineitem  (cost=0.00..131.88 rows=31 width=12) (actual time=1.378..1.400 rows=30 loops=1934)
               Index Cond: (public.lineitem.l_partkey = part.p_partkey)
         SubPlan
           ->  Aggregate  (cost=135.45..135.46 rows=1 width=4) (actual time=0.064..0.064 rows=1 loops=58151)
                 ->  Index Scan using i_l_partkey on lineitem  (cost=0.00..135.37 rows=31 width=4) (actual time=0.005..0.056 rows=31 loops=58151)
                       Index Cond: (l_partkey = $0)
 Total runtime: 7035.149 ms
(12 rows)


Histograms of each individual query:

Q1
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q1.png

Q2
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q2.png

Q3
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q3.png

Q4
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q4.png

Q5
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q5.png

Q6
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q6.png

Q7
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q7.png

Q8
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q8.png

Q9
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q9.png

Q10
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q10.png

Q11
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q11.png

Q12
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q12.png

Q13
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q13.png

Q14
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q14.png

Q15
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q15.png

Q16
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q16.png

Q17
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q17.png

Q18
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q18.png

Q19
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q19.png

Q20
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q20.png

Q21
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q21.png

Q22
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q22.png


I dumped a lot of raw data and gave little analysis, I hope someone
can make more sense of it... :)

Regards,
Mark


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Mon, 1 Dec 2008, Mark Wong wrote:

> So then I attempted to see if there might have been difference between 
> the executing time of each individual query with the above parameters. 
> The queries that don't seem to be effected are Q1, Q4, Q12, Q13, and 
> Q15.  Q17 suggests that anything higher than 
> default_statistics_target=10 is an improvement.  The rest of the queries 
> appears not to follow any particular trend with respect to 
> default_statistics_target.

The interesting ones are Q2, Q9, Q17, Q18, and Q20, and that data is much 
more useful than the summary.  As you mention, Q17 improves significantly 
with a higher target.  All of the rest are dramatically slower in one or 
both tests going from default_statistics_target=10 to 100.  Those look 
like the most useful data points on the X axis--the increases from 100 up 
to 1000 aren't particularly interesting in most of these, except in Q20 
where the "Power Test" seems to oscillate between degrees of good and bad 
behavior seemingly at random.

My picks for the most useful graphs from the long list Mark sent:

http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q2.png
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q9.png
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q17.png
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q18.png
http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q20.png

I think the test you could consider next is to graph the target going 
from 10 to 100 in steps of 10, just for those 5 queries.  If performance 
gradually degrades, that's interesting but hard to nail down.  But if 
there's a sharp transition, getting an explain plan for the two sides of 
that should provide some insight.  I'm really more interested in the ones 
that slowed down than the one that improved; understanding that might 
finally provide some evidence against increasing it by default.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> Hi all,
>
> I have some data [...]

Thanks for gathering this data.

The first thing I notice is that the two versions of Q17 that you are
running are actually not the exact same query - there are hard-coded
constants that are different in each case, and that matters.  The
substituted parameter doesn't affect the plan, but the constants that
are actually in the query do.

It seems that some of the other queries may be similar - for example,
Q2 has a couple of very fast runs (with default_stats_target 10 and
600) that probably got a better plan than the other runs, and Q6, Q9,
and Q20 look suspiciously like there may have been two different plans
in the mix as well, presumably because the actual queries varied
somewhat.

I think the only way to fix this is to run the test a large number of
times with each DST and average.  :-(

Q17 looks like a pretty illustrative example of how a higher
statistics target can (potentially) help: it enables the planner to
realize that a qual on the part table is highly selective, and
therefore switch to a nested loop w/index-scan instead of a hash join.
(This may not be the only case, but I can't think of the others right
now.)  I haven't actually looked at the selectivity estimation code,
but I'm assuming that if we have n MCVs then we can estimate that any
non-MCV occurs with frequency < 1/n (in fact, < the frequency of the
least-frequent MCV, but 1/n at most).  So we want n to be large enough
that 1/n is below the cutoff for switching to an index scan
(otherwise, we'll end up using the hash join even when the qual
selects an extremely infrequent value).  It might be helpful to figure
out where that cutoff is and what factors it depends on.
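
To make that concrete with made-up numbers (not taken from these plans):

# toy illustration: with n MCVs tracked, a non-MCV value is assumed to
# occur with frequency at most 1/n, so the row estimate for
# "col = rare_constant" on a 1M row table shrinks as the target grows
table_rows = 1000000
for n_mcvs in (10, 50, 100, 1000):
    print n_mcvs, "MCVs -> at most", table_rows / n_mcvs, "rows estimated"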

...Robert


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> ... where the "Power Test" seems to oscillate between degrees of good and bad
> behavior seemingly at random.

Are any of the queries complicated enough to trigger GEQO planning?
        regards, tom lane


Re: Simple postgresql.conf wizard

From
Bruce Momjian
Date:
Gregory Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > "Dann Corbit" <DCorbit@connx.com> writes:
> >> I also do not believe that there is any value that will be the right
> >> answer.  But a table of data might be useful both for people who want to
> >> toy with altering the values and also for those who want to set the
> >> defaults.  I guess that at one time such a table was generated to
> >> produce the initial estimates for default values.
> >
> > Sir, you credit us too much :-(.  The actual story is that the current
> > default of 10 was put in when we first implemented stats histograms,
> > replacing code that kept track of only a *single* most common value
> > (and not very well, at that).  So it was already a factor of 10 more
> > stats than we had experience with keeping, and accordingly conservatism
> > suggested not boosting the default much past that.
> 
> I think that's actually too little credit. The sample size is chosen quite
> carefully based on solid mathematics to provide a specific confidence interval
> estimate for queries covering ranges the size of a whole bucket.
> 
> The actual number of buckets is more of an arbitrary choice. It depends entirely
> on how your data is distributed and how large a range your queries are
> covering. A uniformly distributed data set should only need a single bucket to
> generate good estimates. Less evenly distributed data sets need more.
> 
> I wonder actually if there are algorithms for estimating the number of buckets
> needed for a histogram to achieve some measurable goal. That would close the
> loop. It would be much more reassuring to base the size of the sample on solid
> statistics than on hunches.

I have a few thoughts on this.  First, people are correct that there is
no perfect default_statistics_target value.  This is similar to the
problem with the pre-8.4 max_fsm_pages/max_fsm_relations, for which
there also was never a perfect value.  But, if the FSM couldn't store
all the free space, a server log message was issued that recommended
increasing these values;  the same is still done for
checkpoint_segments.  Is there a way we could emit a server log message
to recommend increasing the statistics targets for specific columns?

Also, is there a way to increase the efficiency of the statistics
targets lookups?  I assume the values are already sorted in the
pg_statistic arrays;  do we already do a binary lookup on those?  Does
that help?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: default_stats_target WAS: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Mark,

> Generally
> speaking, it seems like on un-tuned systems increasing the
> default_statistics_target for this workload doesn't have a clear
> benefit.

Do you have any idea how skewed the distribution of data for DBT3 is?  If 
values are being generated in relatively equal proportion, I'd expect 
increasing DST to have little effect.  The databases where higher DST is 
useful are ones with skewed data distribution.

Unfortunately, all the data examples I could point to are proprietary 
customer databases :-(

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco


Re: default_stats_target WAS: Simple postgresql.conf wizard

From
Aidan Van Dyk
Date:
* Josh Berkus <josh@agliodbs.com> [081202 15:54]:
> Do you have any idea how skewed the distribution of data for DBT3 is?  If 
> values are being generated in relatively equal proportion, I'd expect 
> increasing DST to have little effect.  The databases where higher DST is 
> useful is ones with skewed data distribution.
> 
> Unfortunately, all the data examples I could point to are proprietary 
> customer databases :-(

But nobody's asking anybody to point out "skewed" data... I think it's
*unanimous* that on skewed data, a higher stats target is needed for the
skewed columns.

The question is how much of a penalty the (majority of?) users with "normal"
data columns will have to pay in stats/planning overhead to accommodate a
blanket increase in DST for the (few?) skewed columns.

I think Mark started to try to show that overhead/difference with real
numbers.

My (probably unsubstantiated) bias is showing, but nobody else has (yet)
showed otherwise ;-)

a.

-- 
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

Re: Simple postgresql.conf wizard

From
"Guillaume Smet"
Date:
Greg,

On Mon, Dec 1, 2008 at 3:17 AM, Greg Smith <gsmith@gregsmith.com> wrote:
> ./pgtune -i ~/data/postgresql.conf

First, thanks for your work: it will really help a lot of people to
have a decent default configuration.

A couple of comments from reading the code (I didn't run it yet):
- it would be really nice to make it work with Python 2.4 as RHEL 5 is
a Python 2.4 thing and it is a very widespread platform out there,
- considering the audience of this tool, I think you should explain in
the usage text which type of workload implies each database type (DW,
OLTP, Web, Mixed, Desktop).
- not sure of the wording of "The file this needs to operate correctly
can be generated with:"
- it would be nice to be able to define the architecture (32-64 bits)
from the command line (especially considering I won't be able to run
it on our target boxes which are all RHEL 5 :))

I'll see if I have more feedback while testing it for real.

-- 
Guillaume


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> Looks like I need to add Python 2.5+Linux to my testing set.  I did not
> expect that the UNIX distributions of Python 2.5 would ship with wintypes.py
> at all.  I think I can fix this on the spot though.  On line 40, you'll find
> this bit:
>
> except ImportError:
>
> Change that to the following:
>
> except ImportError,ValueError:

That didn't work, same error message.

> And it should pass that point.  If it doesn't, you can try the completely
> general:
>
> except:

That worked.

The settings that this initially spit out (I guess it defaults to
"mixed" mode) didn't look too sane to me, because as discussed
elsewhere on this thread 50 is not a reasonable value for
default_statistics_target for my installation.  It also wanted to set
constraint_exclusion to on, which I'm pretty confident is useless.

Then I tried "-T web" and got what seemed like a more reasonable set
of values.  But I wasn't sure I needed that many connections, so I
added "-c 150" to see how much difference that made.  Kaboom!

$ ./pgtune -i ~postgres/data/postgresql.conf -T web -c 150
Traceback (most recent call last):
  File "./pgtune", line 463, in <module>
    wizardTune(config,options,settings)
  File "./pgtune", line 403, in wizardTune
    'web':mem/con, 'oltp':mem/con,'dw':mem/con/2,
TypeError: unsupported operand type(s) for /: 'int' and 'str'

I'm not sure what "mixed" mode is supposed to be, but based on what
I've seen so far, I'm skeptical of the idea that encouraging people
to raise default_statistics_target to 50 and turn on
constraint_exclusion is reasonable.  I'm also a bit surprised that
there doesn't seem to be anything here that depends on the size of the
database, even order-of-magnitude.  It seems like the right value for
checkpoint_segments, at least, might depend on that.

...Robert


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Wed, 2008-12-03 at 13:30 -0500, Robert Haas wrote:
> > Looks like I need to add Python 2.5+Linux to my testing set.  I did not
> > expect that the UNIX distributions of Python 2.5 would ship with wintypes.py
> > at all.  I think I can fix this on the spot though.  On line 40, you'll find
> > this bit:
> >
> > except ImportError:
> >
> > Change that to the following:
> >
> > except ImportError,ValueError:
> 
> That didn't work, same error message.
> 
> > And it should pass that point.  If it doesn't, you can try the completely
> > general:
> >
> > except:
> 
> That worked.
> 
> The settings that this initially spit out (I guess it defaults to
> "mixed" mode) didn't look too sane to me, because as discussed
> elsewhere on this thread 50 is not a reasonable value for
> default_statistics_target for my installation.  It also wanted to set
> constraint_exclusion to on, which I'm pretty confident is useless.
> 
> Then I tried "-T web" and got what seemed like a more reasonable set
> of values.  But I wasn't sure I needed that many connections, so I
> added "-c 150" to see how much difference that made.  Kaboom!
> 
> $ ./pgtune -i ~postgres/data/postgresql.conf -T web -c 150
> Traceback (most recent call last):
>   File "./pgtune", line 463, in <module>
>     wizardTune(config,options,settings)
>   File "./pgtune", line 403, in wizardTune
>     'web':mem/con, 'oltp':mem/con,'dw':mem/con/2,
> TypeError: unsupported operand type(s) for /: 'int' and 'str'
> 
> I'm not sure what "mixed" mode is supposed to be, but based on what
> I've seen so far, I'm a skeptical of the idea that encouraging people
> to raise default_statistics_target to 50 and turn on
> constraint_exclusion is reasonable.

Why?

>   I'm also a bit surprised that
> there doesn't seem to be anything here that depends on the size of the
> database, even order-of-magnitude.  It seems like the right value for
> checkpoint_segments, at least, might depend on that.

What does checkpoint_segments have to do with the size of the database?

Joshua D. Drake




> 
> ...Robert
> 
-- 
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
>> I'm not sure what "mixed" mode is supposed to be, but based on what
>> I've seen so far, I'm a skeptical of the idea that encouraging people
>> to raise default_statistics_target to 50 and turn on
>> constraint_exclusion is reasonable.
>
> Why?

Because both of those settings are strictly worse for my database than
the defaults.  I don't have any partitioned tables, and see:

http://archives.postgresql.org/pgsql-hackers/2008-11/msg01837.php

>>   I'm also a bit surprised that
>> there doesn't seem to be anything here that depends on the size of the
>> database, even order-of-magnitude.  It seems like the right value for
>> checkpoint_segments, at least, might depend on that.
>
> What does checkpoint_segments have to do with the size of the database?

It seems unlikely that you would want 256 MB of checkpoint segments on
a database that is only 100 MB (or even 500 MB).  But you might very
well want that on a database that is 1 TB.

...Robert


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> On Wed, 2008-12-03 at 13:30 -0500, Robert Haas wrote:
>> I'm not sure what "mixed" mode is supposed to be, but based on what
>> I've seen so far, I'm a skeptical of the idea that encouraging people
>> to raise default_statistics_target to 50 and turn on
>> constraint_exclusion is reasonable.
>
> Why?

Well did you have any response to what I posited before? I said "mixed" should
produce the same settings that initdb produces by default, at least
on the kind of moderately low-memory machine that initdb targets.

It sure seems strange to me to have initdb which presumably is targeting a
"mixed" system -- where it doesn't know for sure what workload will be run --
produce a different set of values than the tuner on the same machine.

>>   I'm also a bit surprised that
>> there doesn't seem to be anything here that depends on the size of the
>> database, even order-of-magnitude.  It seems like the right value for
>> checkpoint_segments, at least, might depend on that.
>
> What does checkpoint_segments have to do with the size of the database?

I had the same reaction but I think he's right.

checkpoint_segments is the maximum amount of space you want the WAL to take up
(ideally). Presumably on a small database you don't want hundreds of megabytes
of WAL for a 10M database. But on a terabyte data warehouse sitting on a big
SAN you're not going to be concerned with how much space the WAL files are
taking. In fact, really it would be nice if we allowed units of space (MB, GB,
etc) for checkpoint_segments.
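
To put rough numbers on that: using the usual upper-bound estimate of about
(2 + checkpoint_completion_target) * checkpoint_segments + 1 files of 16MB
each, a space budget would translate back into a segment count along these
lines (a sketch only, not a proposal for how the GUC should actually work):

def segments_for_wal_budget(budget_mb, completion_target=0.5):
    # invert the 16MB-per-file space estimate, never dropping below the default
    return max(3, int(budget_mb / 16.0 / (2 + completion_target)))

# e.g. a 100MB budget clamps to the default of 3; a 1GB budget allows ~25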

I used to think of checkpoint_segments in terms of transaction rate and
maximum tolerable recovery time but really if those are your constraints
you're better off using checkpoint_timeout I think.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Wed, 2008-12-03 at 15:21 -0500, Robert Haas wrote:
> >> I'm not sure what "mixed" mode is supposed to be, but based on what
> >> I've seen so far, I'm a skeptical of the idea that encouraging people
> >> to raise default_statistics_target to 50 and turn on
> >> constraint_exclusion is reasonable.
> >
> > Why?
> 
> Because both of those settings are strictly worse for my database than

I can see an argument about constraint_exclusion but
default_statistics_target I don't.

> the defaults.  I don't have any partitioned tables, and see:
> 
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01837.php
> >
> > What does checkpoint_segments have to do with the size of the database?
> 
> It seems unlikely that you would want 256 MB of checkpoint segments on
> a database that is only 100 MB (or even 500 MB).  But you might very
> well want that on a database that is 1 TB.

It also seems unlikely that you would hit 256MB of checkpoint segments
on a 100MB database before checkpoint_timeout and if you did, you
certainly did need them.

Remember postgresql only creates the segments when it needs them. 

Sincerely,

Joshua D. Drake


> 
> ...Robert
> 
-- 
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> Well did you have any response to what I posited before? I said "mixed" should
> produce the same settings that the default initdb settings produce. At least
> on a moderately low-memory machine that initdb targets.

I'm actually really skeptical of this whole idea of modes.  The main
thing mode does, other than set max connections, is handle desktop
differently than other modes by decreasing shared_buffers and
effective_cache_size by 4x and work_mem by 3x.  And the default
settings for max_connections are a SWAG that could easily be way off
for any particular installation.  I think it would be more useful to
get rid of modes, accept the user is going to have to specify
max_connections if the default of, say, 100 is not reasonable, and
handle the desktop case by telling the user to rerun the tool
overriding the system memory with a lower value.

I'm not sure if you've thought about this, but there is also a
difference between max_connections and maximum LIKELY connections.
For example my apps don't have too many users, since they are
internal-facing.  But setting max_connections to 100 gives me a nice
buffer just in case everyone decides to log on at once.  Still, for
performance reasons, I'd prefer to calculate based on a more likely
scenario, where the concurrent user count might be only 10 or 20.

...Robert


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> I can see an argument about constraint_exclusion but
> default_statistics_target I don't.

Why not?  I don't want to accept a big increase in ANALYZE times (or
planning times, though I'm really not seeing that at this point)
without some benefit.

>> It seems unlikely that you would want 256 MB of checkpoint segments on
>> a database that is only 100 MB (or even 500 MB).  But you might very
>> well want that on a database that is 1 TB.
>
> It also seems unlikely that you would hit 256MB of checkpoint segments
> on a 100MB database before checkpoint_timeout and if you did, you
> certainly did need them.

So why do we have this parameter at all?

...Robert


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Wed, 2008-12-03 at 16:37 -0500, Robert Haas wrote:
> > I can see an argument about constraint_exclusion but
> > default_statistics_target I don't.
> 
> Why not?  I don't want to accept a big increase in ANALYZE times (or
> planning times, though I'm really not seeing that at this point)
> without some benefit.

If you are concerned about the analyze time between 10, 50 and 150, I
would suggest that you are concerned about the wrong things. Remember
that this "simple postgresql.conf wizard" is for those who don't know,
don't want to know and will likely never know how to answer, "How do I
configure this parameter?"

> >
> > It also seems unlikely that you would hit 256MB of checkpoint segments
> > on a 100MB database before checkpoint_timeout and if you did, you
> > certainly did need them.
> 
> So why do we have this parameter at all?

Excellent question, for a different thread :)

Joshua D. Drake


> 
> ...Robert
> 
-- 
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> It also seems unlikely that you would hit 256MB of checkpoint segments
> on a 100MB database before checkpoint_timeout and if you did, you
> certainly did need them.
>
> Remember postgresql only creates the segments when it needs them. 

Should we change the initdb output then?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
On Wed, Dec 3, 2008 at 4:41 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> If you are concerned about the analyze time between 10, 50 and 150, I
> would suggest that you are concerned about the wrong things. Remember

I can't rule that out.  What things do you think I should be concerned
about?  ISTM that default_statistics_target trades off ANALYZE time
and query planning time vs. the possibility of better plans.  If the
former considerations are not an issue for dst = 50, then maybe we
should emit 50 by default.  But the limited evidence that has been
published in this forum thus far doesn't support that contention.

>> > It also seems unlikely that you would hit 256MB of checkpoint segments
>> > on a 100MB database before checkpoint_timeout and if you did, you
>> > certainly did need them.
>>
>> So why do we have this parameter at all?
>
> Excellent question, for a different thread :)

I think the rhetorical answer is "so that we don't fill up the disk",
which gets us back to database size.

...Robert


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Wed, 2008-12-03 at 17:33 -0500, Robert Haas wrote:
> On Wed, Dec 3, 2008 at 4:41 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > If you are concerned about the analyze time between 10, 50 and 150, I
> > would suggest that you are concerned about the wrong things. Remember
> 
> I can't rule that out.  What things do you think I should be concerned
> about?

Your database's production performance with the change of the parameter.
Quite a bit more often than not, your problem (if you have one) isn't
going to be default_statistics_target is too high.

>   ISTM that default_statistics_target trades off ANALYZE time
> and query planning time vs. the possibility of better plans.  If the
> former considerations are not an issue for dst = 50, then maybe we
> should emit 50 by default.  But the limited evidence that has been
> published in this forum thus far doesn't support that contention.
> 

Actually there are years worth of evidence in these archives. Not that
the 50 is the right number but that the current settings are definitely
wrong and that higher ones are needed. That people generally start
around 100 and go from there, except where they don't and then someone
like Tom, I or some other person says, "Oh you need to increase
default_statistics_target".

There is no empirical evidence that 50 is the right setting but there is
more than enough anecdotal evidence to suggest that 50 is a lot better
than 10 and that even higher than 50 is reasonable. In an effort to
follow the PostgreSQL conservative mantra, 50 is a good compromise. 


> >> > It also seems unlikely that you would hit 256MB of checkpoint segments
> >> > on a 100MB database before checkpoint_timeout and if you did, you
> >> > certainly did need them.
> >>
> >> So why do we have this parameter at all?
> >
> > Excellent question, for a different thread :)
> 
> I think the rhetorical answer is "so that we don't fill up the disk",

I don't think at any time I have said to myself, I am going to set this
parameter low so I don't fill up my disk. If I am saying that to myself,
I have greatly underestimated the hardware for the task. Consider
that we are quarreling over what amounts to a nominal amount of hard
drive space; 1000 checkpoint_segments = 1.6G of space. My phone has more
capacity than that.

*If* hard drive space is (that much of) a concern then you already have
other problems that pgtune won't solve, and you should be manually tuning
the conf in the first place.

Joshua D. Drake


> which gets us back to database size.
> 
> ...Robert
> 
-- 
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> "Robert Haas" <robertmhaas@gmail.com> wrote: 
> On Wed, Dec 3, 2008 at 4:41 PM, Joshua D. Drake
<jd@commandprompt.com> wrote:
>> If you are concerned about the analyze time between 10, 50 and 150,
I
>> would suggest that you are concerned about the wrong things.
Remember
> 
> I can't rule that out.  What things do you think I should be
concerned
> about?  ISTM that default_statistics_target trades off ANALYZE time
> and query planning time vs. the possibility of better plans.  If the
> former considerations are not an issue for dst = 50, then maybe we
> should emit 50 by default.  But the limited evidence that has been
> published in this forum thus far doesn't support that contention.
One more data point to try to help.

While the jump in default_statistics_target from 10 to 1000
resulted in a plan time increase for a common query from 50 ms to 310
ms, at a target of 50 the plan time was 53 ms.  Analyze time was 7.2
minutes and 18.5 minutes for targets of 10 and 50.  This is an 842 GB
database on an 8 processor (3.5 GHz Xeon) machine with 64 GB RAM
running (soon to be updated) PostgreSQL 8.2.7.

Based on the minimal plan time increase of this test, we're going to
try 50 in production and see how it goes.

It's worth pondering that at the target of 1000, had we put that into
production, running this query 300,000 times per day would have used
21 hours and 40 minutes of additional CPU time per day on planning the
runs of this one query, while a target of 50 only consumes an
additional 15 minutes of 3.5 GHz CPU time per day.
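
Spelled out, that arithmetic (per-plan overhead relative to the target-10
baseline, times 300,000 runs per day) is:

print 300000 * (310 - 50) / 1000.0 / 3600   # ~21.7 hours per day at target 1000
print 300000 * ( 53 - 50) / 1000.0 / 60     # 15 minutes per day at target 50
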
-Kevin


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Wed, 3 Dec 2008, Gregory Stark wrote:

> It sure seems strange to me to have initdb which presumably is targeting a
> "mixed" system -- where it doesn't know for sure what workload will be run --
> produce a different set of values than the tuner on the same machine.

It's been a long time since the output from initdb was targeting anything 
but a minimal system with an untuned kernel and limited resources.  If you 
invert the normal tuning recommendations, as if its initial configuration 
were the output from typical practice, it would be aiming at a system with 
approximately 128MB of RAM.  That feels about right to me; when I had 
128MB of RAM in my high-end P2-300 server running PG 7.0, 32MB of 
shared_buffers was huge and 3 checkpoints segments was plenty.  I don't 
recall regularly dirtying things fast enough to see "checkpoints occuring 
too fast" then like you can do trivially nowadays.  Here in 2008, I push 
checkpoint_segments up to 10 even for the most trivial apps lest the logs 
fill with those buggers the first time I run an update on a table.

Right now, my program doesn't fiddle with any memory settings if you've 
got less than 256MB of RAM.  Were someone to champion the idea that 
*nothing* should be fiddled with in those cases, that's not an 
unreasonable position.  I'm not the sort to be too concerned myself that 
the guy who thinks he's running a DW on a system with 64MB of RAM might 
get bad settings, but it's a fair criticism to point that out as a 
problem.

> In fact, really it would be nice if we allowed units of space (MB, GB,
> etc) for checkpoint_segments.

That's a good way to think about this, let's run with that for a minute. 
The values I'm throwing in there look like this:

                 Completion    Max         Max
Type     Segs    Target        Segments    Usage
web       8      0.7            23         368MB
oltp     16      0.9            47         752MB
dw       64      0.9           187         3GB
mixed    16      0.9            47         752MB
desktop   3      0.5             9         144MB
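
The last two columns come from roughly the usual worst-case estimate of
(2 + completion_target) * segments + 1 WAL files of 16MB each; the exact
rounding may differ, but approximately:

def max_wal(segs, target):
    files = int(round((2 + target) * segs)) + 1
    return files, files * 16    # segment files, MB

for name, segs, target in (("web", 8, 0.7), ("oltp", 16, 0.9),
                           ("dw", 64, 0.9), ("mixed", 16, 0.9),
                           ("desktop", 3, 0.5)):
    print name, max_wal(segs, target)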

Is 368MB of overhead unreasonable for a web application database today, 
where you can get a mirrored pair of disks for under $1/GB?  It's only the 
DW case that even starts to leave trivial territory.  Your example of 
somebody who thinks the overhead is too high on their 10MB database is 
already being blown away even at the default of 3 segments (assuming that 
data has enough churn on it to go through that many segments ever--if it 
doesn't then the maximum doesn't matter anyway).

The reality here is that it's the recovery playback time that's the real 
bear.  If I were trying to argue against myself, what would be more persuasive 
is some tests showing how long it takes to sort through, clean up, and 
replay the appropriate portions of as many as 47 segments worth of WAL 
after an unclean shutdown when checkpoint_segments=16.  Given how long 
that takes, it may turn out that a modern system needs a while to 
process that much WAL volume.  It's pretty rare I run into that (usually 
only after I do something abusive), whereas complaints about the logs 
filling with checkpoint warnings on systems set to the default seem to pop 
up all the time.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> Actually there are years worth of evidence in these archives. Not that
> the 50 is the right number but that the current settings are definitely
> wrong and that higher ones are needed. That people generally start
> around 100 and go from there, except where they don't and then someone
> like Tom, I or some other person says, "Oh you need to increase
> default_statistics_target".

no, Tom or some other person would say "you need to raise the statistics
target *on that column*".  I've never seen any case on the list where raising
the statistics target on every column of every table in every database of the
cluster was warranted.

> There is no empirical evidence that 50 is the right setting but there is
> more than enough anecdotal evidence to suggest that 50 is a lot better
> than 10 and that even higher than 50 is reasonable. In an effort to
> follow the PostgereSQL conservative mantra, 50 is a good compromise. 

A factor of 5 increase is conservative? In any case the mantra is "run the
numbers and produce the evidence" not "pick a smaller arbitrary number". I
started to do this for you last week but got side-tracked. Do you have any
time for this?

>> I think the rhetorical answer is "so that we don't fill up the disk",
>
> I don't think at any time I have said to my self, I am going to set this
> parameter low so I don't fill up my disk. If I am saying that to myself
> I have either greatly underestimated the hardware for the task. Consider
> that we are quarreling over what amounts to a nominal amount of hard
> drive space, 1000 checkpoint_segments = 1.6G of space. My phone has more
> capacity than that.

Well my phone has 16G of RAM, why not 10000 ?

Any number you pick will be an arbitrary one. Whatever value you pick it's
going to be "so that we don't fill up the disk" in some form or another.

You know, you actually had me convinced but now you've just convinced me that
we do need this parameter after all.

So how big should a minimum postgres install be not including your data? Is
100M reasonable? Should we say Postgres requires 200M? 500? 1G? Whatever
number we pick (or allow the user to pick) will determine how large this value
ought to be. And incidentally also provide a bound on
autovacuum_max_freeze_age as Heikki pointed out on another thread.


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
 


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

> One more data point to try to help.
>  
> While the jump from a default_statistics_target from 10 to 1000
> resulted in a plan time increase for a common query from 50 ms to 310
> ms, at a target of 50 the plan time was 53 ms.  

That sounds like it would be an interesting query to analyze in more detail.
Is there any chance you could run the complete graph and get a chart of analyze
times for all statistics values from 1..1000?  And log the explain plans to a
file so we can see at which statistics targets the plan changed?

Or if the data is public I would be interested in looking at doing it if you
want to send it to me.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Thu, 2008-12-04 at 00:11 +0000, Gregory Stark wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:

> I started to do this for you last week but got side-tracked. Do you have any
> time for this?

I can do it if you have a script.

> So how big should a minimum postgres install be not including your data? 
> Is 100M reasonable? Should we say Postgres requires 200M? 500? 1G? Whatever
> number we pick (or allow the user to pick) will determine how large this value
> ought to be. And incidentally also provide a bound on
> autovacuum_max_freeze_age as Heikki pointed out on another thread.
> 

I fail to see what any of the above paragraph has to do with
checkpoint_segments.

Anyway, I have made my arguments.

Joshua D. Drake


> 
-- 
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Wed, 3 Dec 2008, Robert Haas wrote:

> I'm not sure if you've thought about this, but there is also a
> difference between max_connections and maximum LIKELY connections.

It's actually an implicit assumption of the model Josh threw out if you 
stare at the numbers.  The settings for work_mem are twice as high per 
connection in the Web+OLTP application cases, based on the assumption that 
you're just not going to get everybody doing sorting at once in those 
situations.  I toyed with exposing that as an explicit "connection load 
duty factor", then remembered I was trying to deliver something rather 
than tweak the parameters forever.  It may be a bit too aggressive as 
written right now in those cases.
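
A sketch of what such a factor might look like if it were exposed
(illustrative only, not the script's actual code or numbers):

# divide memory among the connections expected to be sorting at once,
# rather than the configured maximum; duty_factor is the assumed fraction
# of max_connections active simultaneously
def work_mem_kb(total_mem_kb, max_connections, duty_factor):
    likely_active = max(1, int(max_connections * duty_factor))
    return total_mem_kb / likely_active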

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Wed, 3 Dec 2008, Guillaume Smet wrote:

> - it would be really nice to make it work with Python 2.4 as RHEL 5 is
> a Python 2.4 thing and it is a very widespread platform out there,

The 2.5 stuff is only required in order to detect memory on Windows.  My 
primary box is RHEL5 and runs 2.4, it works fine there.

> - considering the audience of this tool, I think you should explain in
> the usage text which type of workload implies each database type (DW,
> OLTP, Web, Mixed, Desktop).

Once I'm done with the docs I'll refer over to those, it's too much to put 
into the usage without cluttering it.

> - it would be nice to be able to define the architecture (32-64 bits)
> from the command line (especially considering I won't be able to run
> it on our target boxes which are all RHEL 5 :))

I'm starting to lean toward making everything that gets detected also 
possible to override, for these cases.  I want to make this just work 
in as many cases as possible, but the situation where someone is 
configuring/testing on a system other than the server is pretty common. 
Now that I think about it, I often set up configs on my 32-bit laptop and 
then move them over onto 64-bit servers.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Greg Smith <gsmith@gregsmith.com> writes:

> On Wed, 3 Dec 2008, Gregory Stark wrote:
>
>> It sure seems strange to me to have initdb which presumably is targeting a
>> "mixed" system -- where it doesn't know for sure what workload will be run --
>> produce a different set of values than the tuner on the same machine.
>
> It's been a long time since the output from initdb was targeting anything but a
> minimal system with an untuned kernel and limited resources.  If you invert the
> normal tuning recommendations, as if its initial configuration were the output
> from typical practice, it would be aiming at a system with approximately 128MB
> of RAM.  That feels about right to me; when I had 128MB of RAM in my high-end
> P2-300 server running PG 7.0, 32MB of shared_buffers was huge and 3 checkpoints
> segments was plenty.  I don't recall regularly dirtying things fast enough to
> see "checkpoints occuring too fast" then like you can do trivially nowadays.

Well I think there's also an assumption in initdb that Postgres can't assume
it's on a dedicated machine. So whether it's 32MB on a dedicated 128MB machine
or 32MB on a 256MB machine where it's only expected to be half the workload of
the machine, it works out to about the same thing.

> Right now, my program doesn't fiddle with any memory settings if you've got
> less than 256MB of RAM.  

What I'm suggesting is that you shouldn't have to special case this. That you
should expect whatever formulas you're using to produce the same values as
initdb if they were run on the same machine initdb is targeting.

But actually I'm more concerned with the *non* memory related parameters. It
may make sense to tweak those one way or the other for oltp or dss but "mixed"
should be exactly what initdb produces since that's exactly what it's
targeting -- a system that will have a wide mixture of queries and must
function reasonably well for both data warehouse and oltp queries.

> Type      Segs   Completion   Max        Max
>                  Target       Segments   Usage
> web        8     0.7           23        368MB
> oltp      16     0.9           47        752MB
> dw        64     0.9          187        3GB
> mixed     16     0.9           47        752MB
> desktop    3     0.5            9        144MB

(incidentally using tab stops in emails is probably a bad idea because of
quoting as above)

Uhm, I hadn't actually seen this list before. I don't understand how "web" is
different from "oltp". A web service really is just one (very typical) example
of an oltp application.

And "desktop" seems like an outlier here. I suppose it's meant to capture
whether postgres is on a dedicated box? But it's possible to have a
non-dedicated oltp application or non-dedicated data warehouse box just as
easily. It's an orthogonal issue from the oltp/data-warehouse axis.

> Is 368MB of overhead unreasonable for a web application database today

Well I think it's more than most people expect a single application install to
take up before they start putting data in it. It would probably work better if
we were asking how big their database was and then could say, well, you said
you had 10G of data so 300MB of overhead isn't going to be so bad.


> The reality here is that it's the recovery playback time that's the real bear.

I agree, but then that's what checkpoint_timeout is for, no? It might take
longer to replay but the recovery time should bear some relation to how long
it took to write out the wal. More so than to the sheer size of the wal. 

> whereas complaints about the logs filling with checkpoint warnings on systems
> set to the default seem to pop up all the time.

"filling"? The cure to having too much space taken up by logs is to take up
space with, well, logs?

The logs are filling up with warnings which explain exactly what parameter to
adjust. Are there really complaints about this?

I'm really beginning to think the root of the problem is the name. If it were
"transaction_log_max_space" and measured in megabytes people would be happy to
say "ok, I'll make space for 100MB of logs" or whatever. Today they don't know
what to set it to or what the impact of setting it will be.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Wed, 3 Dec 2008, Robert Haas wrote:

> Then I tried "-T web" and got what seemed like a more reasonable set
> of values.  But I wasn't sure I needed that many connections, so I
> added "-c 150" to see how much difference that made.  Kaboom!

That and the import errors fixed in the version attached (just replacing 
the script, not its data), thanks for testing that out.

> I'm not sure what "mixed" mode is supposed to be, but based on what
> I've seen so far, I'm a skeptical of the idea that encouraging people
> to raise default_statistics_target to 50 and turn on
> constraint_exclusion is reasonable.

The statistics stuff is obviously a broader discussion, will let that rage 
in existing threads.  The reason for setting constraint_exclusion in the 
mixed case is that people who just pick the defaults without reading 
anything will get a configuration that supports partitions usefully.  One 
of the HINTs I intend to throw out for that specific case is that they 
should turn it off if they don't ever intend to use partitions.

The idea of the mixed mode is that you want to reduce the odds someone 
will get a massively wrong configuration if they're not paying attention. 
Is it worse to suffer from additional query overhead if you're sloppy with 
the tuning tool, or to discover that partitions didn't work as you 
expected?  That's a tough call; I could invert things, so that it defaults 
to off in mixed mode, as always, and just produces a HINT to turn it on. 
I don't have a really strong opinion there either way.
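
In terms of the generated file, either way it comes down to emitting one line 
plus a comment.  A sketch of the shape of that, with the helper name and HINT 
wording made up for illustration:

    # Illustrative only: emit the mixed-mode constraint_exclusion line along
    # with the HINT discussed above.
    def constraint_exclusion_line(workload):
        if workload == 'mixed':
            return ('constraint_exclusion = on   '
                    '# HINT: turn this off if you never use partitioned tables')
        return 'constraint_exclusion = off'

    print(constraint_exclusion_line('mixed'))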

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Greg Smith <gsmith@gregsmith.com> writes:

> Is it worse to suffer from additional query overhead if you're sloppy with
> the tuning tool, or to discover that partitions didn't work as you
> expected?

Surely that's the same question we faced when deciding what the Postgres
default should be? 

That and the unstated other question "Is someone more likely to use partitions
without reading the manual or not use partitions without reading the manual
about the down-sides of constraint_exclusion (in the partitioning
section....)"


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> On Thu, 2008-12-04 at 00:11 +0000, Gregory Stark wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>
>>  I
>> started to do this for you last week but got side-tracked. Do you have any
>> time for this?
>
> I can do it if you have a script.

Well, I can send you what I have so far but it still needs more work. I only
got as far as the graphs I sent earlier which don't include scanning for
changed plans. Also, if you have any sample databases with skewed data sets
that would be interesting.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about
EnterpriseDB's PostgreSQL training!
 


Re: Simple postgresql.conf wizard

From
Alvaro Herrera
Date:
Gregory Stark wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:

> > I don't think at any time I have said to my self, I am going to set this
> > parameter low so I don't fill up my disk. If I am saying that to myself
> > I have either greatly underestimated the hardware for the task. Consider
> > that we are quarreling over what amounts to a nominal amount of hard
> > drive space, 1000 checkpoint_segments = 1.6G of space. My phone has more
> > capacity than that.
> 
> Well my phone has 16G of RAM, why not 10000 ?

I don't think the disk space used is the only consideration here.  You
also have to keep recovery time in mind.  If you set it to 1000,
recovery would take way too long.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> On Thu, 2008-12-04 at 00:11 +0000, Gregory Stark wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>
>>  I
>> started to do this for you last week but got side-tracked. Do you have any
>> time for this?
>
> I can do it if you have a script.
>
>> So how big should a minimum postgres install be not including your data? 
>> Is 100M reasonable? Should we say Postgres requires 200M? 500? 1G? Whatever
>> number we pick (or allow the user to pick) will determine how large this value
>> ought to be. And incidentally also provide a bound on
>> autovacuum_max_freeze_age as Heikki pointed out on another thread.
>> 
>
> I fail to see what any of the above paragraph has to do with
> checkpoint_segments.

Are we all on the same page on what checkpoint_segments does? It's the number
of segments of WAL log postgres will allow to accumulate before it triggers a
checkpoint and trims off any it doesn't need. 

That means even if your database is just churning updating the same records
over and over the WAL will grow to this size before Postgres makes any attempt
to trim it (unless it hits checkpoint_timeout but that's a separate tunable).
If you're loading data all your data will go into the heap *and* the wal log
until it hits this size and triggers a checkpoint.

So this is the minimum amount of extra space, beyond the data itself, that a
functioning postgres database install needs. If you don't anticipate postgres
using this much space and set aside enough for it, your database is at risk of
randomly stopping and producing errors when it can't create the new log files
it needs.

It's interesting that this is the *only* parameter we can come up with that
really directly depends on disk space. The tables are obviously going to be as
big as they have to be and there's not much to do about that. If we could
eliminate this parameter it would be a lot nicer. 


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Wed, 2008-12-03 at 22:17 -0300, Alvaro Herrera wrote:
> Gregory Stark wrote:
> > "Joshua D. Drake" <jd@commandprompt.com> writes:
> 
> > > I don't think at any time I have said to my self, I am going to set this
> > > parameter low so I don't fill up my disk. If I am saying that to myself
> > > I have either greatly underestimated the hardware for the task. Consider
> > > that we are quarreling over what amounts to a nominal amount of hard
> > > drive space, 1000 checkpoint_segments = 1.6G of space. My phone has more
> > > capacity than that.
> > 
> > Well my phone has 16G of RAM, why not 10000 ?
> 
> I don't think the disk space used is the only consideration here.  You
> also have to keep recovery time in mind.  If you set it to 1000,
> recovery would take way too long.

Well certainly but the original argument that came back was, (from
Robert Haas):

"
It seems unlikely that you would want 256 MB of checkpoint segments on
a database that is only 100 MB (or even 500 MB).  But you might very
well want that on a database that is 1 TB.
"

My whole point is that:

1. It seems unlikely that you would hit 256MB of checkpoint segments on
a 100MB database before checkpoint_timeout, and if you did, you certainly
did need them (the checkpoint segments).

2. "taking up space" is such a minute concern in comparison to the
potential benefit.

Recovery is certainly a consideration but let's be realistic it is the
last consideration because it is the least likely to happen. What is
more likely to happen is IO spikes because we are recycling logs too
much.

I know we have some other facilities to deal with that now too but it
doesn't completely negate the problem and in my opinion, increasing the
checkpoint_segments provides no perceivable downside in production use
but does provide significant perceivable upside.

Joshua D. Drake




-- 
PostgreSQL  Consulting, Development, Support, Training  503-667-4564 - http://www.commandprompt.com/  The PostgreSQL
Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Thu, 4 Dec 2008, Gregory Stark wrote:

>> Right now, my program doesn't fiddle with any memory settings if you've got
>> less than 256MB of RAM.
>
> What I'm suggesting is that you shouldn't have to special case this. That you
> should expect whatever formulas you're using to produce the same values as
> initdb if they were run on the same machine initdb is targeting.

The reason that approach isn't taken is that the model here assumes the OS 
overhead is negligible relative to everything else going on.  If you've 
only got a small amount of RAM, that assumption is so badly broken that 
you can't just extend the curves for everything down to there and expect 
that what comes out will make any sense.  I started to make a more 
complicated bit that did scale down to the bottom by modeling the overhead 
better, but Josh talked me out of doing it for now.

> But actually I'm more concerned with the *non* memory related parameters. It
> may make sense to tweak those one way or the other for oltp or dss but "mixed"
> should be exactly what initdb produces since that's exactly what it's
> targeting -- a system that will have a wide mixture of queries and must
> function reasonably well for both data warehouse and oltp queries.

The only way this line of discussion will go is toward talking about what 
should be changed in initdb to make it more representative of the current 
real world, and I know that's not going anywhere (see 
"default_statistics_target=10").  The idea that the sample configuration 
is tuned usefully for any application whatsoever gets nothing from me but 
a chuckle.

> And "desktop" seems like an outlier here. I suppose it's meant to capture
> whether postgres is on a dedicated box? But it's possible to have a
> non-dedicated oltp application or non-dedicated data warehouse box just as
> easily.

That's the target for something that's not a dedicated server--a desktop 
PC you use as a general workstation, maybe you're installing PostgreSQL as 
a developer that's competing with your web server and other apps; 
something like that.  There might be a better name for that.

>> Is 368MB of overhead unreasonable for a web application database today
>
> Well I think it's more than most people expect a single application install to
> take up before they start putting data in it.

Segments don't get allocated until you churn through that much WAL 
activity; that figure is an upper-bound after you've pushed more than that 
worth of data through WAL and into the database.  The only example where 
this overhead isn't dwarfed by the size of the resulting database is where 
some small number of records are inserted, then constantly updated and 
vacuumed.  And you know what?  The person doing that is likely to really 
benefit from having checkpoint_segments set to a larger value.  Update and 
vacuum heavy workloads are exactly the sort where you end up checkpointing 
too often with the default parameters.

> I'm really beginning to think the root of the problem is the name. If it were
> "transaction_log_max_space" and measured in megabytes people would be happy to
> say "ok, I'll make space for 100MB of logs" or whatever. Today they don't know
> what to set it to or what the impact of setting it will be.

Unless they do something crazy like read the documentation:

http://www.postgresql.org/docs/8.3/static/wal-configuration.html
"There will always be at least one WAL segment file, and will normally not 
be more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 
files. Each segment file is normally 16 MB (though this size can be 
altered when building the server). You can use this to estimate space 
requirements for WAL."

Too complicated for most people you say?  I agree; that's why I put some 
annotated examples of what those translate into at 
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server like "32 
(checkpoint every 512MB)".
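
The arithmetic behind those annotations is nothing more than this; a 
back-of-the-envelope helper rather than code from the tool, assuming the 
standard 16MB segment size:

    import math

    # Worst-case WAL disk space from the documented formula:
    #   (2 + checkpoint_completion_target) * checkpoint_segments + 1 files,
    # each normally 16MB.  A checkpoint is also triggered every
    # checkpoint_segments * 16MB of WAL written.
    def wal_bounds_mb(checkpoint_segments, completion_target):
        max_files = math.ceil((2 + completion_target) * checkpoint_segments + 1)
        return checkpoint_segments * 16, int(max_files * 16)

    print(wal_bounds_mb(3, 0.5))    # defaults: checkpoint every 48MB, <=144MB
    print(wal_bounds_mb(32, 0.9))   # "checkpoint every 512MB", <=1504MB of WAL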

What fun.  I'm beginning to remember why nobody has ever managed to 
deliver a community tool that helps with this configuration task before.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> The idea of the mixed mode is that you want to reduce the odds someone will
> get a massively wrong configuration if they're not paying attention. Is it
> worse to suffer from additional query overhead if you're sloppy with the
> tuning tool, or to discover addition partitions didn't work as you expected?
>  That's a tough call; I could invert things, so that it defaults to off in
> mixed mode, as always, and just produces a HINT to turn it on. I don't have
> a really strong opinion there either way.

I think that the strong feelings about default_statistics_target and
constraint_exclusion come from the fact that when they are too low (in
the first case) or off (in the second case), you can get very, very
bad query plans.  The penalties in the opposite direction are more
subtle.  If they're so subtle that we don't care about incurring them,
then let's change initdb too.  If not, then let's not have the tuning
tool generate them by default either.

...Robert


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> What fun.  I'm beginning to remember why nobody has ever managed to deliver
> a community tool that helps with this configuration task before.

I have to say I really like this tool.  It may not be perfect but it's
a lot easier than trying to do this analysis from scratch.  And we are
really only arguing about a handful of settings.

It wouldn't take a lot to convince me that checkpoint_segments=3 is
too low.  I easily blew through that testing the bulk-insert tuning
patch.  I'm curious why wal_buffers is being set to 512 *
checkpoint_segments.  Are they related?  The default value for
wal_buffers is only 64 kB, which means someone thought you shouldn't
need much space for this at all, but this suggests a setting in the
4-32 MB range, an increase of ~2 orders of magnitude.  For all I know
that could be right but it's a big increase.

Regarding the religious war now in progress, I think it would be
awfully good for someone to offer some thoughts on how to figure out
which particular columns on which particular tables need a higher
statistics target.  That might allow us to either (a) build a wizard
that helps you find those problems that could perhaps be used
alongside this one or (b) incorporate those same smarts into core.

...Robert


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Greg Smith <gsmith@gregsmith.com> writes:

> On Thu, 4 Dec 2008, Gregory Stark wrote:
>
>> What I'm suggesting is that you shouldn't have to special case this. That you
>> should expect whatever formulas you're using to produce the same values as
>> initdb if they were run on the same machine initdb is targeting.
>
> The reason that approach isn't taken is that the model here assumes the OS
> overhead is negligible relative to everything else going on.  

ok that does make sense. But the non-memory parameters...

... I've cut part of my response for a separate thread ...

>> And "desktop" seems like an outlier here. I suppose it's meant to capture
>> whether postgres is on a dedicated box? But it's possible to have a
>> non-dedicated oltp application or non-dedicated data warehouse box just as
>> easily.
>
> That's the target for something that's not a dedicated server--a desktop PC you
> use as a general workstation, maybe you're installing PostgreSQL as a developer
> that's competing with your web server and other apps; something like that.
> There might be a better name for that.

My point was more that you could have a data warehouse on a non-dedicated
machine, you could have a web server on a non-dedicated machine, or you could
have a mixed server on a non-dedicated machine. I don't see how you would
decide whether to set constraint_exclusion for "desktop", for example.

>>> Is 368MB of overhead unreasonable for a web application database today
>>
>> Well I think it's more than most people expect a single application install to
>> take up before they start putting data in it.
>
> Segments don't get allocated until you churn through that much WAL activity;
> that figure is an upper-bound after you've pushed more than that worth of data
> through WAL and into the database.  The only example where this overhead isn't
> dwarfed by the size of the resulting database 

Right, well, no, it won't be dwarfed -- it'll be about the same size. Ie, if
you load 100MB into the database there'll be about 100MB of logs generated. Up
to the point where you hit this maximum upper bound.

But yes, right that it's the upper bound for the extra space allocated in
addition to the size of the database. And how much extra space should we
allocate? 

I don't see why this extra space bound should depend on the type of workload,
OLTP vs. DSS.  It should only depend on how much disk space is available that
the admin is willing to dedicate to Postgres.  Assuming that the admin of a 1TB
server is willing to dedicate 1GB to logs, and that the admin of a 1GB server
would be annoyed at having to throw in more than a few hundred megs, seems as
reasonable a place as any to start.

> is where some small number of records are inserted, then constantly updated
> and vacuumed. And you know what? The person doing that is likely to really
> benefit from having checkpoint_segments set to a larger value. Update and
> vacuum heavy workloads are exactly the sort where you end up checkpointing
> too often with the default parameters.

Well there are a few problems with this. a) we have HOT now so you don't need
any vacuums to be part of the picture. b) if you're updating the same pages
over and over again a checkpoint will be super-quick since there will only be
a few pages to write out so no you don't really need some large
checkpoint_segments for any performance reason.

Josh's logic is impeccable -- for the specific use case he's describing of a
truly dedicated server with enough disk space for a major production database.
But not every install is going to have gigabytes of space reserved for it and
not every admin is going to realize that he really should set aside gigabytes
of space even though he only expects his database to be a few megabytes.

>> I'm really beginning to think the root of the problem is the name. If it were
>> "transaction_log_max_space" and measured in megabytes people would be happy to
>> say "ok, I'll make space for 100MB of logs" or whatever. Today they don't know
>> what to set it to or what the impact of setting it will be.
>
> Unless they do something crazy like read the documentation:

Well we know nobody does that :/

It's great that Postgres has such great documentation but whenever we have the
chance to replace something with an option which doesn't need any
documentation that would be even better. I'm just exploring whether that's an
option here.

> What fun.  I'm beginning to remember why nobody has ever managed to deliver a
> community tool that helps with this configuration task before.

Well I don't think this is why. Nobody's even tried to do this side of things
before. They always got bogged down in trying to parse config files and such.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Thu, 4 Dec 2008, Gregory Stark wrote:

> Greg Smith <gsmith@gregsmith.com> writes:
>
>> Is it worse to suffer from additional query overhead if you're sloppy with
> the tuning tool, or to discover that partitions didn't work as you
>> expected?
>
> Surely that's the same question we faced when deciding what the Postgres
> default should be?

Gosh, you're right.  I'm really new here, and I just didn't understand how 
things work.  I should have known that there was lots of thorough research 
into that setting before the default was set.  (hangs head in shame)

Wait, what list am I on?  pgsql-hackers?  Oh, crap, that can't be right at 
all then.  This one is actually an interesting example of how this stuff 
ends up ossified without being revisited, I'm glad you brought it up.

First we have to visit the 8.1 and 8.2 documentation.  There we find the 
real reason it originally defaulted to off:

http://www.postgresql.org/docs/8.1/static/runtime-config-query.html 
"Currently, constraint_exclusion is disabled by default because it risks 
incorrect results if query plans are cached -- if a table constraint is 
changed or dropped, the previously generated plan might now be wrong, and 
there is no built-in mechanism to force re-planning."  It stayed off for 
that reason for years.

Then the plan invalidation stuff went into 8.3 that made this no longer 
true.  Bruce even removed the item from the TODO list that used to say 
that constraint_exclusion should be improved to "allow it to be used for 
all statements with little performance impact".  Then a couple of months 
later, when the 8.3 docs were being worked on, Tom updated the text to 
remove the obsolete warning about the plan risks:

http://archives.postgresql.org/pgsql-committers/2007-03/msg00372.php

Leaving only the leftovers of the original caveat about how it can also 
cause some overhead as the reason for why it was still off--a concern 
which was certainly more serious when that text was written in 2005 than 
it is today for multiple reasons.

How much was that overhead lowered by the work done in 8.3?  I can't find 
any public information suggesting that was ever even discussed.  The only 
thing I found when poking around looking for it is that Tom had expressed 
some concerns that the proof overhead was still too large back in 2006: 
http://archives.postgresql.org/pgsql-committers/2006-02/msg00035.php

But you know what?  The cached proof comparison bit Tom committed a couple 
of weeks ago shifted the mechanics of the overhead for this specific case 
around, so even if we did have 8.3 results they'd need to get re-run at 
this point anyway.  See below for more on what might be different soon.

So, if you want to say that turning on constraint_exclusion by default is 
a horrible idea because it adds significant overhead, and you have any 
sort of evidence that will still be true for 8.4 on the kind of hardware 
8.4 is likely to run on, I would greatly appreciate that information.

But presuming that serious thought must have gone into every decision made 
about the defaults for all the performance-related parameters in 
postgresql.conf is something we all know just ain't so.  What I see is a 
parameter that doesn't add enough overhead, relative to query execution 
time on today's systems, for me to have noticed whether it was on or off; 
one that's set to off only by historical accident combined with basic 
conservatism (mainly from Tom as far as I can tell, he's a nice reliable 
source for that).  Whereas if it's accidentally set wrong, it can lead to 
massively wrong plans.  I'm not sure what the right move here is, but the 
appeal to authority approach for defending the default here isn't going to 
work on me.

> That and the unstated other question "Is someone more likely to use partitions
> without reading the manual or not use partitions without reading the manual
> about the down-sides of constraint_exclusion (in the partitioning
> section....)"

Have you started thinking about the implications of 
http://archives.postgresql.org/message-id/a301bfd90810310750pf108c69x36499546f406650f@mail.gmail.com 
yet?  We're approaching a bold new world of people who partition without 
spending as much time stuck in the manual first, and I was very much thinking 
about that when mulling over whether I agreed with Josh's suggestion to put 
that into the default mixed settings before I went with it (that's right--I 
wrote all the above and it wasn't even my idea originally).  If that doesn't 
make it into 8.4, I will yield to your statement that the boring, 
manual-reading status quo is still on target.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Greg Smith <gsmith@gregsmith.com> writes:

> On Thu, 4 Dec 2008, Gregory Stark wrote:
>
>> Greg Smith <gsmith@gregsmith.com> writes:
>>
>>> Is it worse to suffer from additional query overhead if you're sloppy with
>>> the tuning tool, or to discover that partitions didn't work as you
>>> expected?
>>
>> Surely that's the same question we faced when deciding what the Postgres
>> default should be?
>
> Gosh, you're right.  I'm really new here, and I just didn't understand how
> things work.  I should have known that there was lots of thorough research into
> that setting before the default was set.  (hangs head in shame)

Oh no, in this case I meant just that if we want to change it we should change
it in *both* places. That the argument you're making applies just as much to
the Postgres default as it does to the "mixed" workload default in the tuner.

But I admit I totally didn't remember that the main reason it was originally
off was the lack of plan invalidation. That does rather change things. Perhaps
we should be enabling it now.

If we do though, it shouldn't default one way and then get randomly flipped by
a tool that has the same information to make its decision on. What I'm saying
is that "mixed" is the same information that initdb had about the workload.

If we do change this then I wonder if we need the parameter at all. I mean, we
don't generally have parameters to turn off random parts of the optimizer...

> How much was that overhead lowered by the work done in 8.3?  I can't find any
> public information suggesting that was ever even discussed.  

Well it does have to compare every constraint with every clause and do a
moderately complex analysis. It's never going to be super-fast for complex
queries. But on the other hand it should drop out pretty fast if the tables
haven't got any constraints so it does seem like it's only hurting people when
they would want it on anyways.

> What I see is a parameter that doesn't add enough overhead relative to query
> execution time on today's systems that I've noticed whether it was on or off

There's a danger in this. There's some famous, probably apocryphal, example of
a fast food restaurant that taste tested their menu and got great results.
Then they cheapened an ingredient and their testers couldn't taste the
difference. Then they did that with another ingredient and another and so on
and each time the testers couldn't taste a difference. And in the end they
ended up rolling out a dramatically inferior menu which people panned compared
to the original...

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about
EnterpriseDB's PostgreSQL training!
 


Re: Simple postgresql.conf wizard

From
"Mark Wong"
Date:
On Mon, Dec 1, 2008 at 9:32 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Mon, 1 Dec 2008, Mark Wong wrote:
>
>> So then I attempted to see if there might have been a difference between the
>> execution time of each individual query with the above parameters. The
>> queries that don't seem to be affected are Q1, Q4, Q12, Q13, and Q15.  Q17
>> suggests that anything higher than default_statistics_target=10 is an
>> improvement.  The rest of the queries appear not to follow any particular
>> trend with respect to default_statistics_target.
>
> The interesting ones are Q2, Q9, Q17, Q18, and Q20, and that data is much
> more useful than the summary.  As you mention, Q17 improves significantly
> with a higher target.  All of the rest are dramatically slower in one or
> both tests going from default_statistics_target=10 to 100.  Those look like
> the most useful data points on the X axis--the increases from 100 up to 1000
> aren't particularly interesting in most of these, except in Q20 where the
> "Power Test" seems to oscillate between degrees of good and bad behavior
> seemingly at random.
>
> My picks for the most useful graphs from the long list Mark sent:
>
> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q2.png
> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q9.png
> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q17.png
> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q18.png
> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q20.png
>
> I think the tests you could consider next is to graph the target going from
> 10 to 100 in steps of 10 just for those 5 queries.  If it gradually
> degrades, that's interesting but hard to nail down.  But if there's a sharp
> transition, getting an explain plan for the two sides of that should provide
> some insight.  I'm really more interested in the ones that slowed down than
> the one that improved, understanding that might finally provide some
> evidence against increasing it by default.

I've updated the charts to include results from setting
default_statistics_target from 20-90.  The links to the charts are the
same.  The links to the raw data are in
http://207.173.203.223/~markwkm/community6/dbt3/ in directories 45-52.

Regards,
Mark


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> If we do though, it shouldn't default one way and then get randomly flipped by
> a tool that has the same information to make its decision on. What I'm saying
> is that "mixed" is the same information that initdb had about the workload.

+1.

> If we do change this then I wonder if we need the parameter at all. I mean, we
> don't generally have parameters to turn off random parts of the optimizer...

It probably isn't a good idea to both change the default setting and
remove the parameter in the same release.  It would be awesome if this
is cheap enough now to have it on by default - but constraints are
pretty useful for maintaining data integrity, so it's conceivable to
me that someone could have a lot of constraints most of which are
unuseful for query planning.

...Robert


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Thu, 4 Dec 2008, Gregory Stark wrote:

> My point was more that you could have a data warehouse on a 
> non-dedicated machine, you could have a web server on a non-dedicated 
> machine, or you could have a mixed server on a non-dedicated machine.

I should just finish the documentation, where there will be a big 
disclaimer saying "THESE SETTINGS ASSUME A SERVER DEDICATED TO 
POSTGRESQL!"  That's the context here.  Why, after you follow my tuning 
instructions, you're lucky if the server will run anything but the 
database afterwards.

> Josh's logic is impeccable -- for the specific use case he's describing of a
> truly dedicated server with enough disk space for a major production database.
> But not every install is going to have gigabytes of space reserved for it and
> not every admin is going to realize that he really should set aside gigabytes
> of space even though he only expects his database to be a few megabytes.

It's really quite simple.  Josh and I don't care directly about disk space 
used by the WAL for people with trivial databases.  At all.  Whatsoever. 
Maybe once, long ago, when we were young and frugal and skinny[1]; not 
now, or probably ever again in the future.  If that's your concern, maybe 
there can be some companion utility named pgmiser that lowers parameters 
back down again.  Your mascot can be some sort of animal that efficiently 
lives off small scraps of food or something.[2]

The context here is pgtune, which is aiming to make a fat elephant of a 
server faster so that there's an answer to people who say "My benchmarks 
are all running really slow, is this because my system with 16PT of RAM is 
only using 32MB of it for the database?  This sucks, I'm going back to 
Oracle which used all my RAM."  If there are people who instead think, 
"hey, I'll run this tuning utility to make my database faster, then it 
will also be a lot smaller!", maybe we can find a class about space/time 
tradeoffs in algorithm design to send them to or something.[3]

There are exactly two important things here.  The first is how large 
checkpoint_segments needs to be in order for the considerable overhead 
of checkpoints to be bearable.  That drives the setting up.  Our super-fat 
DW application gets set to at least 64 so that when you bulk-load another 
TB of data into it, that doesn't get bottlenecked dumping gigabytes of 
dirty buffers every few seconds.  If the database crashes and recovery 
reads or writes a bunch of data, who cares about random writes because 
your SAN has a 4GB write cache on it and dozens of drives slaving away.

Driving the setting down is knowing how much time you'll have to wait for 
recovery to happen, which is really a measure of what your tolerance for 
downtime is.  We're thinking that someone who picks the Desktop tuning may 
have no tolerance for the database to be sluggish coming back up after 
Windows crashed and they rebooted, so tiny setting for them to make 
recovery super fast.

Everybody else in our sample profiles fall in the middle of those two 
extremes, which is why the values curve the way they do.  Web app? 
Probably not a lot of write volume, probably trouble if it's down a long 
time; how about 8, on the low side, but it gives checkpoints more time to 
spread out their I/O so worst-case latency isn't as bad.  That's the sort 
of analysis those numbers come from.  Do performance tuning and juggle 
these trade-offs for new people long enough and you get a gut feel for the 
right ballpark an app should start at based on its type. 
The whole idea behind this tool is that we're taking some of that hard-won 
knowledge and trying to automate the distribution of it.
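
In code, that hard-won gut feel ends up as nothing fancier than a lookup 
table.  The values below are the ones from the table posted earlier in the 
thread; the structure itself is just illustrative:

    # Per-workload starting points: (checkpoint_segments, completion_target)
    CHECKPOINT_SETTINGS = {
        'web':     (8,  0.7),
        'oltp':    (16, 0.9),
        'dw':      (64, 0.9),
        'mixed':   (16, 0.9),
        'desktop': (3,  0.5),
    }

    segs, target = CHECKPOINT_SETTINGS['web']
    print('checkpoint_segments = %d' % segs)
    print('checkpoint_completion_target = %.1f' % target)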

> It's great that Postgres has such great documentation but whenever we have the
> chance to replace something with an option which doesn't need any
> documentation that would be even better. I'm just exploring whether that's an
> option here.

I would be glad to have a post-CommitFest discussion of this very topic as 
it's quite a pain to me in its current form.  Just not right now because 
it's too late to touch it.

> Nobody's even tried to do this side of things before. They always got 
> bogged down in trying to parse config files and such.

It's actually because most of them were working in Perl, which encourages 
deviant behavior where people delight in converting useful ideas into 
illegible punctuation rather than actually getting anything done.  Except 
for that other Greg around here who's not involved in this discussion, his 
Perl is pretty good.

[1] Josh is being aggressively bulked up right now for his next sumo 
match.

[2] Like a rat, which would give you an excuse to add the long overdue 
PL/Ratfor.

[3] This wouldn't actually help them learn anything, but it would make 
their heads explode at which point all their problems are gone.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
>> I think the tests you could consider next is to graph the target going from
>> 10 to 100 in steps of 10 just for those 5 queries.  If it gradually
>> degrades, that's interesting but hard to nail down.  But if there's a sharp
>> transition, getting an explain plan for the two sides of that should provide
>> some insight.  I'm really more interested in the ones that slowed down than
>> the one that improved, understanding that might finally provide some
>> evidence against increasing it by default.
>
> I've updated the charts to include results from setting
> default_statistics_target from 20-90.  The links to the charts are the
> same.  The links to the raw data are in
> http://207.173.203.223/~markwkm/community6/dbt3/ in directories 45-52.

I still think we're missing the boat here because it's not really the
same query every time.

...Robert


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Wed, 3 Dec 2008, Mark Wong wrote:

>> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q2.png
>> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q9.png
>> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q17.png
>> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q18.png
>> http://207.173.203.223/~markwkm/pgsql/default_statistics_target/q20.png
>>
> I've updated the charts to include results from setting
> default_statistics_target from 20-90.

Interesting.

Q2: Explodes hard when the target hits 100, but not before.  Would be 
interesting to compare the explain plans at, say, 50 and 150 to see if 
it's possible to nail down what is shifting so much.

Q3: goes bonkers as soon as the target hits 20, so the interesting plans 
to compare are 10 and 20.

Q17:  our one from the original set that improved a bunch with the larger 
target gets all that benefit just from going to 20.  Would be interesting 
to compare the plans at 10 and 20 to see what changed so much with such a 
small difference.

Q18:  looks like it was in some sort of local bad area around 100-400 
before, with some more context that one doesn't look interesting anymore.

Q20:  also doesn't look very interesting anymore.  The results at 10 were 
nice, and the ones at 100 were among the unusually bad ones, but it's 
pretty random--if there was something inherently bad related to the stats, 
there wouldn't be low points around 200.

Out of those, the most interesting one to me (as someone who is trying to 
defend raising the target some but not going crazy with that) is Q3.  The 
reason I say that is that everything else is better or basically the same 
raising the target from 10, as long as you don't go too high (>=100). 
That one falls apart immediately with a larger target which seems weird.
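
Capturing the plans for that kind of comparison doesn't take much.  A rough 
sketch, assuming psycopg2 is available and with placeholder database, query, 
and output file names:

    import psycopg2

    query = open('q3.sql').read()          # placeholder: the query under test

    conn = psycopg2.connect('dbname=dbt3')
    conn.set_isolation_level(0)            # autocommit: no explicit transaction
    cur = conn.cursor()

    for target in (10, 20):
        cur.execute('SET default_statistics_target = %d' % target)
        cur.execute('ANALYZE')             # re-gather statistics at this target
        cur.execute('EXPLAIN ' + query)
        plan = '\n'.join(row[0] for row in cur.fetchall())
        open('q3-plan-%d.txt' % target, 'w').write(plan)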

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
Heikki Linnakangas
Date:
Alvaro Herrera wrote:
> Gregory Stark wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
> 
>>> I don't think at any time I have said to my self, I am going to set this
>>> parameter low so I don't fill up my disk. If I am saying that to myself
>>> I have either greatly underestimated the hardware for the task. Consider
>>> that we are quarreling over what amounts to a nominal amount of hard
>>> drive space, 1000 checkpoint_segments = 1.6G of space. My phone has more
>>> capacity than that.
>> Well my phone has 16G of RAM, why not 10000 ?
> 
> I don't think the disk space used is the only consideration here.  You
> also have to keep recovery time in mind.  If you set it to 1000,
> recovery would take way too long.

Presumably if you set checkpoint_segments to a high value, you'd use 
checkpoint_timeout to limit recovery time.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
Greg Smith <gsmith@gregsmith.com> writes:

> On Thu, 4 Dec 2008, Gregory Stark wrote:
>
>> My point was more that you could have a data warehouse on a non-dedicated
>> machine, you could have a web server on a non-dedicated machine, or you could
>> have a mixed server on a non-dedicated machine.
>
> I should just finish the documentation, where there will be a big disclaimer
> saying "THESE SETTINGS ASSUME A SERVER DEDICATED TO POSTGRESQL!"  That's the
> context here.  Why, after you follow my tuning instructions, you're lucky if
> the server will run anything but the database afterwards.

So you're getting rid of the "desktop" mode altogether? That's more drastic
than I was suggesting. I was suggesting that you want to separate "desktop"
into a separate option from the workload list. 

What users are getting now is kind of like asking users "Would you like a
small, medium, large, or diet Coke?"


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> Gregory Stark <stark@enterprisedb.com> wrote:

> That sounds like it would be an interesting query to analyze in more detail.
> Is there any chance you could run the complete graph and get a chart of
> analyze times for all statistics values from 1..1000 ?  And log the explain
> plans to a file so we can look for at what statistics targets the plan
> changed?
>
> Or if the data is public I would be interested in looking at doing it if you
> want to send it to me.

There are some very big tables in that query which contain some
confidential data.  It would be hard to do a lot of runs at high 
default_statistics_target values because the database analyze time
goes so high.  If I pick out which tables are used by the query, I
might be able to put a script together which loops through analyze of
those tables with different targets and capturing run time.

There are two problems -- finding the time to set this up, and finding
server time windows where other things wouldn't be distorting the
results.  If you could help with setting up the test script, that
would go a long way toward solving the first problem.  I think I could
reserve a smaller multi-CPU machine with identical data but slower
CPUs to run the test.
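
The loop I'm picturing is roughly this; psycopg2 is assumed, and the table 
names below are placeholders rather than anything from our real schema:

    import time
    import psycopg2

    TABLES = ['some_big_table', 'another_table']    # placeholder names
    TARGETS = [10, 20, 50, 100, 200, 500, 1000]

    conn = psycopg2.connect('dbname=mydb')          # placeholder connection
    conn.set_isolation_level(0)                     # autocommit: no explicit transaction
    cur = conn.cursor()

    for target in TARGETS:
        cur.execute('SET default_statistics_target = %d' % target)
        start = time.time()
        for table in TABLES:
            cur.execute('ANALYZE %s' % table)
        print('target %d: analyze took %.1f seconds' % (target, time.time() - start))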

I'll attach the query and plan.  You'll note that the query looks a
little odd, especially all the (1=1) tests.  This is because the
application allows users to plug in a variety of selection criteria,
and any that aren't used are stubbed out that way.  I picked one that
was not too atypical for the 300,000 runs per day.  I intentionally
didn't "clean up" the white space, but left it just as it was emitted
by our framework, in case any of that affected parse/plan time.

If you need schema info not obvious from the plan, let me know.

-Kevin



Attachment

Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

> There are some very big tables in that query which contain some
> confidential data.  

oh well.

> I'll attach the query and plan.  You'll note that the query looks a
> little odd, especially all the (1=1) tests.  

That is interesting. I seem to recall Tom is resistant to trying to optimize
such queries but actually I've written lots of queries like that myself so I
find them interesting.

I'll look at the query and see if I can write a similar one using dbt3.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> Gregory Stark <stark@enterprisedb.com> wrote:
> And log the explain plans to a
> file so we can look for at what statistics targets the plan changed?

Well, I can give you explain analyze output for
default_statistics_target 10 and 50, for whatever that's worth.

Unfortunately I blew my save from the first run with target 50, but it
ran much faster than the first run at target 10 (attached); I think it
was about 2.8 seconds.  That may or may not be due to the right pages
being cached by coincidence.  The machine with the target 50 run was
serving the web app at the time, so there was significant other load,
while the other was idle except as a replication target at the time of
the run.

-Kevin



Attachment

Re: Simple postgresql.conf wizard

From
Ron Mayer
Date:
Greg Smith wrote:
> I'm not the sort to be too concerned myself that 
> the guy who thinks he's running a DW on a system with 64MB of RAM might 
> get bad settings, but it's a fair criticism to point that out as a problem.

In defense of thinking about very small configurations, I've seen many
cases where an enterprise-software salesperson's laptop is running a
demo - either in a small virtual machine in the laptop, or on an
overloaded windows box.    Even though the customer might end up
running with 64GB, the sales demo's more likely to be 64MB.

In some ways, I'd be more interested in seeing automation of those than
the large production systems.   Large production systems are likely to
have an administrator who's paid to read the documentation and learn
how to configure the database.  OTOH there tends to be less DBA time
available to tune the smaller demo instances that come&go as sales
people upgrade their laptops; so improved automation would be much
appreciated there.



Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Thu, 2008-12-04 at 10:20 -0800, Ron Mayer wrote:
> Greg Smith wrote:
> > I'm not the sort to be too concerned myself that 
> > the guy who thinks he's running a DW on a system with 64MB of RAM might 
> > get bad settings, but it's a fair criticism to point that out as a problem.
> 
> In defense of thinking about very small configurations, I've seen many
> cases where an enterprise-software salesperson's laptop is running a
> demo - either in a small virtual machine in the laptop, or on an
> overloaded windows box.    Even though the customer might end up
> running with 64GB, the sales demo's more likely to be 64MB.

Although I get your point, that is a job for SQLite, not PostgreSQL.
PostgreSQL is not an end-all be-all solution and it is definitely not
designed to be "embedded" which is essentially what you are suggesting
with that kind of configuration.

Joshua D. Drake


-- 
PostgreSQL  Consulting, Development, Support, Training  503-667-4564 - http://www.commandprompt.com/  The PostgreSQL
Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
Greg Stark
Date:
Well that's a bit of hyperbole. There's a gulf of difference between  
an embedded use case -- where it should fit within an acceptable  
footprint for a desktop app component of maybe a megabyte or so of RAM  
and disk -- and, if we're being generous, saying it should run comfortably  
without having to spec out special server hardware for a demo.

That said, 64MB of RAM seems like hyperbole too. My NSLU2 has 32MB...

greg

On 4 Dec 2008, at 06:28 PM, "Joshua D. Drake" <jd@commandprompt.com>  
wrote:

> On Thu, 2008-12-04 at 10:20 -0800, Ron Mayer wrote:
>> Greg Smith wrote:
>>> I'm not the sort to be too concerned myself that
>>> the guy who thinks he's running a DW on a system with 64MB of RAM  
>>> might
>>> get bad settings, but it's a fair criticism to point that out as a  
>>> problem.
>>
>> In defense of thinking about very small configurations, I've seen  
>> many
>> cases where an enterprise-software salesperson's laptop is running a
>> demo - either in a small virtual machine in the laptop, or on an
>> overloaded windows box.    Even though the customer might end up
>> running with 64GB, the sales demo's more likely to be 64MB.
>
> Although I get your point, that is a job for SQLite, not PostgreSQL.
> PostgreSQL is not an end-all be-all solution and it is definitely not
> designed to be "embedded" which is essentially what you are suggesting
> with that kind of configuration.
>
> Joshua D. Drake
>
>
> -- 
> PostgreSQL
>   Consulting, Development, Support, Training
>   503-667-4564 - http://www.commandprompt.com/
>   The PostgreSQL Company, serving since 1997
>


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
>> In defense of thinking about very small configurations, I've seen many
>> cases where an enterprise-software salesperson's laptop is running a
>> demo - either in a small virtual machine in the laptop, or on an
>> overloaded windows box.    Even though the customer might end up
>> running with 64GB, the sales demo's more likely to be 64MB.
>
> Although I get your point, that is a job for SQLite, not PostgreSQL.
> PostgreSQL is not an end-all be-all solution and it is definitely not
> designed to be "embedded" which is essentially what you are suggesting
> with that kind of configuration.

It's unlikely that someone would want to write a demo version of the
software that runs on a completely different database than the
production one...  it's also totally unnecessary because PostgreSQL
runs great on small systems.  In fact, it runs great on small systems
with NO TUNING AT ALL.  That's exactly why a wizard is needed to set
values for systems that aren't small.

...Robert


Re: Simple postgresql.conf wizard

From
Ron Mayer
Date:
Joshua D. Drake wrote:
> On Thu, 2008-12-04 at 10:20 -0800, Ron Mayer wrote:
>> Greg Smith wrote:
>>> I'm not the sort to be too concerned myself that 
>>> the guy who thinks he's running a DW on a system with 64MB of RAM might 
>>> get bad settings, but it's a fair criticism to point that out as a problem.
>> In defense of thinking about very small configurations, I've seen many
>> cases where an enterprise-software salesperson's laptop is running a
>> demo - either in a small virtual machine in the laptop, or on an
>> overloaded windows box.    Even though the customer might end up
>> running with 64GB, the sales demo's more likely to be 64MB.
> 
> Although I get your point, that is a job for SQLite, not PostgreSQL.
> PostgreSQL is not an end-all be-all solution and it is definitely not
> designed to be "embedded" which is essentially what you are suggesting
> with that kind of configuration.

But these sales people are selling a postgres based product.  It'd be
both much less convincing to demo a different application stack and
not a very productive use of the developer's time.



Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Thu, 2008-12-04 at 10:55 -0800, Ron Mayer wrote:
> Joshua D. Drake wrote: 
> > Although I get your point, that is a job for SQLite, not PostgreSQL.
> > PostgreSQL is not an end-all be-all solution and it is definitely not
> > designed to be "embedded" which is essentially what you are suggesting
> > with that kind of configuration.
> 
> But these sales people are selling a postgres based product. It'd be
> both much less convincing to demo a different application stack; as
> well as not a very productive use of the developer's time.

Fair enough, then make sure you are demoing on a platform that can
handle PostgreSQL :)

Joshua D. Drake


> 
> 
-- 
PostgreSQL  Consulting, Development, Support, Training  503-667-4564 - http://www.commandprompt.com/  The PostgreSQL
Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> "Joshua D. Drake" <jd@commandprompt.com> wrote: 
> Fair enough, then make sure you are demoing on a platform that can
> handle PostgreSQL :)
There are a lot of good reasons for people to be running an instance
of PostgreSQL on a small machine, running it on a machine with other
software, or running many clusters of PostgreSQL on a single machine. 
It may not be possible for this tool to generate useful values for all
of these situations, but it seems to me that should be viewed as a
limitation of the tool, rather than some moral failing on the part of
the people with those needs.  Let's cover what we can, document the
limitations, and avoid any user-hostile tones....
-Kevin


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Thu, 2008-12-04 at 14:05 -0600, Kevin Grittner wrote:
> >>> "Joshua D. Drake" <jd@commandprompt.com> wrote: 
>  
> > Fair enough, then make sure you are demoing on a platform that can
> > handle PostgreSQL :)
>  
> There are a lot of good reasons for people to be running an instance
> of PostgreSQL on a small machine, running it on a machine with other
> software, or running many clusters of PostgreSQL on a single machine. 
> It may not be possible for this tool to generate useful values for all
> of these situations, but it seems to me that should be viewed as a
> limitation of the tool, rather than some moral failing on the part of
> the people with those needs.  Let's cover what we can, document the
> limitations, and avoid any user-hostile tones....

I didn't say don't run on a small machine :) I said make sure you run on
one that is up for the job. There is a difference.

Joshua D. Drake



>  
> -Kevin
> 
-- 
PostgreSQL  Consulting, Development, Support, Training  503-667-4564 - http://www.commandprompt.com/  The PostgreSQL
Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Thu, 4 Dec 2008, Ron Mayer wrote:

> OTOH there tends to be less DBA time available to tune the smaller demo 
> instances that come&go as sales people upgrade their laptops; so 
> improved automation would be much appreciated there.

I have a TODO list for things that might be interesting to add to a V2.0 
version of this tool.  I just added an item to there for extending the 
tuning model usefully into working on systems with smaller amounts of RAM. 
I'm not opposed to the idea, just don't have any background doing that and 
I'm trying to stay focused on the more common big-machine problems for the 
first release.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> Greg Smith <gsmith@gregsmith.com> wrote: 
> On Thu, 4 Dec 2008, Ron Mayer wrote:
> 
>> OTOH there tends to be less DBA time available to tune the smaller demo 
>> instances that come&go as sales people upgrade their laptops; so 
>> improved automation would be much appreciated there.
> 
> I have a TODO list for things that might be interesting to add to a V2.0 
> version of this tool.  I just added an item to there for extending the 
> tuning model usefully into working on systems with smaller amounts of RAM. 
> I'm not opposed to the idea, just don't have any background doing that and 
> I'm trying to stay focused on the more common big-machine problems for the 
> first release.

I think there needs to be some easy way to choose an option which
yields a configuration similar to what we've had in recent production
releases -- something that will start up and allow minimal testing on
even a small machine.

It also occurred to me that if initdb is generating its initial
configuration with this, some special handling might be needed for the
"make check" runs.  It isn't unusual to want to do a build and check
it on a production server.  If the generated configuration used in
regression tests is assuming it "owns the machine" there could be a
problem.
-Kevin


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Thu, 4 Dec 2008, Kevin Grittner wrote:

> I think there needs to be some easy way to choose an option which
> yields a configuration similar to what we've had in recent production
> releases -- something that will start up and allow minimal testing on
> even a small machine.

But that's the goal of what comes out of initdb already; I'm missing how 
that is something this script would even get involved in.  Is your 
suggestion to add support for a minimal target that takes a tuned-up 
configuration file and returns it to that state, or did you have something 
else in mind?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> Greg Smith <gsmith@gregsmith.com> wrote: 
> On Thu, 4 Dec 2008, Kevin Grittner wrote:
> 
>> I think there needs to be some easy way to choose an option which
>> yields a configuration similar to what we've had in recent production
>> releases -- something that will start up and allow minimal testing on
>> even a small machine.
> 
> But that's the goal of what comes out of initdb already; I'm missing how 
> that is something this script would even get involved in.  Is your 
> suggestion to add support for a minimal target that takes a tuned-up 
> configuration file and returns it to that state, or did you have something 
> else in mind?
Perhaps I misunderstood some earlier post -- I had gotten the
impression that initdb was going to use this utility for the initial
postgresql.conf file.  If that's not happening, then you can ignore my
last post as noise, with my apologies.
-Kevin


Re: Simple postgresql.conf wizard

From
Josh Berkus
Date:
Greg Smith wrote:
> On Thu, 4 Dec 2008, Ron Mayer wrote:
> 
>> OTOH there tends to be less DBA time available to tune the smaller 
>> demo instances that come&go as sales people upgrade their laptops; so 
>> improved automation would be much appreciated there.
> 
> I have a TODO list for things that might be interesting to add to a V2.0 
> version of this tool.  I just added an item to there for extending the 
> tuning model usefully into working on systems with smaller amounts of 
> RAM. I'm not opposed to the idea, just don't have any background doing 
> that and I'm trying to stay focused on the more common big-machine 
> problems for the first release.

We *have* a configuration for small amounts of RAM; it's our current 
default configuration.

However, I take the point that the "workstation" calculations should 
work down to 128MB of system RAM.  I'll check.

--Josh


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
On Thu, Dec 4, 2008 at 5:11 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Thu, 4 Dec 2008, Kevin Grittner wrote:
>
>> I think there needs to be some easy way to choose an option which
>> yields a configuration similar to what we've had in recent production
>> releases -- something that will start up and allow minimal testing on
>> even a small machine.
>
> But that's the goal of what comes out of initdb already; I'm missing how
> that is something this script would even get involved in.  Is your
> suggestion to add support for a minimal target that takes a tuned-up
> configuration file and returns it to that state, or did you have something
> else in mind?

I humbly suggest that the memory-related settings output by the tool
don't need to match what initdb outputs.  But the values of
checkpoint_segments, constraint_exclusion, and
default_statistics_target probably should, given a small mixed-mode
database.

You've probably all figured out by now that I, personally, in my own
opinion, think that default_statistics_target = 10 is just fine for
such a database, but if a decision is made to change that number, so
be it.  Just let's please change it both places, rather than letting
contrib/pgtune be a backdoor to get around not liking what initdb
does.  And similarly with the other parameters...

...Robert


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Thu, 4 Dec 2008, Robert Haas wrote:

> Just let's please change it both places, rather than letting 
> contrib/pgtune be a backdoor to get around not liking what initdb does. 
> And similarly with the other parameters...

Someone running pgtune has specifically asked for their database to be 
tuned for performance; someone running initdb has not.  It's not a 
backdoor; the defaults for a tuned small system and what comes out of 
initdb have completely different priorities.  The linking of the two that 
keeps happening in this thread makes no sense to me, and frankly I 
consider the whole topic an off-topic distraction.

I never had any intention of making changes to the basic configuration 
that comes out of initdb, the burden of proof for making a change there is 
far higher than I feel justified in clearing.  The last time I got an 
initdb setting changed I had days worth of focused test data to present 
with the suggestion.

If I take a poll of half a dozen experienced PostgreSQL administrators 
with performance tuning background (which is basically where the pgtune 
settings are coming from) and I hear the same story about a setting from 
most of them, that's good enough for me to justify a settings change for 
this tool; the whole idea is to pool expert opinion and try to distill it 
into code.  But that's not good enough for changing that setting for 
everybody who installs the database.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Thu, 2008-12-04 at 21:51 -0500, Greg Smith wrote:
> On Thu, 4 Dec 2008, Robert Haas wrote:
> 
> > Just let's please change it both places, rather than letting 
> > contrib/pgtune be a backdoor to get around not liking what initdb does. 
> > And similarly with the other parameters...
> 
> Someone running pgtune has specifically asked for their database to be 
> tuned for performance; someone running initdb has not.  It's not a 
> backdoor,

Right.

>  the defaults for a tuned small system and what comes out of 
> initdb have completely different priorities.

IMO the priority of initdb is, "Get the damn thing running"

>   The linking of the two that 
> keeps happening in this thread makes no sense to me, and frankly I 
> consider the whole topic an off-topic distraction.
> 

Agreed.


Joshua D. Drake

-- 
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
Gregory Stark
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

> One more data point to try to help.
>  
> While the jump from a default_statistics_target from 10 to 1000
> resulted in a plan time increase for a common query from 50 ms to 310
> ms, at a target of 50 the plan time was 53 ms.  Analyze time was 7.2
> minutes and 18.5 minutes for targets of 10 and 50.  This is an 842 GB
> database on an 8 processor (3.5 GHz Xeon) machine with 64 GB RAM
> running (soon to be updated) PostgreSQL 8.2.7.

So my half-finished DBT3 tests showed hardly any change in planning time for
a default_statistics_target of 1000, and even Kevin Grittner's query, which is
the worst real example posted so far, only went from 50ms to 310ms.

So I started thinking perhaps substantially larger values might not have much
effect on planning at all. (Actual data is a lot more convincing than simple
assertions!).

Kevin's query was an OLTP query so 300ms is still way too much -- 300ms is on
the high end for OLTP response times for query *execution*. But seeing the
intermediate values would be interesting.

So I wondered what the worst-case would be for a synthetic case designed to
exercise the planner severely. This would also be useful for optimizing the
planner under gprof, though I suspect the hot spots are pretty obvious even
without empirical data.

So anyways, here's a script to create a table with a 75k pg_statistic record.
And a worst-case query where the plan time goes from 34ms to 1.2s for
histogram sizes between 10 and 1,000.

Looking at eqjoinsel I think it could be improved algorithmically if we keep
the mcv list in sorted order, even if it's just binary sorted order. But I'm
not sure what else uses those values and whether the current ordering is
significant. I'm also not sure it's the only O(n^2) algorithm there and
there's no algorithmic gain unless they're all knocked down.
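
As a minimal sketch of the algorithmic point (Python rather than the backend's
C; the function names and plain-list MCV representation are purely
illustrative), the difference is between pairwise matching of the two MCV
lists and a single merge-style pass over sorted copies:

def matched_fraction_quadratic(mcv1, mcv2):
    # Pairwise style: every value on one side is compared against every
    # value on the other side -- O(len(mcv1) * len(mcv2)) equality tests.
    matches = 0
    for v1 in mcv1:
        for v2 in mcv2:
            if v1 == v2:
                matches += 1
                break
    return matches / float(len(mcv1))

def matched_fraction_merge(mcv1, mcv2):
    # Sorted-list alternative: O(n log n) to sort, then a merge pass
    # costing O(len(mcv1) + len(mcv2)) comparisons.
    a, b = sorted(mcv1), sorted(mcv2)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches / float(len(a))

At a statistics target of 1000 the first form does on the order of a million
comparisons per join clause, the second a few thousand plus the sorts.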

Incidentally this timing is with the 75kB toasted arrays in shared buffers
because the table has just been analyzed. If it was on a busy system then just
planning the query could involve 75kB of I/O which is what I believe was
happening to me way back when I last observed super-long plan times.


postgres=# create table tk as select
  random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r
  from generate_series(1,1000);
postgres=# insert into tk (select * from tk);
postgres=# insert into tk (select * from tk);
postgres=# insert into tk (select
  random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r
  from generate_series(1,2000));
postgres=# alter table tk alter r set statistics 1000;
postgres=# analyze tk;

postgres=# select pg_column_size(stavalues1) from pg_statistic where starelid = 'tk'::regclass;
 pg_column_size 
----------------
          75484
(1 row)

postgres=# explain select count(*) from (select * from tk as k, tk as l, tk as m, tk as n, tk as o, tk as p
           where k.r=l.r and k.r=m.r and k.r=n.r and k.r=o.r and k.r=p.r) as x;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Aggregate  (cost=41358.14..41358.15 rows=1 width=0)
   ->  Merge Join  (cost=3213.13..37713.13 rows=1458000 width=0)
         Merge Cond: (k.r = l.r)
         ->  Merge Join  (cost=2677.61..14092.61 rows=486000 width=510)
               Merge Cond: (k.r = m.r)
               ->  Merge Join  (cost=2142.09..5862.09 rows=162000 width=408)
                     Merge Cond: (k.r = n.r)
                     ->  Merge Join  (cost=1606.57..2761.57 rows=54000 width=306)
                           Merge Cond: (k.r = o.r)
                           ->  Merge Join  (cost=1071.04..1371.04 rows=18000 width=204)
                                 Merge Cond: (k.r = p.r)
                                 ->  Sort  (cost=535.52..550.52 rows=6000 width=102)
                                       Sort Key: k.r
                                       ->  Seq Scan on tk k  (cost=0.00..159.00 rows=6000 width=102)
                                 ->  Sort  (cost=535.52..550.52 rows=6000 width=102)
                                       Sort Key: p.r
                                       ->  Seq Scan on tk p  (cost=0.00..159.00 rows=6000 width=102)
                           ->  Sort  (cost=535.52..550.52 rows=6000 width=102)
                                 Sort Key: o.r
                                 ->  Seq Scan on tk o  (cost=0.00..159.00 rows=6000 width=102)
                     ->  Sort  (cost=535.52..550.52 rows=6000 width=102)
                           Sort Key: n.r
                           ->  Seq Scan on tk n  (cost=0.00..159.00 rows=6000 width=102)
               ->  Sort  (cost=535.52..550.52 rows=6000 width=102)
                     Sort Key: m.r
                     ->  Seq Scan on tk m  (cost=0.00..159.00 rows=6000 width=102)
         ->  Sort  (cost=535.52..550.52 rows=6000 width=102)
               Sort Key: l.r
               ->  Seq Scan on tk l  (cost=0.00..159.00 rows=6000 width=102)
(29 rows)

Time: 1186.315 ms




-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!


Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> Gregory Stark <stark@enterprisedb.com> wrote: 
> Incidentally this timing is with the 75kB toasted arrays in shared buffers
> because the table has just been analyzed. If it was on a busy system then
> just planning the query could involve 75kB of I/O which is what I believe
> was happening to me way back when I last observed super-long plan times.
I'm pretty sure I saw that in some of my tests with larger targets. 
With a large database and a large target, some of the tables'
statistics apparently weren't still cached the first time I planned
the query, and I got an extremely long plan time on the first attempt,
and then it settled in within a pretty narrow range on repeated plans.
I discarded the initial plan time as immaterial for our purposes
because a query that's run 300,000 times per day is probably going to
keep its statistics in cache most of the time.
I was looking at trying to modify the perl script from Robert Haas to
run my query at a wide range of target values, collecting analyze and
plan times at each.  Now that you have an easy-to-create synthetic
example, is this still worth it, considering that it would be on 8.2? 
(If we wait a week or two, I could probably do it on 8.3.)  We do have
gprof on these systems, although I'd need advice on how to use it.
-Kevin


Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> Gregory Stark <stark@enterprisedb.com> wrote: 

> Incidentally, here's a graph of the explain time for that plan. It looks
> pretty linear to me
Except for that sweet spot between 50 and 100.
Any idea what's up with that?
-Kevin


Re: Simple postgresql.conf wizard

From
Josh Berkus
Date:
All,

I'm thinking that default_statistics_target is disputable enough that we 
want to move discussion of it to pgsql-performance, and for version 0.1 of 
the tuning wizard, exclude it.

-- 
--Josh

Josh Berkus
PostgreSQL
San Francisco


Re: Simple postgresql.conf wizard

From
"Nathan Boley"
Date:
Thanks for putting out pgtune - it's a sorely needed tool.

I had a chance to look over the source code and have a few comments,
mostly about python specific coding conventions.

- The Windows-specific try block ( line 16 ) raises a ValueError rather
than an ImportError on my Debian machine. Maybe it would be more appropriate
to explicitly test platform.system()=="Windows"?

- from ctypes import * ( 18 ) makes the block difficult to read and
pollutes the namespace.

- on line 45, the try block should probably catch exceptions derived
from Exception ( to avoid catching SystemExit and KeyboardInterrupt
errors ), i.e. except Exception: return None. Also, printing out the
exception in debug mode would probably be a good idea ( i.e. except
Exception, e: print e\ return None )

- all classes ( 58, 135, 205 ) are 'old-style' classes.  I don't see
any reason to use classic classes ( unless Python 2.1 is a support
goal? ) To make them 'new style' classes (
http://www.python.org/doc/2.5.2/ref/node33.html ) they should inherit
from object, i.e. class PGConfigLine(object):

- The doc strings ( 59, 136, 206 ) don't follow standard conventions,
described here http://www.python.org/dev/peps/pep-0257/.

- Functions also support doc strings ( 342, 351, etc. )

- Tuple unpacking doesn't require the tuple boundaries ( 446 and
others ). ie, options, args = ReadOptions()  works.

This is more of a style comment about the 'Java-ish interface' ( line
112 ), feel free to ignore it.

Overloading __str__ and __repr__ is the Python way to return string
representations of an object, i.e. instead of toString use __str__ and
then ( on 197 ) print l or print str(l) instead of print l.toString()

For the other methods ( getValue, getLineNumber, isSetting ) I'm
assuming you didn't call the attributes directly because you didn't want
them to be accidentally overwritten. Have you considered the use of
properties ( http://www.python.org/download/releases/2.2.3/descrintro/#property
)? Also, it would be clearer to mark attributes as private ( i.e.
_name or __name, _readable, _lineNumber, _setsParameter ) if you don't
want them to be accessed directly.
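
Pulling several of those suggestions together, here is a minimal illustrative
sketch -- not pgtune's actual code; the class name, attributes, and the
sysconf keys (the Linux names, which may not exist everywhere) are
assumptions, and it sticks to Python 2 syntax to match the tool:

import os
import platform

class PGConfigLine(object):              # new-style class: inherits from object
    """One line read from a postgresql.conf file (illustrative only)."""

    def __init__(self, text, line_number):
        self._text = text                # leading underscore: private by convention
        self._line_number = line_number

    # read-only property instead of a getLineNumber() accessor
    line_number = property(lambda self: self._line_number)

    def __str__(self):                   # instead of a toString() method
        return "%d: %s" % (self._line_number, self._text)

def total_memory_bytes():
    # Catch only Exception so SystemExit/KeyboardInterrupt still propagate.
    try:
        if platform.system() == "Windows":
            return None                  # the ctypes-based path would go here
        return os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
    except Exception, e:
        print "could not detect memory:", e
        return None

# print PGConfigLine("shared_buffers = 32MB", 115)  ->  115: shared_buffers = 32MB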

Hope my comments are useful! Thanks again for writing this.

-Nathan

P.S. I'd be happy to officially review this if it gets to that.


Re: Simple postgresql.conf wizard

From
Greg Smith
Date:
On Fri, 5 Dec 2008, Nathan Boley wrote:

> - all classes ( 58, 135, 205 ) are 'old-style' classes.  I dont see
> any reason to use classic classes ( unless Python 2.1 is a support
> goal? )

I'm not targeting anything older than 2.4, as that's the oldest version I 
have installed anywhere.  2.4 is still an important target because it's 
the version that comes with the very popular RHEL4.  I've worked on some 
SuSE 9 boxes that only have 2.3 as recently as last year, but I consider 
that a legacy version I don't care much about mainly because the kinds of 
programs I'm usually writing tend to rely heavily on using "subprocess", 
added in 2.4.

The answer to all of your other points is that I'm relatively new to 
Python and just didn't know any of that.  I'll convert to new style 
classes, use __str__ properly, and switch all the getters to use 
properties now that you've pointed me toward those.  Improving the 
rudimentary doc strings was already on my list, that's not a major target 
though because this project has its own documentation standards and I 
can't do anything useful with the output from pydoc.  I'm only interested 
in documenting things about the source code itself in there, anything 
user-related needs to go in the official docs.

> I'd be happy to officially review this if it gets to that.

Just added you to the wiki in that role.  I should have a first version 
that's of commit candidate quality ready to go in the next week.  If you 
could make a second pass then and see how I did implementing your 
suggestions, that would be great.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: Simple postgresql.conf wizard

From
"Joshua D. Drake"
Date:
On Fri, 2008-12-05 at 17:27 -0500, Greg Smith wrote:
> On Fri, 5 Dec 2008, Nathan Boley wrote:
> 
> > - all classes ( 58, 135, 205 ) are 'old-style' classes.  I dont see
> > any reason to use classic classes ( unless Python 2.1 is a support
> > goal? )
> 
> I'm not targeting anything older than 2.4, as that's the oldest version I 
> have installed anywhere.  2.4 is still an important target because it's 
> the version that comes with the very popular RHEL4.

And RHEL5

Joshua D. Drake

-- 
PostgreSQL Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
 



Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
> Looking at eqjoinsel I think it could be improved algorithmically if we keep
> the mcv list in sorted order, even if it's just binary sorted order. But I'm
> not sure what else uses those values and whether the current ordering is
> significant. I'm also not sure it's the only O(n^2) algorithm there and
> there's no algorithmic gain unless they're all knocked down.

The current code seems to be trying to handle pathological cases where:

(1) the operator for which it is invoked doesn't really represent
equality and/or
(2) the type has an equality operator but no comparison operator.

Those are important cases, but maybe we could create an alternative
version for the fairly common situation where neither holds?

...Robert


Re: Simple postgresql.conf wizard

From
"Robert Haas"
Date:
I profiled this on CVS HEAD.  First, I set this up:

> postgres=# create table tk as select
>   random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r
>   from generate_series(1,1000);
> postgres=# insert into tk (select * from tk);
> postgres=# insert into tk (select * from tk);
> postgres=# insert into tk (select
>   random()::text||random()::text||random()::text||random()::text||random()::text||random()::text as r
>   from generate_series(1,2000));
> postgres=# alter table tk alter r set statistics 1000;
> postgres=# analyze tk;

Then I profiled a backend that executed the following query 10x
(uparrow-enter 10 times from psql):

> postgres=# explain select count(*) from (select * from tk as k, tk as l, tk as m, tk as n, tk as o, tk as p
>            where k.r=l.r and k.r=m.r and k.r=n.r and k.r=o.r and k.r=p.r) as x;

Results:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 77.73      1.92     1.92     5500     0.00     0.00  pglz_decompress
  6.07      2.07     0.15      250     0.00     0.00  eqjoinsel
  1.62      2.11     0.04  2009500     0.00     0.00  varstr_cmp
  1.21      2.14     0.03  3502500     0.00     0.00  datumCopy
  1.21      2.17     0.03     3500     0.00     0.00  deconstruct_array
  1.21      2.20     0.03                             MemoryContextContains

Full results attached.

...Robert


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
[ a bit off-topic for the thread, but ... ]

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> I'll attach the query and plan.  You'll note that the query looks a
> little odd, especially all the (1=1) tests.

FWIW, it would be better to use "TRUE" as a placeholder in your
generated queries.  I don't suppose this would make a huge percentage
difference in such complicated queries, but in and of itself there are
a lot of cycles spent to parse "1=1" and then reduce it to constant TRUE.
        regards, tom lane
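
As a tiny sketch of what that looks like in a query generator (hypothetical
Python, not Kevin's actual framework; the condition strings are made up), the
idea is just to emit TRUE wherever the generator would otherwise drop in a
neutral 1=1:

def build_where(conditions):
    # Emit TRUE rather than the old 1=1 placeholder when there is
    # nothing to filter on, so the server parses nothing it then has
    # to constant-fold away.
    if not conditions:
        return "WHERE TRUE"
    return "WHERE " + " AND ".join(conditions)

# build_where([])                   -> 'WHERE TRUE'
# build_where(["a = %s", "b > %s"]) -> 'WHERE a = %s AND b > %s'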


Re: Simple postgresql.conf wizard

From
"Mark Wong"
Date:
On Tue, Dec 2, 2008 at 2:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Smith <gsmith@gregsmith.com> writes:
>> ... where the "Power Test" seems to oscillate between degrees of good and bad
>> behavior seemingly at random.
>
> Are any of the queries complicated enough to trigger GEQO planning?

Is there a debug option that we could use to see?

Regards,
Mark


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
"Mark Wong" <markwkm@gmail.com> writes:
> On Tue, Dec 2, 2008 at 2:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Are any of the queries complicated enough to trigger GEQO planning?

> Is there a debug option that we could use to see?

Well, you could set geqo=off and see if the behavior changes, but
it'd be easier just to count how many tables are in each query ...
        regards, tom lane


Re: Simple postgresql.conf wizard

From
"Kevin Grittner"
Date:
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> [ a bit off-topic for the thread, but ... ]
> 
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> I'll attach the query and plan.  You'll note that the query looks a
>> little odd, especially all the (1=1) tests.
> 
> FWIW, it would be better to use "TRUE" as a placeholder in your
> generated queries.  I don't suppose this would make a huge percentage
> difference in such complicated queries, but in and of itself there are
> a lot of cycles spent to parse "1=1" and then reduce it to constant TRUE.
Thanks, I'll put in a request for enhancement for our framework. 
(Not all databases we support handle boolean literals, so we need a
few lines in our plugin layer.)
In case anyone cares in terms of interpreting the timings I posted, on
the server where I just tested this change, the average plan time
dropped from 65.0 ms to 63.7 ms -- a 2% improvement.  Eliminating
"pretty" whitespace shaved off another 0.2 ms, or 0.3%.
So, worth doing on our end as a tuning measure, but not a significant
distortion in terms of the issues discussed on the thread.
-Kevin


Re: Simple postgresql.conf wizard

From
"Mark Wong"
Date:
On Mon, Dec 8, 2008 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Mark Wong" <markwkm@gmail.com> writes:
>> On Tue, Dec 2, 2008 at 2:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Are any of the queries complicated enough to trigger GEQO planning?
>
>> Is there a debug option that we could use to see?
>
> Well, you could set geqo=off and see if the behavior changes, but
> it'd be easier just to count how many tables are in each query ...

Sorry for the delay in responding, here's the queries and the number
of tables in each:

Q1:  1
Q2:  2
Q3:  3
Q4:  1
Q5:  6
Q6:  1
Q7:  6 (5 unique) + 1 temp table
Q8:  8 (7 unique) + 1 temp table
Q9:  6 + 1 temp table
Q10: 4
Q11: 3
Q12:  2
Q13:  3 + 1 temp table
Q14:  2
Q15:  5
Q16:  2
Q17:  2
Q18:  3
Q19:  2
Q20:  2
Q21:  4
Q22:  3 + 1 temp table

Regards,
Mark


Re: Simple postgresql.conf wizard

From
Tom Lane
Date:
"Mark Wong" <markwkm@gmail.com> writes:
> On Mon, Dec 8, 2008 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Mark Wong" <markwkm@gmail.com> writes:
>>> On Tue, Dec 2, 2008 at 2:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Are any of the queries complicated enough to trigger GEQO planning?

> Sorry for the delay in responding, here's the queries and the number
> of tables in each:
> [ worst case is ]
> Q8:  8 (7 unique) + 1 temp table

Well, geqo_threshold is 12 by default, so that theory loses ...
        regards, tom lane


pgtune: postgresql.conf wizard

From
Greg Smith
Date:
pgtune is now on pgFoundry:  http://pgfoundry.org/projects/pgtune/
I just released an update to there, and attached here for the archives is 
that same code.

Progress since the last code drop to this list:

-All of the necessary integer handling was extracted from guc.c and 
implemented, in particular memory unit handling that matches the server 
for input and SHOW-style output.  One thing that already allows is 
sorting out cases like:

Input:  shared_buffers='32768'
tuned:  shared_buffers='256MB'

Notice that those are the same value, so pgtune leaves that setting alone. 
This will make round trips through pgtune much less disruptive.  And all 
it should take to eventually add 8.1 compatibility (before memory units 
like kB worked) is turning off my SHOW-style implementation and just 
working with the integer values.
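
For anyone curious what that round trip involves, here is a rough sketch of
the idea -- not the code actually extracted from guc.c -- assuming the
default 8kB block size for a bare shared_buffers integer:

KB_PER_UNIT = {'kB': 1, 'MB': 1024, 'GB': 1024 * 1024}

def setting_to_kb(value, bare_unit_kb=8):
    # Parse '32768', '512kB', '256MB', ... into plain kB.  A bare integer
    # is taken as 8kB blocks, the default shared_buffers unit.
    for unit, kb in KB_PER_UNIT.items():
        if value.endswith(unit):
            return int(value[:-len(unit)]) * kb
    return int(value) * bare_unit_kb

def kb_to_show(kb):
    # Render the way SHOW does: the largest unit that divides evenly.
    if kb % (1024 * 1024) == 0:
        return '%dGB' % (kb // (1024 * 1024))
    if kb % 1024 == 0:
        return '%dMB' % (kb // 1024)
    return '%dkB' % kb

# setting_to_kb('32768') == setting_to_kb('256MB') == 262144, and both
# render back as '256MB', so the two spellings compare equal and the
# line can be left untouched.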
-Lots of code cleanup based on all the hiccups reported back to me

Primary open issues at this point:
-Update the model to limit shared memory use on Windows
-Degrade more gracefully on low memory systems
-Optional module to estimate shared memory use and output sysctl 
suggestions

The next time I get a block of time to work on this I'll swat all those, 
getting past all the memory unit stuff was the main thing I've been stuck 
on recently.  There are a few TODO items tagged in the code to finish 
cleaning up too.  I'm on track to have everything I wanted in a 1.0 (only 
targeting 8.4) ready by Feb 1 in time for beta.

Things I was hoping for some input on:
-Using default_stats_target=100 doesn't seem controversial anymore, which 
makes me wonder if it makes sense to restore the original DW suggestion 
of 400 that Josh made?
-There was some talk about making constraint_exclusion have an additional 
smart mode aimed at inheritance.  Not sure where that's at, but it would 
impact the model were it available.

I also made a number of Python style improvements based on review from 
Nathan.  If you're not interested in that topic you should certainly stop 
reading here as that's all I cover below.

On Fri, 5 Dec 2008, Nathan Boley wrote:

> - The windows specific try block ( line 16 ) raises a ValueError...

I rewrote that whole mess based on all the feedback I got and I hope this 
interface is much more robust.

> - from ctypes import * ( 18 ) makes the block difficult to read and
> pollutes the namespace.

Noted, will try to reduce what it calls when I do my final Windows 
testing.  Am not sure exactly what it needs so I won't know if I broke it 
trying to reduce the import use until then.

> - all classes ( 58, 135, 205 ) are 'old-style' classes.

Converted to new style ones.

> - The doc strings ( 59, 136, 206 ) don't follow standard conventions,
> described here http://www.python.org/dev/peps/pep-0257/.

Improved doc strings are on my list for later cleanup, haven't gotten to 
it yet.

> - Tuple unpacking doesn't require the tuple boundaries ( 446 and
> others ). ie, options, args = ReadOptions()  works.

Fixed a bunch of those, if you notice any I missed please point them out.

> overloading __str__ and __repr__ are the python ways to return string
> representations of an object

Converted my toString call to use __str__ and str().

> Have you considered the use of properties 
> http://www.python.org/download/releases/2.2.3/descrintro/#property ?

I wasn't fully aware of properties before.  Now that I am, I find I 
dislike them.  For this type of app, there doesn't seem to be much benefit 
for the extra code I'd put in.

The main reason I provided those getter functions rather than having 
things directly reference the object's fields was so I could change the 
underlying implementation without breaking the API.  I didn't really care 
about making things private.  I now realize that I can change the API and 
make the caller still think they're directly accessing the field with 
properties, so I just made most things pull right from the fields now. 
That cut out some code that was only supporting that poor data hiding.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: pgtune: postgresql.conf wizard

From
Josh Berkus
Date:
Greg,

> Things I was hoping for some input on:
> -Using default_stats_target=100 doesn't seem controversial anymore, 
> which makes me wonder if it makes sense to restore the original DW 
> suggestion of 400 that Josh made?

I'm going to back off from this.  Following our discussion, I did some 
testing on DWs belonging to my clients.  The amount of time required to 
ANALYZE 20m rows of a large text field at 400 is significant, as is the 
bulk-out of the stats table.

Instead, I'm writing a script which increases stats for all *indexed* 
columns under a named size to a user-specified level.  When it's ready, 
I'll add it to pgTune.
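
For reference, a rough sketch of that idea (not Josh's actual script): it
interprets "under a named size" as a limit on table size in bytes, assumes
psycopg2 is available, and skips the identifier quoting a real version would
need:

import psycopg2

def raise_indexed_stats(dsn, max_table_bytes, target):
    # Bump the statistics target on every indexed column of every user
    # table smaller than max_table_bytes.
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    cur.execute("""
        SELECT DISTINCT n.nspname, c.relname, a.attname
          FROM pg_index i
          JOIN pg_class c     ON c.oid = i.indrelid
          JOIN pg_namespace n ON n.oid = c.relnamespace
          JOIN pg_attribute a ON a.attrelid = c.oid
                             AND a.attnum = ANY (i.indkey)
         WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
           AND pg_relation_size(c.oid) < %s
    """, (max_table_bytes,))
    for schema, table, column in cur.fetchall():
        cur.execute('ALTER TABLE %s.%s ALTER COLUMN %s SET STATISTICS %d'
                    % (schema, table, column, target))
    conn.commit()

# raise_indexed_stats('dbname=mydb', 10 * 1024 ** 3, 100)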

So for DW, go for 100.  Not 400, though.

--Josh Berkus