Thread: Volunteer to build a configuration tool

Volunteer to build a configuration tool

From
"Campbell, Lance"
Date:

I am a Java Software architect, DBA, and project manager for the University of Illinois, Department of Web Services.  We use PostgreSQL to serve about 2 million pages of dynamic content a month; everything from calendars, surveys, forms, discussion boards, RSS feeds, etc.  I am really impressed with this tool.

 

The only major problem area I have found where PostgreSQL is really lacking is in “what should my initial configuration settings be?”  I realize that there are many elements that can impact a DBA’s specific database settings, but it would be nice to have a “configuration tool” that would get someone up and running with reasonable settings from the start.

 

This is my idea:

 

A JavaScript HTML page that would have some basic questions at the top:

1) How much memory do you have?

2) How many connections will be made to the database?

3) What operating system do you use?

4) Etc…

 

Next, the person would press a “generate” button found below the questions.  The JavaScript HTML page would then generate content for two iframes at the bottom of the page.  One iframe would contain the contents of the postgresql.conf file, with settings tailored more to the individual’s needs than the standard default file.  The second iframe would contain the operating-system settings one should consider using.
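
Roughly, the generate logic might look something like the sketch below.  The form field ids, the iframe id, and the formulas are only placeholders to show the idea; the real rules would come from this list and the documentation.

// Minimal sketch of the "generate" handler.  Assumes the page has
// <input id="memMB">, <input id="maxConn">, <select id="os"> and
// <iframe id="confOut"> -- all of these names are illustrative only.
function generateConf() {
  var memMB   = parseInt(document.getElementById('memMB').value, 10);
  var maxConn = parseInt(document.getElementById('maxConn').value, 10);
  var os      = document.getElementById('os').value;

  // Placeholder formulas, not recommendations.
  var lines = [
    '# generated for ' + os,
    'max_connections = ' + maxConn,
    'shared_buffers = ' + Math.round(memMB / 4) + 'MB',
    'effective_cache_size = ' + Math.round(memMB * 3 / 4) + 'MB',
    'work_mem = ' + Math.max(1, Math.round(memMB / (4 * maxConn))) + 'MB'
  ];

  // Write the generated postgresql.conf text into the first iframe.
  var doc = document.getElementById('confOut').contentWindow.document;
  doc.open();
  doc.write('<pre>' + lines.join('\n') + '</pre>');
  doc.close();
}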

 

My web team would be very happy to develop this for the PostgreSQL project.  A tool like this would have saved us a lot of time when we were getting started.  I am willing to make this a very high priority for my team.

 

Thanks,

 

Lance Campbell

Project Manager/Software Architect

Web Services at Public Affairs

University of Illinois

217.333.0382

http://webservices.uiuc.edu

 

Re: [DOCS] Volunteer to build a configuration tool

From
"Campbell, Lance"
Date:
Mario,
The JavaScript configuration tool I proposed would not be in the install
of PostgreSQL.  It would be an HTML page.  It would be part of the HTML
documentation or it could be a separate HTML page that would be linked
from the HTML documentation.

Thanks,

Lance Campbell
Project Manager/Software Architect
Web Services at Public Affairs
University of Illinois
217.333.0382
http://webservices.uiuc.edu

-----Original Message-----
From: Mario Gonzalez [mailto:gonzalemario@gmail.com]
Sent: Monday, June 18, 2007 10:16 AM
To: Campbell, Lance
Cc: pgsql-docs@postgresql.org; pgsql-performance@postgresql.org
Subject: Re: [DOCS] Volunteer to build a configuration tool

On 18/06/07, Campbell, Lance <lance@uiuc.edu> wrote:
>
> Next, the person would press a "generate" button found below the
> questions.  The JavaScript HTML page would then generate content for two
> iframes at the bottom of the page.  One iframe would contain the contents
> of the postgresql.conf file, with settings tailored more to the
> individual's needs than the standard default file.  The second iframe
> would contain the operating-system settings one should consider using.
>

  I think it could be a great help to newbies.  IMVHO a bash script using
dialog could be better than a JavaScript file; there are many
administrators with no graphical browser or with JavaScript disabled.

>

--
http://www.advogato.org/person/mgonzalez/

Re: [DOCS] Volunteer to build a configuration tool

From
"Y Sidhu"
Date:


EXCELLENT idea Lance.

--
Yudhvir Singh Sidhu
408 375 3134 cell

Re: Volunteer to build a configuration tool

From
James Neethling
Date:

Hi Lance,

I agree that a page that can generate a base configuration file is an
excellent way to get a system administrator halfway to a good
configuration.  We've recently gone through the process of configuring a
machine, and testing and benchmarking the various configuration details is
a time-consuming task.

My thoughts:
Using the browser is a great idea as a universal platform.  I can
foresee a problem in that some users won't have GUI access to the
machine that they are setting up.  I don't have much direct experience in
this field, but I suspect that a great number of installations happen
'headless'?  This can easily be circumvented by hosting the configuration
builder on a public internet site, possibly postgresql.org?

Also, JavaScript isn't the easiest language for making all the
decisions that need to be made for the various configuration options.  Would
it not be a better idea to host the configuration builder centrally,
possibly on postgresql.org, and have the documentation reference it,
including the docs that come packaged with PostgreSQL (README, INSTALL
documentation)?  This would mean that you wouldn't be able to package
the configuration builder, but you would be able to implement more
application logic and more complex decision making in a hosted
application.  Of course, I have no idea of the skills that your team
already has :)



To add ideas: perhaps a more advanced tool would be able to add a comment
indicating a suggested range for each particular setting.  For example,
with 2GB of RAM, it chooses a work_mem of, say, 768MB, with a comment
indicating a suggested range of 512MB - 1024MB.
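
Something like that would be easy to layer onto the generator; a rough JavaScript sketch (the helper name and the example numbers are made up purely to illustrate):

// Emit a setting together with a comment giving its suggested range.
function settingWithRange(name, value, low, high) {
  return name + ' = ' + value + '   # suggested range: ' + low + ' - ' + high;
}
// e.g. settingWithRange('work_mem', '768MB', '512MB', '1024MB')
// returns "work_mem = 768MB   # suggested range: 512MB - 1024MB"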


Thanks for taking the time to put this together and for offering the
services of your team.

Kind regards,
James


Re: Volunteer to build a configuration tool

From
Greg Smith
Date:
On Mon, 18 Jun 2007, Campbell, Lance wrote:

> The postgresql.conf settings would be tailored more to the individuals
> needs than the standard default file.  The second Iframe would contain
> the default settings one should consider using with their operating
> system.

I'd toyed with making a JavaScript-based tool for this but concluded it
wasn't ever going to be robust enough for my purposes.  It wouldn't hurt
to have it around though, as almost anything is an improvement over the
current state of affairs for new users.

As far as prior art goes here, there was an ambitious tool driven by Josh
Berkus called Configurator that tried to address this need but never got
off the ground; you might want to swipe ideas from it.  See
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/configurator/configurator/ for
some documents/code and
http://pgfoundry.org/docman/index.php?group_id=1000106 for a handy
OpenOffice spreadsheet.

If you want this to take off as a project, make sure you can release the
code under a free software license compatible with the PostgreSQL project,
so others can contribute to it and it can be assimilated by the core
project if it proves helpful.  I know I wouldn't spend a minute working on
this if that's not the case.

I'd suggest you try to get the basic look fleshed out with some
reasonable values for the parameters, then release the source and let
other people nail down the parts you're missing.  Don't get stressed about
making sure you have a good value to set for everything before releasing a
beta; it's a lot easier for others to come in and help fix a couple of
parameters once the basic framework is in place.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: Volunteer to build a configuration tool

From
david@lang.hm
Date:
One thing to point out to people about this idea is that nothing says
this page needs to be served via a web server.  If all the calculations are
done in JavaScript, this could be a local file that you open with a
browser.

Do any of the text-mode browsers implement JavaScript?  If so, then you
have an answer even for the deeply buried, isolated headless servers.

David Lang


Re: Volunteer to build a configuration tool

From
Steve Atkins
Date:
On Jun 18, 2007, at 4:09 PM, david@lang.hm wrote:

> One thing to point out to people about this idea is that nothing
> says this page needs to be served via a web server.  If all the
> calculations are done in JavaScript, this could be a local file that
> you open with a browser.
>
> Do any of the text-mode browsers implement JavaScript?  If so, then
> you have an answer even for the deeply buried, isolated headless
> servers.

It doesn't really matter.

The implementation is likely to be trivial, and could be independently
knocked out by anyone in their favorite language in a few hours.

The tricky bits are going to be defining the problem and creating the
algorithm to do the maths from input to output.

If that's so, the language or platform the proof-of-concept code is
written for isn't that important, as it's likely to be portable to
anything else without too much effort.

But the tricky bits seem quite tricky (and the first part, defining the
problem, is something where someone developing it on their
own, without some discussion with other users and devs
could easily end up way off in the weeds).

Cheers,
   Steve



Re: Volunteer to build a configuration tool

From
Greg Smith
Date:
On Mon, 18 Jun 2007, david@lang.hm wrote:

> Do any of the text-mode browsers implement JavaScript?

http://links.twibright.com/

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Volunteer to build a configuration tool

From
Ray Stell
Date:
On Mon, Jun 18, 2007 at 04:35:11PM -0700, Steve Atkins wrote:
>
> The tricky bits are going to be defining the problem and creating the
> algorithm to do the maths from input to output.


Why not methodically discuss the algorithms on pgsql-performance,
thus improving the chance of being on target up front?  Plus, us newbies
get to see what you are thinking, thus expanding our universe.  I know I'd
read every word.

Thanks for doing this, btw.

Re: Volunteer to build a configuration tool

From
Mike Benoit
Date:
It would be cool if someone started a generic configuration+benchmark
utility that could be used with virtually any software. Something like
this:

1. Create a configuration file parser for your specific application, be
it PostgreSQL, MySQL, Apache, whatever.

2. Create a min/max or X,Y,Z configuration option file that determines
which options to try, i.e.:

shared_buffers = 1000-20000 [1000]   // 1000 is the increment
wal_buffers = 8,16,32
...

3. Create start/stop scripts for the specific application

4. Create a benchmark script for the application that returns relevant
metrics.  In PGSQL's case, it would probably be tied in to pgbench.  In
Apache's case, ab.  This utility would of course need to know how to read
the metrics to determine what is "best".

5. Run the utility. Ideally it would use some sort of genetic algorithm
to benchmark the application initially to get base numbers, then
one-by-one apply the different configuration options and re-run the
benchmark. It would output the metrics for each run and once it is done,
pick the best run and let you know what those settings are.

I don't think something like this would be very difficult at all to
write, and it would be modular enough to work for virtually any
application. For a database it would take a while to run depending on
the benchmark script, but even that you could have a "fast" and "slow"
benchmark script that could be easily run when you first install
PostgreSQL.  This way, too, you're not worrying about how much memory the
system has, or how many disks it has, etc.  The system will figure
out the best possible settings for a specific benchmark.
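
A bare-bones sweep along those lines might look like the Node-style JavaScript sketch below.  The script names, the option table, and the way the metric is read are all invented here just to show the shape of it; a real tool would need far more care.

// Sweep each option one at a time, re-run the benchmark, keep the best result.
// Assumes ./start.sh, ./stop.sh and ./bench.sh exist and that bench.sh prints
// a single number (e.g. TPS) on stdout -- all of that is hypothetical.
var execSync = require('child_process').execSync;
var fs = require('fs');

var options = {
  shared_buffers: ['1000', '5000', '10000', '20000'],
  wal_buffers: ['8', '16', '32']
};

function runOnce(name, value) {
  // Naive: append the setting (the last occurrence wins in postgresql.conf).
  // A real tool would rewrite the file and restore the best value per option.
  fs.appendFileSync('postgresql.conf', '\n' + name + ' = ' + value + '\n');
  execSync('./stop.sh && ./start.sh');
  return parseFloat(execSync('./bench.sh').toString());
}

var best = {};
Object.keys(options).forEach(function (name) {
  var bestValue = null, bestScore = -Infinity;
  options[name].forEach(function (value) {
    var score = runOnce(name, value);
    console.log(name + ' = ' + value + ' -> ' + score);
    if (score > bestScore) { bestScore = score; bestValue = value; }
  });
  best[name] = bestValue;
});
console.log('best settings found:', best);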

Not to mention people could easily take a SQL log of their own
application running, and use that as the benchmark to get "real world"
numbers.

Any other sort of configuration "suggestion" utility will always face
the question of what to recommend.  How much data do you try to get,
and what can be determined from that data to get the best settings?  Is
it really going to be that much better than the default, at least enough
better to warrant the work and effort put into it?

--
Mike Benoit <ipso@snappymail.ca>


Re: Volunteer to build a configuration tool

From
Greg Smith
Date:
On Wed, 20 Jun 2007, Mike Benoit wrote:

> It would be cool if someone started a generic configuration+benchmark
> utility that could be used with virtually any software.

It would be cool.  It would also be impossible.

> Create a benchmark script for the application that returns relevant
> metrics. In PGSQL's case, it would be tied in to PG bench probably. In
> Apache's case AB. This utility would of course need to know how to read
> the metrics to determine what is "best".

The usual situation in these benchmarks is that you get parameters that
adjust along a curve where there's a trade-off between, say, total
throughput and worst-case latency.  Specifying "best" here would require a
whole specification language if you want to model how real tuning efforts
work.  The AB case is a little simpler, but for PostgreSQL you'd want
something like "With this database and memory sizing, I want the best
throughput possible where maximum latency is usually <5 seconds with 1-30
clients running this transaction, while still maintaining at least 400 TPS
with up to 100 clients, and the crash recovery time can't take more than
10 minutes".  There are all sorts of local min/max situations and
non-robust configurations an automated tool will put you into if you don't
force an exhaustive search by being very specific like this.

> I don't think something like this would be very difficult at all to
> write

Here I just smile and say that proves you've never tried to write one :)
It's a really hard problem that gets harder the more you poke at it.
There's certainly lots of value to writing a utility that automatically
tests out multiple parameter values in a batch and compares the results.
If you're not doing that now, you should consider scripting something up
that does.  Going beyond that to having it pick the optimal parameters
more automatically would take AI much stronger than just a genetic
algorithm approach.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Volunteer to build a configuration tool

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> On Wed, 20 Jun 2007, Mike Benoit wrote:
>> I don't think something like this would be very difficult at all to
>> write

> Here I just smile and say that proves you've never tried to write one :)

I'm with Greg on this.  It's not that easy to optimize in a
multi-parameter space even if all conditions are favorable, and they
never are.

I think what would be much more useful in the long run is some
serious study of the parameters themselves.  For instance,
random_page_cost is a self-admitted oversimplification of reality.
We know that good settings for it depend critically on how large
your DB is relative to your RAM; which means there are at least two
parameters there, but no one's done any serious thinking about how
to disentangle 'em.

            regards, tom lane

Re: Volunteer to build a configuration tool

From
Greg Smith
Date:
On Wed, 20 Jun 2007, Tom Lane wrote:

> I think what would be much more useful in the long run is some serious
> study of the parameters themselves.  For instance, random_page_cost is a
> self-admitted oversimplification of reality.

If I could figure out who would sponsor such a study, that's what I'd be
doing right now.  I have studies on many of the commit-related parameters
that I'll have ready in another few days; those are straightforward to map out.
But you know what I have never found?  A good benchmark that demonstrates
how well complicated queries perform to run studies on things like
random_page_cost against.  Many of the tuning knobs on the query optimizer
seem very opaque to me so far, and I'm not sure how to put together a
proper test to illuminate their operation and map out their useful range.

Here's an example of one of the simplest questions in this area to
demonstrate things I wonder about.  Let's say I have a properly indexed
database of some moderate size such that you're in big trouble if you do a
sequential scan.  How can I tell if effective_cache_size is in the right
ballpark so it will do what I want to effectively navigate that?  People
back into a setting for that parameter right now based on memory in their
system, but I never see anybody going "since your main table is X GB
large, and its index is Y GB, you really need enough memory to set
effective_cache_size to Z GB if you want queries/joins on that table to
perform well".

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Volunteer to build a configuration tool

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> On Wed, 20 Jun 2007, Tom Lane wrote:
>> I think what would be much more useful in the long run is some serious
>> study of the parameters themselves.  For instance, random_page_cost is a
>> self-admitted oversimplification of reality.

> If I could figure out who would sponsor such a study that's what I'd be
> doing right now.

Hmm ... Sun?  EDB?  Greenplum?  [I'm afraid Red Hat is not likely to
step up to the plate right now, they have other priorities]

> Many of the tuning knobs on the query optimizer
> seem very opaque to me so far,

At least some of them are demonstrably broken.  The issue here is to
develop a mental model that is both simple enough to work with, and
rich enough to predict real-world behavior.

> Here's an example of one of the simplest questions in this area to
> demonstate things I wonder about.  Let's say I have a properly indexed
> database of some moderate size such that you're in big trouble if you do a
> sequential scan.  How can I tell if effective_cache_size is in the right
> ballpark so it will do what I want to effectively navigate that?

As the guy who put in effective_cache_size, I'd say it's on the broken
side of the fence.  Think about how to replace it with a more useful
parameter, not how to determine a good value for it.  "Useful" means
both "easy to determine a value for" and "strong for estimating query
costs", which are contradictory to some extent, but that's the problem
to be solved --- and effective_cache_size doesn't really win on either
metric.

To me, the worst catch-22 we face in this area is that we'd like the
optimizer's choices of plan to be stable and understandable, but the
real-world costs of queries depend enormously on short-term conditions
such as how much of the table has been sucked into RAM recently by
other queries.  I have no good answer to that one.

            regards, tom lane

Re: Volunteer to build a configuration tool

From
PFC
Date:
> To me, the worst catch-22 we face in this area is that we'd like the
> optimizer's choices of plan to be stable and understandable, but the
> real-world costs of queries depend enormously on short-term conditions
> such as how much of the table has been sucked into RAM recently by
> other queries.  I have no good answer to that one.

    Yeah, there is currently no way to tell the optimizer things like:

    - this table/portion of a table is not frequently accessed, so it won't
be in the cache, so please use low-seek plans (like bitmap index scan)
    - this table/portion of a table is used all the time, so high-seek-count
plans like index scans or nested loops can be used, since everything is in
RAM

    Except planner hints (argh), I see no way to give this information to the
machine... since it's mostly in the mind of the DBA.  Maybe a per-table
"cache temperature" param (hot, warm, cold), but what about the log table,
the end of which is cached, but not the old records?  It's messy.

    Still, PG does a pretty excellent job most of the time.

Re: Volunteer to build a configuration tool

From
Greg Smith
Date:
On Wed, 20 Jun 2007, PFC wrote:

> Except planner hints (argh) I see no way to give this information to the
> machine... since it's mostly in the mind of the DBA.

And the mind of the DBA has a funny way of being completely wrong some
days about what's really happening under the hood.

> Maybe a per-table "cache temperature" param (hot, warm, cold), but what
> about the log table, the end of which is cached, but not the old
> records?  It's messy.

One of the things that was surprising to me when I started looking at the
organization of the PostgreSQL buffer cache is how little gross
information about its contents is available.  I kept expecting to find a
summary section where you could answer questions like "how much of the
cache currently has information about index/table X?" used as an input to
the optimizer.  I understand that the design model expects much of this is
unknowable due to the interaction with the OS cache, and in earlier
versions you couldn't make shared_buffers big enough for its contents to
be all that interesting, so until recently this wasn't worth collecting.

But in the current era, where it's feasible to have multi-GB caches
efficiently managed by PG and one can expect processor time is relatively
cheap, it seems to me one way to give a major boost to the optimizer is to
add some overhead to buffer cache management so it collects such
information.  When I was trying to do a complete overhaul on the
background writer, the #1 problem was that I had to assemble my own
statistics on what was inside the buffer cache as it was scanned, because
a direct inspection of every buffer is the only way to know things like
what percentage of the cache is currently dirty.

I can't figure out if I'm relieved or really worried to discover that Tom
isn't completely sure what to do with effective_cache_size either.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Volunteer to build a configuration tool

From
"Campbell, Lance"
Date:
It is amazing how many times you can read something before it actually
sinks in.

There seem to be two possible approaches to optimizing PostgreSQL 8.2:

File caching approach:
This approach is based on the fact that the OS will cache the necessary
PostgreSQL files.  The key here is to set the effective_cache_size value
as high as you think the OS has memory available to cache the files.  This
approach would need the value of shared_buffers to be relatively low;
otherwise you are in a sense storing the data twice.  One would also have
to make sure that work_mem is not too high.  Since the files would be
cached by the OS, work_mem could be relatively low.  This is an ideal
approach if you have a dedicated server, since there would be no other
software using memory or accessing files that the OS would try to cache.

Memory driven approach:
In this approach you want to create a large value for shared_buffers.
You are relying on shared_buffers to hold the most commonly accessed
disk blocks.  The value for effective_cache_size would be relatively
small since you are not relying on the OS to cache files.  This seems
like it would be the ideal situation if you have other applications
running on the box.  By setting shared_buffers to a high value you are
guaranteeing memory available to PostgreSQL (this assumes the other
applications did not suck up so much memory that your OS starts using
virtual memory).  This also seems more like how Oracle approaches things.

Do I understand the possible optimization paths correctly?  The only
question I have about this approach is: if I use the "memory driven
approach", since effective_cache_size would be small, I would assume I
would need to fiddle with random_page_cost, since there would be no way
for PostgreSQL to know I have a well-configured system.

If everything I said is correct then I agree "Why have
effective_cache_size?"  Why not just go down the approach that Oracle
has taken and require people to rely more on shared_buffers and the
general memory driven approach?  Why rely on the disk caching of the OS?
Memory is only getting cheaper.

Thanks,

Lance Campbell
Project Manager/Software Architect
Web Services at Public Affairs
University of Illinois
217.333.0382
http://webservices.uiuc.edu



Re: Volunteer to build a configuration tool

From
Heikki Linnakangas
Date:
Campbell, Lance wrote:
> It is amazing how many times you can read something before it actually
> sinks in.
>
> There seems to be two possible approaches to optimizing PostgreSQL 8.2:

Right.

> File caching approach:
> This approach is based on the fact that the OS will cache the necessary
> PostgreSQL files.  The key here is to set the effective_cache_size value
> as high as you think the OS has memory available to cache the files.  This
> approach would need the value of shared_buffers to be relatively low;
> otherwise you are in a sense storing the data twice.  One would also have
> to make sure that work_mem is not too high.  Since the files would be
> cached by the OS, work_mem could be relatively low.  This is an ideal
> approach if you have a dedicated server, since there would be no other
> software using memory or accessing files that the OS would try to cache.

There's no particular danger in setting work_mem too high in this
approach.  In fact, it's more important to avoid too large a work_mem
setting with the other approach, because if you set it too high you can
force the system to swap, while with the "file caching approach" the OS
will just evict some of the cached pages to make room for sorts etc.

> Memory driven approach:
> In this approach you want to create a large value for shared_buffers.
> You are relying on shared_buffers to hold the most commonly accessed
> disk blocks.  The value for effective_cache_size would be relatively
> small since you are not relying on the OS to cache files.

effective_cache_size should be set to the estimated amount of memory
available for caching, *including* shared_buffers. So it should be set
to a similar value in both approaches.

>  This seems
> like it would be the ideal situation if you have other applications
> running on the box.

Actually it's the opposite.  If there are other applications competing for
the memory, it's better to let the OS manage the cache, because it can
make decisions on which pages to keep in cache and which to evict across
all applications.

>  By setting shared_buffers to a high value you are
> guaranteeing memory available to PostgreSQL (this assumes the other
> applications did not suck up so much memory that your OS starts using
> virtual memory).

You're guaranteeing memory available to PostgreSQL, at the cost of said
memory being unavailable from other applications. Or as you point out,
in the worst case you end up swapping.

> Do I understand the possible optimization paths correctly?  The only
> question I have about this approach is: if I use the "memory driven
> approach", since effective_cache_size would be small, I would assume I
> would need to fiddle with random_page_cost, since there would be no way
> for PostgreSQL to know I have a well-configured system.

I don't see how effective_cache_size or the other settings affect
random_page_cost. random_page_cost should mostly depend on your I/O
hardware, though I think it's common practice to lower it when your
database is small enough to fit mostly or completely in cache on the
grounds that random access in memory is almost as fast as sequential access.
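
In configuration-tool terms, that practice could be captured by something as simple as the JavaScript sketch below; the thresholds and values are purely illustrative, not a recommendation.

// Illustrative only: lower random_page_cost when the database is expected to
// fit (mostly) in cache, otherwise leave it at the stock 4.0.
function suggestedRandomPageCost(dbSizeGB, cacheGB) {
  if (dbSizeGB <= cacheGB) return 1.5;       // mostly cached: random ~ sequential
  if (dbSizeGB <= cacheGB * 2) return 2.5;   // partially cached
  return 4.0;                                // disk-bound: keep the default
}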

> If everything I said is correct then I agree "Why have
> effective_cache_size?"  Why not just go down the approach that Oracle
> has taken and require people to rely more on shared_buffers and the
> general memory driven approach?  Why rely on the disk caching of the OS?
> Memory is only getting cheaper.

That has been discussed many times before; search the archives on direct
I/O for previous flamewars on that subject.  In a nutshell, we rely on
the OS to not only do caching for us, but I/O scheduling and readahead
as well. That saves us a lot of code, and the OS is in a better position
to do that as well, because it knows the I/O hardware and disk layout so
that it can issue the I/O requests in the most efficient way.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Volunteer to build a configuration tool

From
Mark Lewis
Date:
On Wed, 2007-06-20 at 11:21 -0400, Greg Smith wrote:
...
> One of the things that was surprising to me when I started looking at the
> organization of the PostgreSQL buffer cache is how little gross
> information about its contents is available.  I kept expecting to find a
> summary section where you could answer questions like "how much of the
> cache currently has information about index/table X?" used as an input to
> the optimizer.  I understand that the design model expects much of this is
> unknowable due to the interaction with the OS cache, and in earlier
> versions you couldn't make shared_buffers big enough for its contents to
> be all that interesting, so until recently this wasn't worth collecting.
>
> But in the current era, where it's feasible to have multi-GB caches
> efficiently managed by PG and one can expect processor time is relatively
> cheap, it seems to me one way to give a major boost to the optimizer is to
> add some overhead to buffer cache management so it collects such
> information.  When I was trying to do a complete overhaul on the
> background writer, the #1 problem was that I had to assemble my own
> statistics on what was inside the buffer cache as it was scanned, because
> a direct inspection of every buffer is the only way to know things like
> what percentage of the cache is currently dirty.
...

One problem with feeding the current state of the buffer cache to the
planner is that the planner may be trying to prepare a plan which will
execute 10,000 times.  For many interesting queries, the state of the
cache will be very different after the first execution, as indexes and
active portions of tables are brought in.

For that matter, an early stage of query execution could significantly
change the contents of the buffer cache as seen by a later stage of the
execution, even inside a single query.

I'm not saying that inspecting the buffer cache more is a bad idea, but
gathering useful information with the current planner is a bit tricky.

For purposes of idle speculation, one could envision some non-trivial
changes to PG which would make really slick use of this data:

(1) Allow PG to defer deciding whether to perform an index scan or
sequential scan until the moment it is needed, and then ask the buffer
cache what % of the pages from the relevant indexes/tables are currently
cached.

(2) Automatically re-plan prepared queries with some kind of frequency
(exponential in # of executions?  fixed-time?), to allow the plans to
adjust to changes in the buffer-cache.

Besides being hard to build, the problem with these approaches (or any
other approach which takes current temporary state into account) is that
as much as some of us might want to make use of every piece of data
available to make the planner into a super-brain, there are lots of
other folks who just want plan stability.  The more dynamic the system
is, the less predictable it can be, and especially in mission-critical
stuff, predictability matters more.  Tom said it really well in a
recent post:

"To me, the worst catch-22 we face in this area is that we'd like the
optimizer's choices of plan to be stable and understandable, but the
real-world costs of queries depend enormously on short-term conditions
such as how much of the table has been sucked into RAM recently by
other queries.  I have no good answer to that one."

Re: Volunteer to build a configuration tool

From
Greg Smith
Date:
On Wed, 20 Jun 2007, Mark Lewis wrote:

> as much as some of us might want to make use of every piece of data
> available to make the planner into a super-brain, there are lots of
> other folks who just want plan stability.

It's not like it has to be on for everybody.  I look forward to the day
when I could see this:

$ cat postgresql.conf | grep brain
# - Super-brain Query Optimizer -
sbqo = on               # Enables the super-brain
sbqo_reconsider_interval = 5s  # How often to update plans
sbqo_brain_size = friggin_huge  # Possible values are wee, not_so_wee, and

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Volunteer to build a configuration tool

From
Greg Smith
Date:
On Wed, 20 Jun 2007, Campbell, Lance wrote:

> If everything I said is correct then I agree "Why have
> effective_cache_size?"  Why not just go down the approach that Oracle
> has taken and require people to rely more on shared_buffers and the
> general memory driven approach?  Why rely on the disk caching of the OS?

First off, it may help explain the dynamics here if you know that until
fairly recent releases, the PostgreSQL shared_buffers cache had some
performance issues that made it impractical to make it too large.  It
hasn't been that long since relying more heavily on the Postgres cache
became technically feasible.  I think the user community at large is still
assimilating all the implications of that shift, and as such some of the
territory with making the Postgres memory really large is still being
mapped out.

There are also still some issues left in that area.  For example, the
bigger your shared_buffers cache is, the worse the potential is for having
a checkpoint take a really long time and disrupt operations.  There are OS
tunables that can help work around that issue; similar ones for the
PostgreSQL buffer cache won't be available until the 8.3 release.

In addition to all that, there are still several reasons to keep relying
on the OS cache:

1) The OS cache memory is shared with other applications, so relying on it
lowers the average memory footprint of PostgreSQL.  The database doesn't
have to be a pig that constantly eats all the memory up, while still
utilizing it when necessary.

2) The OS knows a lot more about the disk layout and similar low-level
details and can do optimizations that a platform-independent program like
Postgres can't assume are available.

3) There are more people working on optimizing the caching algorithms in
modern operating systems than are coding on this project.  Using that
sophisticated cache leverages their work.

"The Oracle Way" presumes that you've got such a massive development staff
that you can solve these problems better yourself than the community at
large, and then support that solution on every platform.  This is why they
ended up with solutions like raw partitions, where they just put their own
filesystem on the disk and figure out how to make that work well
everywhere.  If you look at trends in this area, at this point the
underlying operating systems have gotten good enough that tricks like that
are becoming marginal.  Pushing more work toward the OS is a completely
viable design choice that strengthens every year.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Volunteer to build a configuration tool

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> On Wed, 20 Jun 2007, Campbell, Lance wrote:
>> If everything I said is correct then I agree "Why have
>> effective_cache_size?"  Why not just go down the approach that Oracle
>> has taken and require people to rely more on shared_buffers and the
>> general memory driven approach?  Why rely on the disk caching of the OS?

> [ reasons why snipped ]

There's another reason for not setting shared_buffers huge, beyond the
good ones Greg listed: the kernel may or may not consider a large
shared-memory segment as potentially swappable.  If the kernel starts
swapping out low-usage areas of the shared-buffer arena, you lose badly:
accessing a supposedly "in cache" page takes just as long as fetching it
from the disk file would've, and if a dirty page gets swapped out,
you'll have to swap it back in before you can write it; making a total
of *three* I/Os expended to get it down to where it should have been,
not one.  So unless you can lock the shared memory segment in RAM, it's
best to keep it small enough that all the buffers are heavily used.
Marginal-use pages will be handled much more effectively in the O/S
cache.

I'd also like to re-emphasize the point about "don't be a pig if you
don't have to".  It would be very bad if Postgres automatically operated
on the assumption that it should try to consume all available resources.
Personally, I run half a dozen postmasters (of varying vintages) on one
not-especially-impressive development machine.  I can do this exactly
because the default configuration doesn't try to eat the whole machine.

To get back to the comparison to Oracle: Oracle can assume that it's
running on a dedicated machine, because their license fees are more than
the price of the machine anyway.  We shouldn't make that assumption,
at least not in the out-of-the-box configuration.

            regards, tom lane

Re: Volunteer to build a configuration tool

From
Dave Page
Date:
Tom Lane wrote:
> There's another reason for not setting shared_buffers huge, beyond the
> good ones Greg listed: the kernel may or may not consider a large
> shared-memory segment as potentially swappable.

Another is that on Windows, shared memory access is more expensive, and
various people have noted that the smallest value for shared_buffers you
can get away with can yield better performance, as it leaves more memory
free for the kernel to use more efficiently.

Regards, Dave.


Re: Volunteer to build a configuration tool

From
Andrew Sullivan
Date:
On Thu, Jun 21, 2007 at 03:14:48AM -0400, Greg Smith wrote:

> "The Oracle Way" presumes that you've got such a massive development staff
> that you can solve these problems better yourself than the community at
> large, and then support that solution on every platform.

Not that Greg is suggesting otherwise, but to be fair to Oracle (and
other large database vendors), the raw partitions approach was also a
completely sensible design decision back when they made it.  In the
late 70s and early 80s, the capabilities of various filesystems were
wildly uneven (read the _UNIX Hater's Handbook_ on filesystems, for
instance, if you want an especially jaundiced view).  Moreover, since
it wasn't clear that UNIX and UNIX-like things were going to become
the dominant standard -- VMS was an obvious contender for a long
time, and for good reason -- it made sense to have a low-level
structure that you could rely on.

Once they had all that code and had made all those assumptions while
relying on it, it made no sense to replace it all.  It's now mostly
mature and robust, and it is probably a better decision to focus on
incremental improvements to it than to rip it all out and replace it
with something likely to be buggy and surprising.  The PostgreSQL
developers' practice of sighing gently every time someone comes along
insisting that threads are keen or that shared memory sucks relies on
the same, perfectly sensible premise: why throw away a working
low-level part of your design to get an undemonstrated benefit and
probably a whole lot of new bugs?

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism.
                --Brad Holland

Re: Volunteer to build a configuration tool

From
"Campbell, Lance"
Date:
Greg,
I have a PostgreSQL database that runs on a dedicated server.  The
server has 24Gig of memory.  What would be the max size I would ever
want to set shared_buffers to if I were relying on the OS for the
disk caching approach?  It seems that no matter how big your dedicated
server is, there would be a top limit to the size of shared_buffers.

Thanks,

Lance Campbell
Project Manager/Software Architect
Web Services at Public Affairs
University of Illinois
217.333.0382
http://webservices.uiuc.edu


Re: Volunteer to build a configuration tool

From
Greg Smith
Date:
On Thu, 21 Jun 2007, Campbell, Lance wrote:

> I have a PostgreSQL database that runs on a dedicated server.  The
> server has 24Gig of memory.  What would be the max size I would ever
> want to set shared_buffers to if I were relying on the OS for the
> disk caching approach?  It seems that no matter how big your dedicated
> server is, there would be a top limit to the size of shared_buffers.

It's impossible to say exactly what would work optimally in this sort of
situation.  The normal range is 25-50% of total memory, but there's no
hard reason for that balance; for all we know your apps might work best
with 20GB in shared_buffers and only a relatively small 4GB left over for
the rest of the OS to use.  Push it way up and see what you get.

This is part of why the idea of an "advanced" mode for this tool is
suspect.  Advanced tuning usually requires benchmarking with as close to
real application data as you can get in order to make good forward
progress.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Volunteer to build a configuration tool

From
Jim Nasby
Date:
On Jun 23, 2007, at 2:28 PM, Greg Smith wrote:
> On Thu, 21 Jun 2007, Campbell, Lance wrote:
>> I have a PostgreSQL database that runs on a dedicated server.  The
>> server has 24Gig of memory.  What would be the max size I would ever
>> want to set shared_buffers to if I were relying on the OS for the
>> disk caching approach?  It seems that no matter how big your dedicated
>> server is, there would be a top limit to the size of shared_buffers.
>
> It's impossible to say exactly what would work optimally in this
> sort of situation.  The normal range is 25-50% of total memory, but
> there's no hard reason for that balance; for all we know your apps
> might work best with 20GB in shared_buffers and only a relatively
> small 4GB left over for the rest of the OS to use.  Push it way up
> and see what you get.
>
> This is part of why the idea of an "advanced" mode for this tool is
> suspect.  Advanced tuning usually requires benchmarking with as
> close to real application data as you can get in order to make good
> forward progress.

Agreed.  EnterpriseDB comes with a feature called "DynaTune" that
looks at things like server memory and sets a best guess at a bunch
of parameters.  Truth is, it works fine for 90% of cases, because
there are just a lot of installations where tuning postgresql.conf
isn't that critical.

The real issue is that the "stock" postgresql.conf is just horrible.
It was originally tuned for something like a 486, but even the recent
changes have only brought it up to the "pentium era" (case in point:
24MB of shared buffers equates to a machine with 128MB of memory,
give or take). Given that, I think an 80% solution would be to just
post small/medium/large postgresql.conf files somewhere.
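
Even that 80% solution could be expressed as a tiny preset table on the proposed page; the sizes and values below are invented purely as an illustration, not tuned recommendations.

// Hypothetical small/medium/large presets keyed on rough machine size.
var presets = {
  small:  { shared_buffers: '32MB',  effective_cache_size: '128MB', work_mem: '1MB'  },
  medium: { shared_buffers: '256MB', effective_cache_size: '1GB',   work_mem: '8MB'  },
  large:  { shared_buffers: '2GB',   effective_cache_size: '8GB',   work_mem: '32MB' }
};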

I also agree 100% with Tom that the cost estimators need serious
work. One simple example: nothing in the planner looks at what
percent of a relation is actually in shared_buffers. If it did that,
it would probably be reasonable to extrapolate that percentage into
how much is sitting in kernel cache, which would likely be miles
ahead of what's currently done.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)



Re: Volunteer to build a configuration tool

From
Gregory Stark
Date:
"Jim Nasby" <decibel@decibel.org> writes:

> The real issue is that the "stock" postgresql.conf is just horrible.  It was
> originally tuned for something like a 486, but even the recent changes have
> only brought it up to the "pentium era" (case in point: 24MB of shared
> buffers equates to a machine with 128MB of memory, give or take).

I think it's more that the stock configuration has to assume it's not a
dedicated box.  Picture someone installing Postgres on their Debian box
because it's required for GnuCash.  Even having 24MB suddenly disappear
from the box is quite a bit.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Volunteer to build a configuration tool

From
Josh Berkus
Date:
Lance,

> I have a PostgreSQL database that runs on a dedicated server.  The
> server has 24Gig of memory.  What would be the max size I would ever
> want to set shared_buffers to if I were relying on the OS for the
> disk caching approach?  It seems that no matter how big your dedicated
> server is, there would be a top limit to the size of shared_buffers.

There's not, actually.  Under some circumstances (mainly Solaris 10 + UFS
on AMD) it can actually be beneficial to have s_b be 80% of RAM and bypass
the FS cache entirely.  This isn't usually the case, but it's not to be
ruled out.

If you're relying on the FS cache and not using direct I/O, though, you
want to keep at least 50% of memory free for use by the cache.  At below
50%, you lose a significant part of the benefit of the cache without
losing the cost of it.  Of course, that assumes that your database is
bigger than RAM; there isn't much need to have either s_b or the f.s.c. be
more than twice the size of your whole database.

In general, setting s_b to 25% of RAM on a dedicated machine, and 10%
(with a max of 512MB) on a shared machine, is a nice safe default which
will do OK for most applications.
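
For what it's worth, that rule maps directly onto the kind of logic Lance's page could use; a small JavaScript sketch (the function is mine, the percentages and the cap come from the paragraph above):

// Safe-default shared_buffers suggestion: 25% of RAM on a dedicated box,
// 10% (capped at 512MB) on a shared one.
function suggestedSharedBuffersMB(totalRamMB, dedicated) {
  if (dedicated) return Math.round(totalRamMB * 0.25);
  return Math.min(512, Math.round(totalRamMB * 0.10));
}
// e.g. suggestedSharedBuffersMB(24576, true) -> 6144 (MB)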

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco