Thread: Volunteer to build a configuration tool
I am a Java software architect, DBA, and project manager for the University of Illinois, Department of Web Services. We use PostgreSQL to serve about 2 million pages of dynamic content a month: calendars, surveys, forms, discussion boards, RSS feeds, and more. I am really impressed with this tool.
The only major area where I have found PostgreSQL really lacking is answering the question “what should my initial configuration settings be?” I realize that many factors influence a DBA’s specific database settings, but it would be nice to have a “configuration tool” that gets someone up and running with reasonable values from the start.
This is my idea:
A JavaScript HTML page that would have some basic questions at the top:
1) How much memory do you have?
2) How many connections will be made to the database?
3) What operating system do you use?
4) Etc…
Next the person would press a “generate” button found below the questions. The JavaScript HTML page would then generate content for two iframes at the bottom of the page. One iframe would contain the contents of the postgresql.conf file, with settings tailored more to the individual’s needs than the standard default file. The second iframe would contain the default settings one should consider using with their operating system.
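[A minimal sketch of the kind of rule-of-thumb mapping such a page might implement, written here in Python for brevity rather than the proposed JavaScript. The input names and the formulas (percent-of-RAM heuristics, a fixed work_mem divisor) are illustrative assumptions, not settings endorsed by the PostgreSQL project.]

# Illustrative sketch only: maps a few questionnaire answers to postgresql.conf
# lines using rough rule-of-thumb formulas. All numbers are assumptions chosen
# for demonstration, not project-endorsed recommendations.

def suggest_conf(total_ram_mb, max_connections, os_name):
    shared_buffers_mb = max(32, total_ram_mb // 4)          # ~25% of RAM (assumed heuristic)
    effective_cache_mb = max(64, (total_ram_mb * 3) // 4)   # ~75% of RAM (assumed heuristic)
    work_mem_mb = max(1, total_ram_mb // (max_connections * 4))
    lines = [
        "# generated from: RAM=%d MB, connections=%d, OS=%s"
        % (total_ram_mb, max_connections, os_name),
        "max_connections = %d" % max_connections,
        "shared_buffers = %dMB" % shared_buffers_mb,
        "effective_cache_size = %dMB" % effective_cache_mb,
        # Comment shows a suggested range, an idea raised later in the thread.
        "work_mem = %dMB  # suggested range: %d-%dMB"
        % (work_mem_mb, max(1, work_mem_mb // 2), work_mem_mb * 2),
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    print(suggest_conf(total_ram_mb=2048, max_connections=100, os_name="Linux"))

[The real page would presumably also emit the operating-system-level suggestions into the second iframe; the helper name and formulas above are hypothetical.]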
My web team would be very happy to develop this for the PostgreSQL project. A configuration tool like this would have saved us a lot of time when we were starting out. I am willing to make this a very high priority for my team.
Thanks,
Lance Campbell
Project Manager/Software Architect
Web Services at Public Affairs
University of Illinois
217.333.0382
http://webservices.uiuc.edu
Mario,
The JavaScript configuration tool I proposed would not be in the install
of PostgreSQL. It would be an HTML page. It would be part of the HTML
documentation or it could be a separate HTML page that would be linked
from the HTML documentation.
Thanks,
Lance Campbell
Project Manager/Software Architect
Web Services at Public Affairs
University of Illinois
217.333.0382
http://webservices.uiuc.edu
-----Original Message-----
From: Mario Gonzalez [mailto:gonzalemario@gmail.com]
Sent: Monday, June 18, 2007 10:16 AM
To: Campbell, Lance
Cc: pgsql-docs@postgresql.org; pgsql-performance@postgresql.org
Subject: Re: [DOCS] Volunteer to build a configuration tool
On 18/06/07, Campbell, Lance <lance@uiuc.edu> wrote:
>
> Next the person would press a button, "generate", found below the
> questions. The JavaScript HTML page would then generate content for
> two Iframes at the bottom of the page. One Iframe would contain the
> contents of the postgresql.conf file. The postgresql.conf settings
> would be tailored more to the individual's needs than the standard
> default file. The second Iframe would contain the default settings
> one should consider using with their operating system.
>
I think it could be a great help to newbies. IMVHO a bash script using
dialog could be better than a JavaScript file. There are many
administrators with no graphical browser or with JavaScript disabled.
>
--
http://www.advogato.org/person/mgonzalez/
EXCELLENT idea Lance.
--
Yudhvir Singh Sidhu
408 375 3134 cell
> This is my idea:
>
> A JavaScript HTML page that would have some basic questions at the top:
> [...]
> My web team would be very happy to develop this for the PostgreSQL
> project. It would have saved us a lot of time by having a
> configuration tool in the beginning. I am willing to make this a very
> high priority for my team.

Hi Lance,

I agree that a page that can generate a base configuration file is an excellent way to get a system administrator halfway to a good configuration. We've recently gone through the process of configuring a machine, and it is a time-consuming task of testing and benchmarking various configuration details.

My thoughts: using the browser is a great idea as a universal platform. I can foresee a problem in that some users won't have GUI access to the machine they are setting up. I don't have much direct experience in this field, but I suspect that a great number of installations happen "headless"? This can easily be circumvented by hosting the configuration builder on a public internet site, possibly postgresql.org. Also, JavaScript isn't the easiest language for encoding all the decisions that need to be made for the various configuration options.

Would it not be a better idea to host a configuration builder centrally, possibly on postgresql.org, and have the documentation reference it, including the docs that come packaged with PostgreSQL (README, INSTALL documentation?). This would mean that you wouldn't be able to package the configuration builder, but you would be able to implement more application logic and more complex decision making in a hosted application. Of course, I have no idea of the skills your team already has :)

To add ideas: perhaps a more advanced tool would be able to add a comment indicating a suggested range for each setting. For example, with 2GB of RAM it chooses a work_mem of, say, 768MB, with a comment indicating a suggested range of 512MB - 1024MB.

Thanks for taking the time to put this together and for offering the services of your team.

Kind regards,
James
On Mon, 18 Jun 2007, Campbell, Lance wrote:

> The postgresql.conf settings would be tailored more to the individual's
> needs than the standard default file. The second Iframe would contain
> the default settings one should consider using with their operating
> system.

I'd toyed with making a JavaScript-based tool for this but concluded it wasn't ever going to be robust enough for my purposes. It wouldn't hurt to have it around though, as almost anything is an improvement over the current state of affairs for new users.

As far as prior art goes, there was an ambitious tool driven by Josh Berkus called Configurator that tried to address this need but never got off the ground; you might want to swipe ideas from it. See http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/configurator/configurator/ for some documents/code and http://pgfoundry.org/docman/index.php?group_id=1000106 for a handy OpenOffice spreadsheet.

If you want this to take off as a project, make sure you can release the code under a free software license compatible with the PostgreSQL project, so others can contribute to it and it can be assimilated by the core project if it proves helpful. I know I wouldn't spend a minute working on this if that's not the case.

I'd suggest you try to get the basic look fleshed out with some reasonable values for the parameters, then release the source and let other people nail down the parts you're missing. Don't get stressed about having a good value for everything before releasing a beta; it's a lot easier for others to come in and help fix a couple of parameters once the basic framework is in place.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
One thing to point out to people about this idea is that nothing says this page needs to be served via a webserver. If all the calculations are done in JavaScript, it could be a local file that you open with a browser.

Do any of the text-mode browsers implement JavaScript? If so, then you have an answer even for the deeply buried, isolated, headless servers.

David Lang

On Mon, 18 Jun 2007, Campbell, Lance wrote:

> I am a Java Software architect, DBA, and project manager for the
> University of Illinois, Department of Web Services. [...]
On Jun 18, 2007, at 4:09 PM, david@lang.hm wrote:

> one thing to point out to people about this idea is that nothing
> says that this page needs to be served via a webserver. If all the
> calculations are done in javascript this could be a local file that
> you open with a browser.
>
> do any of the text-mode browsers implement javascript? if so then
> you have an answer even for the deeply buried isolated headless
> servers.

It doesn't really matter. The implementation is likely to be trivial, and could be independently knocked out by anyone in their favorite language in a few hours. The tricky bits are going to be defining the problem and creating the algorithm to do the maths from input to output.

If that's so, the language or platform the proof-of-concept code is written for isn't that important, as it's likely to be portable to anything else without too much effort. But the tricky bits seem quite tricky (and the first part, defining the problem, is something where someone developing it on their own, without some discussion with other users and devs, could easily end up way off in the weeds).

Cheers,
Steve
On Mon, 18 Jun 2007, david@lang.hm wrote:

> do any of the text-mode browsers implement javascript?

http://links.twibright.com/

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mon, Jun 18, 2007 at 04:35:11PM -0700, Steve Atkins wrote:

> The tricky bits are going to be defining the problem and creating the
> algorithm to do the maths from input to output.

Why not methodically discuss the algorithms on pgsql-performance, thus improving the chance of being on target up front? Plus, we newbies get to see what you are thinking, thus expanding our universe. I know I'd read every word.

Thanks for doing this, btw.
It would be cool if someone started a generic configuration+benchmark utility that could be used with virtually any software. Something like this:

1. Create a configuration file parser for your specific application, be it PostgreSQL, MySQL, Apache, whatever.

2. Create a min/max or X,Y,Z configuration option file that determines which options to try, i.e.:

     shared_buffers = 1000-20000[1000]  // 1000 is the increment
     wal_buffers = 8,16,32
     ...

3. Create start/stop scripts for the specific application.

4. Create a benchmark script for the application that returns relevant metrics. In PostgreSQL's case it would probably be tied in to pgbench; in Apache's case, ab. This utility would of course need to know how to read the metrics to determine what is "best".

5. Run the utility. Ideally it would use some sort of genetic algorithm to benchmark the application initially to get base numbers, then one by one apply the different configuration options and re-run the benchmark. It would output the metrics for each run and, once it is done, pick the best run and tell you what those settings are.

I don't think something like this would be very difficult at all to write, and it would be modular enough to work for virtually any application. For a database it would take a while to run depending on the benchmark script, but you could even have a "fast" and a "slow" benchmark script that could easily be run when you first install PostgreSQL. This way you're not worrying about how much memory the system has, or how many disks it has, etc. The system will figure out the best possible settings for a specific benchmark. Not to mention people could easily take a SQL log of their own application running and use that as the benchmark to get "real world" numbers.

Any other sort of configuration "suggestion" utility will always face the question: what do you recommend? How much data do you try to get, and what can be determined from that data to get the best settings? Is it really going to be that much better than the default, at least enough better to warrant the work and effort put into it?

On Mon, 2007-06-18 at 10:04 -0500, Campbell, Lance wrote:

> The only major problem area I have found where PostgreSQL is really
> lacking is in "what should my initial configuration settings be?"
> [...]
> My web team would be very happy to develop this for the PostgreSQL
> project.

--
Mike Benoit <ipso@snappymail.ca>
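[A very rough sketch of the batch-sweep part of this idea, in Python. It assumes a local PostgreSQL instance whose postgresql.conf can be appended to and restarted with pg_ctl, and a pgbench-initialized test database; the data directory path, the parameter grid, and "best = highest tps" are illustrative assumptions, not a worked-out utility.]

# Illustrative sketch: sweep a few postgresql.conf settings, restart the
# server, run pgbench for each combination, and report the best TPS seen.
# Paths, the parameter grid, and the "best" criterion are assumptions.
import itertools, re, subprocess

PGDATA = "/var/lib/postgresql/data"        # assumed data directory
CONF = PGDATA + "/postgresql.conf"
GRID = {
    "shared_buffers": ["128MB", "512MB", "1GB"],
    "wal_buffers": ["64kB", "256kB", "1MB"],
}

def apply_settings(settings):
    # Append overrides; the last occurrence of a setting in the file wins.
    with open(CONF, "a") as f:
        f.write("\n# --- sweep overrides ---\n")
        for name, value in settings.items():
            f.write("%s = '%s'\n" % (name, value))
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "-w", "restart"])

def run_pgbench(db="bench", clients=10, transactions=1000):
    out = subprocess.check_output(
        ["pgbench", "-c", str(clients), "-t", str(transactions), db],
        text=True)
    m = re.search(r"tps = ([0-9.]+)", out)   # pgbench reports "tps = ..." lines
    return float(m.group(1)) if m else 0.0

best = None
for combo in itertools.product(*GRID.values()):
    settings = dict(zip(GRID.keys(), combo))
    apply_settings(settings)
    tps = run_pgbench()
    print(settings, "->", tps, "tps")
    if best is None or tps > best[1]:
        best = (settings, tps)

print("best settings found:", best)

[As the next reply points out, deciding what "best" means is the genuinely hard part; a harness like this only automates the grunt work of running the combinations.]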
On Wed, 20 Jun 2007, Mike Benoit wrote:

> It would be cool if someone started a generic configuration+benchmark
> utility that could be used with virtually any software.

It would be cool. It would also be impossible.

> Create a benchmark script for the application that returns relevant
> metrics. In PGSQL's case, it would be tied in to PG bench probably. In
> Apache's case AB. This utility would of course need to know how to read
> the metrics to determine what is "best".

The usual situation in these benchmarks is that you get parameters that adjust along a curve where there's a trade-off between, say, total throughput and worst-case latency. Specifying "best" here would require a whole specification language if you want to model how real tuning efforts work. The AB case is a little simpler, but for PostgreSQL you'd want something like "With this database and memory sizing, I want the best throughput possible where maximum latency is usually <5 seconds with 1-30 clients running this transaction, while still maintaining at least 400 TPS with up to 100 clients, and the crash recovery time can't take more than 10 minutes". There are all sorts of local min/max situations and non-robust configurations an automated tool will put you into if you don't force an exhaustive search by being very specific like this.

> I don't think something like this would be very difficult at all to
> write

Here I just smile and say that proves you've never tried to write one :) It's a really hard problem that gets harder the more you poke at it.

There's certainly lots of value to writing a utility that automatically tests out multiple parameter values in a batch and compares the results. If you're not doing that now, you should consider scripting something up that does. Going beyond that to having it pick the optimal parameters more automatically would take AI much stronger than just a genetic algorithm approach.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes:
> On Wed, 20 Jun 2007, Mike Benoit wrote:
>> I don't think something like this would be very difficult at all to
>> write

> Here I just smile and say that proves you've never tried to write one :)

I'm with Greg on this. It's not that easy to optimize in a multi-parameter space even if all conditions are favorable, and they never are.

I think what would be much more useful in the long run is some serious study of the parameters themselves. For instance, random_page_cost is a self-admitted oversimplification of reality. We know that good settings for it depend critically on how large your DB is relative to your RAM; which means there are at least two parameters there, but no one's done any serious thinking about how to disentangle 'em.

			regards, tom lane
On Wed, 20 Jun 2007, Tom Lane wrote:

> I think what would be much more useful in the long run is some serious
> study of the parameters themselves. For instance, random_page_cost is a
> self-admitted oversimplification of reality.

If I could figure out who would sponsor such a study, that's what I'd be doing right now. I have studies on many of the commit-related parameters I'll have ready in another few days; those are straightforward to map out. But you know what I have never found? A good benchmark that demonstrates how well complicated queries perform, to run studies on things like random_page_cost against. Many of the tuning knobs on the query optimizer seem very opaque to me so far, and I'm not sure how to put together a proper test to illuminate their operation and map out their useful range.

Here's an example of one of the simplest questions in this area, to demonstrate the things I wonder about. Let's say I have a properly indexed database of some moderate size, such that you're in big trouble if you do a sequential scan. How can I tell if effective_cache_size is in the right ballpark so it will do what I want to effectively navigate that? People back into a setting for that parameter right now based on memory in their system, but I never see anybody going "since your main table is X GB large, and its index is Y GB, you really need enough memory to set effective_cache_size to Z GB if you want queries/joins on that table to perform well".

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
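[A small sketch of the back-of-the-envelope check described above: compare the size of a table plus its indexes against the configured effective_cache_size. Shown in Python with psycopg2; the connection string and the table name "orders" are placeholders, and the unit conversion relies on pg_settings reporting the parameter in 8kB blocks, which the code checks rather than assumes.]

# Illustrative check: is effective_cache_size at least as large as the
# table + indexes we expect queries to navigate? Connection details and
# the table name are hypothetical placeholders.
import psycopg2

TABLE = "orders"   # hypothetical table of interest

conn = psycopg2.connect("dbname=mydb")   # placeholder DSN
cur = conn.cursor()

# Total size of the table including its indexes and TOAST data.
cur.execute("SELECT pg_total_relation_size(%s)", (TABLE,))
table_bytes = cur.fetchone()[0]

# effective_cache_size is normally stored in units of 8kB blocks;
# pg_settings.unit tells us which unit actually applies.
cur.execute("SELECT setting::bigint, unit FROM pg_settings "
            "WHERE name = 'effective_cache_size'")
setting, unit = cur.fetchone()
cache_bytes = setting * 8192 if unit == '8kB' else setting

print("table + indexes:      %.1f GB" % (table_bytes / 1024.0**3))
print("effective_cache_size: %.1f GB" % (cache_bytes / 1024.0**3))
if cache_bytes < table_bytes:
    print("effective_cache_size is smaller than this working set; "
          "the planner may be pessimistic about index use")

cur.close()
conn.close()

[This only addresses the "is it in the ballpark" half of the question; whether the planner actually behaves better at a given value still takes benchmarking.]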
Greg Smith <gsmith@gregsmith.com> writes:
> On Wed, 20 Jun 2007, Tom Lane wrote:
>> I think what would be much more useful in the long run is some serious
>> study of the parameters themselves. For instance, random_page_cost is a
>> self-admitted oversimplification of reality.

> If I could figure out who would sponsor such a study that's what I'd be
> doing right now.

Hmm ... Sun? EDB? Greenplum? [I'm afraid Red Hat is not likely to step up to the plate right now, they have other priorities]

> Many of the tuning knobs on the query optimizer seem very opaque to me
> so far,

At least some of them are demonstrably broken. The issue here is to develop a mental model that is both simple enough to work with, and rich enough to predict real-world behavior.

> Here's an example of one of the simplest questions in this area to
> demonstrate things I wonder about. Let's say I have a properly indexed
> database of some moderate size such that you're in big trouble if you
> do a sequential scan. How can I tell if effective_cache_size is in the
> right ballpark so it will do what I want to effectively navigate that?

As the guy who put in effective_cache_size, I'd say it's on the broken side of the fence. Think about how to replace it with a more useful parameter, not how to determine a good value for it. "Useful" means both "easy to determine a value for" and "strong for estimating query costs", which are contradictory to some extent, but that's the problem to be solved --- and effective_cache_size doesn't really win on either metric.

To me, the worst catch-22 we face in this area is that we'd like the optimizer's choices of plan to be stable and understandable, but the real-world costs of queries depend enormously on short-term conditions such as how much of the table has been sucked into RAM recently by other queries. I have no good answer to that one.

			regards, tom lane
> To me, the worst catch-22 we face in this area is that we'd like the
> optimizer's choices of plan to be stable and understandable, but the
> real-world costs of queries depend enormously on short-term conditions
> such as how much of the table has been sucked into RAM recently by
> other queries. I have no good answer to that one.

Yeah, there is currently no way to tell the optimizer things like:

- this table/portion of a table is not frequently accessed, so it won't be in the cache, so please use low-seek plans (like bitmap index scan)

- this table/portion of a table is used all the time, so high-seek-count plans can be used, like index scan or nested loops, since everything is in RAM

Except planner hints (argh) I see no way to give this information to the machine... since it's mostly in the mind of the DBA. Maybe a per-table "cache temperature" param (hot, warm, cold), but what about the log table, the end of which is cached, but not the old records? It's messy.

Still, PG does a pretty excellent job most of the time.
On Wed, 20 Jun 2007, PFC wrote:

> Except planner hints (argh) I see no way to give this information to the
> machine... since it's mostly in the mind of the DBA.

And the mind of the DBA has a funny way of being completely wrong some days about what's really happening under the hood.

> Maybe a per-table "cache temperature" param (hot, warm, cold), but what
> about the log table, the end of which is cached, but not the old
> records? It's messy.

One of the things that was surprising to me when I started looking at the organization of the PostgreSQL buffer cache is how little gross information about its contents is available. I kept expecting to find a summary section where you could answer questions like "how much of the cache currently has information about index/table X?" used as an input to the optimizer. I understand that the design model expects much of this is unknowable due to the interaction with the OS cache, and in earlier versions you couldn't make shared_buffers big enough for its contents to be all that interesting, so until recently this wasn't worth collecting.

But in the current era, where it's feasible to have multi-GB caches efficiently managed by PG and one can expect processor time is relatively cheap, it seems to me one way to give a major boost to the optimizer is to add some overhead to buffer cache management so it collects such information. When I was trying to do a complete overhaul on the background writer, the #1 problem was that I had to assemble my own statistics on what was inside the buffer cache as it was scanned, because a direct inspection of every buffer is the only way to know things like what percentage of the cache is currently dirty.

I can't figure out if I'm relieved or really worried to discover that Tom isn't completely sure what to do with effective_cache_size either.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
It is amazing how many times you can read something before it actually sinks in. There seem to be two possible approaches to optimizing PostgreSQL 8.2:

File caching approach:
This approach is based on the fact that the OS will cache the necessary PostgreSQL files. The key here is to set the value of effective_cache_size as high as you think the OS has memory available to cache the files. This approach would need the value of shared_buffers to be relatively low; otherwise you are in a sense storing the data twice. One would also have to make sure that work_mem is not too high. Since the files would be cached by the OS, work_mem could be relatively low. This is an ideal approach if you have a dedicated server, since there would be no other software using memory or accessing files that the OS would try to cache.

Memory driven approach:
In this approach you want to set a large value for shared_buffers. You are relying on shared_buffers to hold the most commonly accessed disk blocks. The value for effective_cache_size would be relatively small, since you are not relying on the OS to cache files. This seems like it would be the ideal situation if you have other applications running on the box. By setting shared_buffers to a high value you are guaranteeing memory available to PostgreSQL (this assumes the other applications did not suck up too much memory and force your OS to use virtual memory). This also seems more like how Oracle approaches things.

Do I understand the possible optimization paths correctly? The only question I have about this approach is: if I use the "memory driven approach", since effective_cache_size would be small I would assume I would need to fiddle with random_page_cost, since there would be no way for PostgreSQL to know I have a well-configured system.

If everything I said is correct, then I agree: "Why have effective_cache_size?" Why not just go down the path that Oracle has taken and require people to rely more on shared_buffers and the general memory driven approach? Why rely on the disk caching of the OS? Memory is only getting cheaper.

Thanks,

Lance Campbell
Project Manager/Software Architect
Web Services at Public Affairs
University of Illinois
217.333.0382
http://webservices.uiuc.edu

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Greg Smith
Sent: Wednesday, June 20, 2007 10:21 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Volunteer to build a configuration tool

One of the things that was surprising to me when I started looking at the organization of the PostgreSQL buffer cache is how little gross information about its contents is available. [...]
Campbell, Lance wrote:
> It is amazing how many times you can read something before it actually
> sinks in.
>
> There seem to be two possible approaches to optimizing PostgreSQL 8.2:

Right.

> File caching approach:
> This approach is based on the fact that the OS will cache the necessary
> PostgreSQL files. [...] One would also have to make sure that work_mem
> is not too high. Since the files would be cached by the OS, work_mem
> could be relatively low.

There's no particular danger in setting work_mem too high in this approach. In fact, it's more important to avoid a too-large work_mem setting with the other approach, because if you set it too high you can force the system to swap, while with the "file caching approach" the OS will just evict some of the cached pages to make room for sorts etc.

> Memory driven approach:
> In this approach you want to set a large value for shared_buffers.
> You are relying on shared_buffers to hold the most commonly accessed
> disk blocks. The value for effective_cache_size would be relatively
> small since you are not relying on the OS to cache files.

effective_cache_size should be set to the estimated amount of memory available for caching, *including* shared_buffers. So it should be set to a similar value in both approaches.

> This seems like it would be the ideal situation if you have other
> applications running on the box.

Actually it's the opposite. If there are other applications competing for the memory, it's better to let the OS manage the cache, because it can make decisions on which pages to keep in cache and which to evict across all applications.

> By setting shared_buffers to a high value you are guaranteeing memory
> available to PostgreSQL (this assumes the other applications did not
> suck up too much memory and force your OS to use virtual memory).

You're guaranteeing memory available to PostgreSQL, at the cost of said memory being unavailable to other applications. Or, as you point out, in the worst case you end up swapping.

> Do I understand the possible optimization paths correctly? The only
> question I have about this approach is: if I use the "memory driven
> approach", since effective_cache_size would be small I would assume I
> would need to fiddle with random_page_cost, since there would be no way
> for PostgreSQL to know I have a well-configured system.

I don't see how effective_cache_size or the other settings affect random_page_cost. random_page_cost should mostly depend on your I/O hardware, though I think it's common practice to lower it when your database is small enough to fit mostly or completely in cache, on the grounds that random access in memory is almost as fast as sequential access.

> If everything I said is correct, then I agree: "Why have
> effective_cache_size?" Why not just go down the path that Oracle has
> taken and require people to rely more on shared_buffers and the general
> memory driven approach? Why rely on the disk caching of the OS?
> Memory is only getting cheaper.

That has been discussed before many times; search the archives on direct I/O for previous flamewars on that subject. In a nutshell, we rely on the OS to not only do caching for us, but I/O scheduling and readahead as well. That saves us a lot of code, and the OS is in a better position to do that anyway, because it knows the I/O hardware and disk layout and can issue the I/O requests in the most efficient way.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
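[A tiny worked example of the sizing rule stated above, that effective_cache_size estimates all memory available for caching including shared_buffers. Every number here is an assumption chosen for illustration, not a recommendation.]

# Worked example: effective_cache_size ~= shared_buffers + memory the OS
# can use for its page cache. The figures below are illustrative only.
total_ram_gb = 8.0
os_and_other_gb = 1.0          # OS, monitoring, other daemons (assumed)
shared_buffers_gb = 2.0        # whatever you chose for shared_buffers

os_cache_gb = total_ram_gb - os_and_other_gb - shared_buffers_gb
effective_cache_size_gb = shared_buffers_gb + os_cache_gb

print("effective_cache_size ~= %.0fGB" % effective_cache_size_gb)  # ~7GB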
On Wed, 2007-06-20 at 11:21 -0400, Greg Smith wrote:

> But in the current era, where it's feasible to have multi-GB caches
> efficiently managed by PG and one can expect processor time is
> relatively cheap, it seems to me one way to give a major boost to the
> optimizer is to add some overhead to buffer cache management so it
> collects such information. [...]

One problem with feeding the current state of the buffer cache to the planner is that the planner may be trying to prepare a plan which will execute 10,000 times. For many interesting queries, the state of the cache will be very different after the first execution, as indexes and active portions of tables are brought in. For that matter, an early stage of query execution could significantly change the contents of the buffer cache as seen by a later stage of the execution, even inside a single query. I'm not saying that inspecting the buffer cache more is a bad idea, but gathering useful information with the current planner is a bit tricky.

For purposes of idle speculation, one could envision some non-trivial changes to PG which would make really slick use of this data:

(1) Allow PG to defer deciding whether to perform an index scan or sequential scan until the moment it is needed, and then ask the buffer cache what % of the pages from the relevant indexes/tables are currently cached.

(2) Automatically re-plan prepared queries with some kind of frequency (exponential in # of executions? fixed-time?), to allow the plans to adjust to changes in the buffer cache.

Besides being hard to build, the problem with these approaches (or any other approach which takes current temporary state into account) is that as much as some of us might want to make use of every piece of data available to make the planner into a super-brain, there are lots of other folks who just want plan stability. The more dynamic the system is, the less predictable it can be, and especially in mission-critical stuff, predictability matters more than marginal performance gains.

Tom said it really well in a recent post: "To me, the worst catch-22 we face in this area is that we'd like the optimizer's choices of plan to be stable and understandable, but the real-world costs of queries depend enormously on short-term conditions such as how much of the table has been sucked into RAM recently by other queries. I have no good answer to that one."
On Wed, 20 Jun 2007, Mark Lewis wrote:

> as much as some of us might want to make use of every piece of data
> available to make the planner into a super-brain, there are lots of
> other folks who just want plan stability.

It's not like it has to be on for everybody. I look forward to the day when I could see this:

$ cat postgresql.conf | grep brain
# - Super-brain Query Optimizer -
sbqo = on                        # Enables the super-brain
sbqo_reconsider_interval = 5s    # How often to update plans
sbqo_brain_size = friggin_huge   # Possible values are wee, not_so_wee, and

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, 20 Jun 2007, Campbell, Lance wrote:

> If everything I said is correct, then I agree: "Why have
> effective_cache_size?" Why not just go down the path that Oracle has
> taken and require people to rely more on shared_buffers and the general
> memory driven approach? Why rely on the disk caching of the OS?

First off, it may help explain the dynamics here if you know that until fairly recent releases, the PostgreSQL shared_buffers cache had some performance issues that made it impractical to make it too large. It hasn't been that long that relying more heavily on the Postgres cache was technically feasible. I think the user community at large is still assimilating all the implications of that shift, and as such some of the territory around making the Postgres memory really large is still being mapped out.

There are also still some issues left in that area. For example, the bigger your shared_buffers cache is, the worse the potential is for having a checkpoint take a really long time and disrupt operations. There are OS tunables that can help work around that issue; similar ones for the PostgreSQL buffer cache won't be available until the 8.3 release.

In addition to all that, there are still several reasons to keep relying on the OS cache:

1) The OS cache memory is shared with other applications, so relying on it lowers the average memory footprint of PostgreSQL. The database doesn't have to be a pig that constantly eats all the memory up, while still utilizing it when necessary.

2) The OS knows a lot more about the disk layout and similar low-level details and can do optimizations a platform-independent program like Postgres can't assume are available.

3) There are more people working on optimizing the caching algorithms in modern operating systems than are coding on this project. Using that sophisticated cache leverages their work.

"The Oracle Way" presumes that you've got such a massive development staff that you can solve these problems better yourself than the community at large, and then support that solution on every platform. This is why they ended up with solutions like raw partitions, where they just put their own filesystem on the disk and figure out how to make that work well everywhere. If you look at trends in this area, at this point the underlying operating systems have gotten good enough that tricks like that are becoming marginal. Pushing more work toward the OS is a completely viable design choice that strengthens every year.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes:
> On Wed, 20 Jun 2007, Campbell, Lance wrote:
>> If everything I said is correct, then I agree: "Why have
>> effective_cache_size?" Why not just go down the path that Oracle has
>> taken and require people to rely more on shared_buffers and the
>> general memory driven approach? Why rely on the disk caching of the OS?

> [ reasons why snipped ]

There's another reason for not setting shared_buffers huge, beyond the good ones Greg listed: the kernel may or may not consider a large shared-memory segment as potentially swappable. If the kernel starts swapping out low-usage areas of the shared-buffer arena, you lose badly: accessing a supposedly "in cache" page takes just as long as fetching it from the disk file would've, and if a dirty page gets swapped out, you'll have to swap it back in before you can write it; making a total of *three* I/Os expended to get it down to where it should have been, not one. So unless you can lock the shared memory segment in RAM, it's best to keep it small enough that all the buffers are heavily used. Marginal-use pages will be handled much more effectively in the O/S cache.

I'd also like to re-emphasize the point about "don't be a pig if you don't have to". It would be very bad if Postgres automatically operated on the assumption that it should try to consume all available resources. Personally, I run half a dozen postmasters (of varying vintages) on one not-especially-impressive development machine. I can do this exactly because the default configuration doesn't try to eat the whole machine.

To get back to the comparison to Oracle: Oracle can assume that it's running on a dedicated machine, because their license fees are more than the price of the machine anyway. We shouldn't make that assumption, at least not in the out-of-the-box configuration.

			regards, tom lane
Tom Lane wrote:
> There's another reason for not setting shared_buffers huge, beyond the
> good ones Greg listed: the kernel may or may not consider a large
> shared-memory segment as potentially swappable.

Another is that on Windows, shared memory access is more expensive, and various people have noted that the smallest value for shared_buffers you can get away with can yield better performance as it leaves more free for the kernel to use, more efficiently.

Regards, Dave.
On Thu, Jun 21, 2007 at 03:14:48AM -0400, Greg Smith wrote:
> "The Oracle Way" presumes that you've got such a massive development
> staff that you can solve these problems better yourself than the
> community at large, and then support that solution on every platform.

Not that Greg is suggesting otherwise, but to be fair to Oracle (and other large database vendors), the raw partitions approach was also a completely sensible design decision back when they made it. In the late 70s and early 80s, the capabilities of various filesystems were wildly uneven (read the _UNIX Hater's Handbook_ on filesystems, for instance, if you want an especially jaundiced view). Moreover, since it wasn't clear that UNIX and UNIX-like things were going to become the dominant standard -- VMS was an obvious contender for a long time, and for good reason -- it made sense to have a low-level structure that you could rely on.

Once they had all that code and had made all those assumptions while relying on it, it made no sense to replace it all. It's now mostly mature and robust, and it is probably a better decision to focus on incremental improvements to it than to rip it all out and replace it with something likely to be buggy and surprising.

The PostgreSQL developers' practice of sighing gently every time someone comes along insisting that threads are keen or that shared memory sucks relies on the same, perfectly sensible premise: why throw away a working low-level part of your design to get an undemonstrated benefit and probably a whole lot of new bugs?

A

--
Andrew Sullivan | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-garde will probably become the textbook definition of Postmodernism.
                                                        --Brad Holland
Greg,

I have a PostgreSQL database that runs on a dedicated server. The server has 24 GB of memory. What would be the max size I would ever want to set shared_buffers to if I were relying on the OS for the disk caching approach? It seems that no matter how big your dedicated server is, there would be a top limit to the size of shared_buffers.

Thanks,

Lance Campbell
Project Manager/Software Architect
Web Services at Public Affairs
University of Illinois
217.333.0382
http://webservices.uiuc.edu

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Greg Smith
Sent: Thursday, June 21, 2007 2:15 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Volunteer to build a configuration tool

First off, it may help explain the dynamics here if you know that until fairly recent releases, the PostgreSQL shared_buffers cache had some performance issues that made it impractical to make it too large. [...]
On Thu, 21 Jun 2007, Campbell, Lance wrote:

> I have a PostgreSQL database that runs on a dedicated server. The
> server has 24 GB of memory. What would be the max size I would ever
> want to set shared_buffers to if I were relying on the OS for the disk
> caching approach? It seems that no matter how big your dedicated server
> is, there would be a top limit to the size of shared_buffers.

It's impossible to say exactly what would work optimally in this sort of situation. The normal range is 25-50% of total memory, but there's no hard reason for that balance; for all we know your apps might work best with 20GB in shared_buffers and only a relatively small 4GB left over for the rest of the OS to use. Push it way up and see what you get.

This is part of why the idea of an "advanced" mode for this tool is suspect. Advanced tuning usually requires benchmarking with as close to real application data as you can get in order to make good forward progress.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Jun 23, 2007, at 2:28 PM, Greg Smith wrote:

> On Thu, 21 Jun 2007, Campbell, Lance wrote:
>> I have a PostgreSQL database that runs on a dedicated server. The
>> server has 24 GB of memory. What would be the max size I would ever
>> want to set shared_buffers to if I were relying on the OS for the
>> disk caching approach?
>
> It's impossible to say exactly what would work optimally in this sort
> of situation. The normal range is 25-50% of total memory, but there's
> no hard reason for that balance; for all we know your apps might work
> best with 20GB in shared_buffers and only a relatively small 4GB left
> over for the rest of the OS to use. Push it way up and see what you
> get.
>
> This is part of why the idea of an "advanced" mode for this tool is
> suspect. Advanced tuning usually requires benchmarking with as close
> to real application data as you can get in order to make good forward
> progress.

Agreed. EnterpriseDB comes with a feature called "DynaTune" that looks at things like server memory and sets a best guess at a bunch of parameters. Truth is, it works fine for 90% of cases, because there are just a lot of installations where tuning postgresql.conf isn't that critical.

The real issue is that the "stock" postgresql.conf is just horrible. It was originally tuned for something like a 486, but even the recent changes have only brought it up to the "Pentium era" (case in point: 24MB of shared buffers equates to a machine with 128MB of memory, give or take). Given that, I think an 80% solution would be to just post small/medium/large postgresql.conf files somewhere.

I also agree 100% with Tom that the cost estimators need serious work. One simple example: nothing in the planner looks at what percent of a relation is actually in shared_buffers. If it did that, it would probably be reasonable to extrapolate that percentage into how much is sitting in kernel cache, which would likely be miles ahead of what's currently done.

--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim Nasby" <decibel@decibel.org> writes: > The real issue is that the "stock" postgresql.conf is just horrible. It was > originally tuned for something like a 486, but even the recent changes have > only brought it up to the "pentium era" (case in point: 24MB of shared buffers > equates to a machine with 128MB of memory, give or take). I think it's more that the stock configure has to assume it's not a dedicated box. Picture someone installing Postgres on their debian box because it's required for Gnucash. Even having 24M suddenly disappear from the box is quite a bit. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Lance,

> I have a PostgreSQL database that runs on a dedicated server. The
> server has 24 GB of memory. What would be the max size I would ever
> want to set shared_buffers to if I were relying on the OS for the disk
> caching approach? It seems that no matter how big your dedicated server
> is, there would be a top limit to the size of shared_buffers.

There's not, actually. Under some circumstances (mainly Solaris 10 + UFS on AMD) it can actually be beneficial to have shared_buffers be 80% of RAM and bypass the FS cache entirely. This isn't usually the case, but it's not to be ruled out.

If you're relying on the FS cache and not using direct I/O, though, you want to keep at least 50% of memory free for use by the cache. Below 50%, you lose a significant part of the benefit of the cache without losing its cost. Of course, that assumes your database is bigger than RAM; there isn't much need to have either shared_buffers or the FS cache be more than twice the size of your whole database.

In general, setting shared_buffers to 25% of RAM on a dedicated machine, and 10% (with a max of 512MB) on a shared machine, is a nice safe default which will do OK for most applications.

--
--Josh
Josh Berkus
PostgreSQL @ Sun
San Francisco
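[A quick sketch applying the closing rule of thumb above to the 24 GB dedicated server discussed earlier in the thread. The percentages and the 512MB cap come straight from that guideline; the helper function itself is just an illustration.]

# Rule of thumb as stated above: shared_buffers at 25% of RAM on a
# dedicated machine, 10% (capped at 512MB) on a shared one.
def suggested_shared_buffers_mb(total_ram_mb, dedicated=True):
    if dedicated:
        return total_ram_mb // 4
    return min(total_ram_mb // 10, 512)

ram_mb = 24 * 1024   # the 24 GB dedicated server from the thread
print("dedicated: %d MB" % suggested_shared_buffers_mb(ram_mb, True))   # 6144 MB
print("shared:    %d MB" % suggested_shared_buffers_mb(ram_mb, False))  # 512 MB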