Thread: shared_buffers advice
Hi there,

I'm after a little bit of advice on the shared_buffers setting (I have read the various docs on/linked from the performance tuning wiki page, some very helpful stuff there so thanks to those people).

I am setting up a 64-bit Linux server running PostgreSQL 8.3. The server has 64GB of memory and Postgres is the only major application running on it. (This server is to go alongside some existing 8.3 servers; we will look at 8.4/9 migration later.)

I'm basically wondering how the PostgreSQL cache (ie shared_buffers) and the OS page cache interact. The general advice seems to be to assign 1/4 of RAM to shared buffers.

I don't have a good knowledge of the internals, but I'm wondering if this will effectively mean that roughly the same amount of RAM in the OS page cache will be spent redundantly caching things that Postgres is caching as well? IE when Postgres reads something from disk it will go into both the OS page cache and the PostgreSQL shared_buffers, and the OS page cache copy is unlikely to be useful for anything.

If that is the case, what are the downsides to having less overlap between the caches, IE heavily favouring one or the other, such as allocating shared_buffers a much larger percentage of memory (such as 90-95% of expected 'free' memory)?

Paul
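P.S. For concreteness, the two approaches I'm weighing would look something like this in postgresql.conf (numbers are just a sketch for a 64GB box, following the 1/4 advice vs. the "give it nearly everything" idea):

    # conventional 25%-of-RAM starting point:
    shared_buffers = 16GB          # 1/4 of RAM for the Postgres buffer cache
    effective_cache_size = 48GB    # planner hint: the rest is OS page cache

    # vs. heavily favouring shared_buffers:
    #shared_buffers = 56GB         # most of 'free' memory
    #effective_cache_size = 58GB
    # (on Linux the kernel shmmax/shmall limits would need raising to match)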
There seems to be a wide range of opinion on this .... I am new to PG and grew up on Oracle, where more SGA is always a good thing ... I know people who run Oracle on 2TB Superdomes with titanic SGA sizes to keep the whole DB in RAM. I'd be using a 40GB+ Oracle SGA on that box of yours.
A lot of the folks here say that there isn't much performance to be gained by giving PG's buffer cache the bulk of the RAM .... if you have the opportunity, it'd be interesting to test it both ways with a representative workload before going live. I for one would be very curious to see the results.
One caveat: the PG child processes can collectively use a lot of transient RAM, e.g. for sorts and vacuuming, depending on config and workload. If this causes swapping, or even uses up enough memory to effectively eliminate the OS buffer cache, it's going to hurt performance.
Cheers
Dave
On Wed, Mar 10, 2010 at 8:28 PM, Paul McGarry <paul@paulmcgarry.com> wrote:
> Hi there,
>
> I'm after a little bit of advice on the shared_buffers setting (I have
> read the various docs on/linked from the performance tuning wiki page,
> some very helpful stuff there so thanks to those people).
>
> I am setting up a 64bit Linux server running Postgresql 8.3, the
> server has 64gigs of memory and Postgres is the only major application
> running on it. (This server is to go alongside some existing 8.3
> servers, we will look at 8.4/9 migration later)
>
> I'm basically wondering how the postgresql cache (ie shared_buffers)
> and the OS page_cache interact. The general advice seems to be to
> assign 1/4 of RAM to shared buffers.
>
> I don't have a good knowledge of the internals but I'm wondering if
> this will effectively mean that roughly the same amount of RAM being
> used for the OS page cache will be used for redundantly caching
> something that Postgres is caching as well?
>
> IE when Postgres reads something from disk it will go into both the OS
> page cache and the Postgresql shared_buffers and the OS page cache
> copy is unlikely to be useful for anything.
>
> If that is the case what are the downsides to having less overlap
> between the caches, IE heavily favouring one or the other, such as
> allocating shared_buffers a much larger percentage (such as 90-95%
> of expected 'free' memory).
>
> Paul
Dave Crooke wrote:
> There seems to be a wide range of opinion on this .... I am new to PG
> and grew up on Oracle, where more SGA is always a good thing ... I know
> people who run Oracle on 2TB Superdomes with titanic SGA sizes to keep
> the whole DB in RAM. I'd be using a 40GB+ Oracle SGA on that box of
> yours.

I wouldn't call it opinion so much as a series of anecdotes all suggesting the same thing: that you cannot translate SGA practice into PostgreSQL and expect that to work the same way. Some data points:

-An academic study at Duke suggested 40% of RAM was optimal for their mixed workload, but that was a fairly small amount of RAM:
http://www.cs.duke.edu/~shivnath/papers/ituned.pdf

-Tests done by Jignesh Shah at Sun not too long ago put diminishing returns on a system with a bunch of RAM at 10GB, probably due to buffer lock contention issues (details beyond that number not in the slides; recalling from memory of the talk itself):
http://blogs.sun.com/jkshah/entry/postgresql_east_2008_talk_best

-My warnings about downsides related to checkpoint issues with larger buffer pools aren't an opinion at all; they're facts based on limitations in how Postgres does its checkpoints. If we get something more like Oracle's incremental checkpoint logic, this particular concern might go away.

-Concerns about swapping, work_mem, etc. are all very real. All of us who have had the database server process killed by the Linux OOM killer at least once know that's one OS you absolutely cannot push this too hard on. This is not unique to Postgres; the issue exists in Oracle+SGA land as well:
http://lkml.indiana.edu/hypermail/linux/kernel/0103.3/0906.html

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
> -My warnings about downsides related to checkpoint issues with larger
> buffer pools aren't an opinion at all; they're facts based on
> limitations in how Postgres does its checkpoints. If we get something
> more like Oracle's incremental checkpoint logic, this particular
> concern might go away.

Does PG issue checkpoint writes in "sorted" order?

I wonder about something, too: if your DB size is smaller than RAM, you could in theory set shared_buffers to a size larger than your DB, provided you still have enough free RAM left for work_mem and OS write management. How does this interact with the logic which prevents seq scans hogging shared_buffers?
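(For reference, checking whether the whole DB would fit is simple enough; this should work on 8.3:)

    -- total on-disk size of the current database
    SELECT pg_size_pretty(pg_database_size(current_database()));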
On Tue, Mar 16, 2010 at 7:24 AM, Pierre C <lists@peufeu.com> wrote:
> I wonder about something, too: if your DB size is smaller than RAM, you
> could in theory set shared_buffers to a size larger than your DB,
> provided you still have enough free RAM left for work_mem and OS write
> management. How does this interact with the logic which prevents seq
> scans hogging shared_buffers?
I think the logic you are referring to is the clock sweep buffer accounting scheme. That just makes sure that the most popular pages stay in the buffers. If your entire db fits in the buffer pool then it'll all get in there real fast.
Two things to consider though:
1. The checkpoint issue still stands.
2. You should really mess around with your cost estimates if this is the case. If you make random IO cost the same as sequential IO, postgres will prefer index scans over bitmap index scans and table scans, which makes sense if everything is in memory.
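For example, something along these lines, either per-session or in postgresql.conf (values illustrative; the point is making random and sequential page costs equal):

    -- for a fully-cached database:
    SET seq_page_cost = 1.0;     -- the default
    SET random_page_cost = 1.0;  -- default is 4.0; equal costs favour index scans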
> I think the logic you are referring to is the clock sweep buffer
> accounting scheme. That just makes sure that the most popular pages
> stay in the buffers. If your entire db fits in the buffer pool then
> it'll all get in there real fast.

Actually, I meant that in the case of a seq scan, PG will try to use just a few buffers (a ring) in shared_buffers instead of thrashing the whole buffer cache. But if there was actually a lot of free space in shared_buffers, do the pages stay, or do they not?
"Pierre C" <lists@peufeu.com> writes: > Does PG issue checkpoint writes in "sorted" order ? No. IIRC, a patch for that was submitted, and rejected because no significant performance improvement could be demonstrated. We don't have enough information about the actual on-disk layout to be very intelligent about this, so it's better to just issue the writes and let the OS sort them. regards, tom lane
On Tue, Mar 16, 2010 at 1:48 PM, Pierre C <lists@peufeu.com> wrote:
> Actually, I meant that in the case of a seq scan, PG will try to use
> just a few buffers (a ring) in shared_buffers instead of thrashing the
> whole buffer cache. But if there was actually a lot of free space in
> shared_buffers, do the pages stay, or do they not?

They don't. The logic only kicks in if the table is expected to be > 1/4 of shared buffers though.

-- 
greg
On Tue, Mar 16, 2010 at 2:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Pierre C" <lists@peufeu.com> writes:
>> Does PG issue checkpoint writes in "sorted" order ?
>
> No. IIRC, a patch for that was submitted, and rejected because no
> significant performance improvement could be demonstrated. We don't
> have enough information about the actual on-disk layout to be very
> intelligent about this, so it's better to just issue the writes and
> let the OS sort them.

Keep in mind that postgres is issuing writes to the OS buffer cache. It defers fsyncing the files as late as it can in the hopes that most of those buffers will be written out by the OS before then. That gives the OS a long time window in which to flush them out in whatever order and whatever schedule is most convenient.

If the OS filesystem buffer cache is really small then that might not work so well. It might be worth rerunning those benchmarks on a machine with shared buffers taking up all of RAM.

-- 
greg
Pierre C wrote:
> Actually, I meant that in the case of a seq scan, PG will try to use
> just a few buffers (a ring) in shared_buffers instead of thrashing the
> whole buffer cache. But if there was actually a lot of free space in
> shared_buffers, do the pages stay, or do they not?

Pages inserted into the ring buffer and later re-used for new data do not stay behind even if there is room for them. There's a potential improvement possible in that code involving better management of the situation where the buffer cache hasn't actually reached full capacity yet, but as it's an unusual case it's hard to justify optimizing for.

Besides, the hope is that in this case the OS cache will end up caching everything anyway until it has a reason to evict it. So if you follow the rest of the data suggesting you should not give all the memory to PostgreSQL to manage, you end up with a reasonable solution to this problem anyway. Those pages will just live in the OS cache instead of the database's, with only a few trickling in and staying behind each time you do a sequential scan.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Greg Stark wrote:
> On Tue, Mar 16, 2010 at 2:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Pierre C" <lists@peufeu.com> writes:
>>> Does PG issue checkpoint writes in "sorted" order ?
>>
>> No. IIRC, a patch for that was submitted, and rejected because no
>> significant performance improvement could be demonstrated.
>
> If the OS filesystem buffer cache is really small then that might not
> work so well. It might be worth rerunning those benchmarks on a
> machine with shared buffers taking up all of RAM.

Here's the original patch again:
http://archives.postgresql.org/message-id/20080415181742.6C97.52131E4D@oss.ntt.co.jp

I was the person who tried to reproduce the suggested 10% pgbench speedup on a similar system and couldn't replicate any improvement. Never was sure what was going on to show such a difference on the reference system used to develop the patch versus mine, since they were pretty similar. Possibly some positive interaction with LVM in the test case I didn't have. Maybe the actual reason sorting helped was limitations in the HP P400 controller used there that I wasn't running into with the Areca card I used. And the always popular "didn't account fully for all pgbench run to run variation" possibility crossed my mind too--that the original observed speedup wasn't caused by the patch but by something else.

I did not go out of my way to find test conditions where the patch would be more likely to help, like the situation you describe where shared_buffers is really large relative to the OS cache. Since the patch complicates the checkpoint code and requires some working memory to operate, it would have to be an unquestionable win using standard practices before it was worth applying. If it only helps in a situation people are unlikely to use in the field, and is a net negative for everyone else, that's still going to end up on the interesting-but-rejected idea scrapheap at the end of the day.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Greg Stark wrote:
> On Tue, Mar 16, 2010 at 2:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> No. IIRC, a patch for that was submitted, and rejected because no
>> significant performance improvement could be demonstrated. We don't
>> have enough information about the actual on-disk layout to be very
>> intelligent about this, so it's better to just issue the writes and
>> let the OS sort them.
>
> Keep in mind that postgres is issuing writes to the OS buffer cache.
> It defers fsyncing the files as late as it can in the hopes that most
> of those buffers will be written out by the OS before then. That gives
> the OS a long time window in which to flush them out in whatever order
> and whatever schedule is most convenient.

Maybe it would make more sense to try to reorder the fsync calls instead.

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Maybe it would make more sense to try to reorder the fsync calls
> instead.

Reorder to what, though? You still have the problem that we don't know much about the physical layout on-disk.

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Maybe it would make more sense to try to reorder the fsync calls
>> instead.
>
> Reorder to what, though? You still have the problem that we don't know
> much about the physical layout on-disk.

Well, to block numbers as a first step.

However, this reminds me that sometimes we take the block-at-a-time extension policy too seriously. We had a customer that had a performance problem because they were inserting lots of data into TOAST tables, causing very frequent extensions. I kept wondering whether an allocation policy that allocated several new blocks at a time could be useful (but I didn't try it). This would also alleviate fragmentation, thus helping the physical layout be more similar to logical block numbers.

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Reorder to what, though? You still have the problem that we don't know
>> much about the physical layout on-disk.
>
> Well, to block numbers as a first step.

fsync is a file-based operation, and we know exactly zip about the relative positions of different files on the disk.

> However, this reminds me that sometimes we take the block-at-a-time
> extension policy too seriously.

Yeah, that's a huge performance penalty in some circumstances.

> We had a customer that had a performance problem because they were
> inserting lots of data into TOAST tables, causing very frequent
> extensions. I kept wondering whether an allocation policy that
> allocated several new blocks at a time could be useful (but I didn't
> try it). This would also alleviate fragmentation, thus helping the
> physical layout be more similar to logical block numbers.

That's not going to do anything towards reducing the actual I/O volume. Although I suppose it might be useful if it just cuts the number of seeks.

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Well, to block numbers as a first step.
>
> fsync is a file-based operation, and we know exactly zip about the
> relative positions of different files on the disk.

Doh, right, I was thinking of the sync_file_range kind of API.

> That's not going to do anything towards reducing the actual I/O volume.
> Although I suppose it might be useful if it just cuts the number of
> seeks.

Oh, they had no problems with I/O volume. It was the relation extension lock that was heavily contended for them.

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> That's not going to do anything towards reducing the actual I/O
>> volume. Although I suppose it might be useful if it just cuts the
>> number of seeks.
>
> Oh, they had no problems with I/O volume. It was the relation extension
> lock that was heavily contended for them.

Really? I guess that serialized all the I/O ... I'll bet if we got rid of that locking somehow, they *would* have a problem with I/O volume.

			regards, tom lane
Tom Lane wrote:
> Really? I guess that serialized all the I/O ... I'll bet if we got rid
> of that locking somehow, they *would* have a problem with I/O volume.

Well, that would solve the problem as far as I'm concerned, and they'd have to start talking to their storage provider ;-)

-- 
Alvaro Herrera    http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
> Maybe it would make more sense to try to reorder the fsync calls
> instead.

The pretty obvious idea left behind from the 8.3 spread checkpoint development was to similarly spread the fsync calls around. Given that we know, for example, that Linux with ext3 is going to dump the whole filesystem write cache out when the fsync call comes in, the way they're currently scheduled has considerable potential for improvement.

Unfortunately, since the tuning on that is going to be very platform dependent and require a lot of benchmarking work, I think we need a performance farm up and running as a prerequisite to finishing that work off. The spread checkpoint stuff was a much more obvious improvement, and that was hard enough to quantify usefully and test.

Returning to the idea of the sorted checkpoints patch as a simple example: if it were possible to just push that patch to a test repo and see how that changed typical throughput/latency against a well-established history, it would be a lot easier to figure out whether something like that is sensible to consider or not. I'm not sure how to make progress on similar ideas about tuning closer to the filesystem level without having something automated that takes over the actual benchmark running and data recording steps; it's just way too time consuming to do those right now with every tool that's available for PostgreSQL so far. That's the problem I work on; there are easily a half dozen good ideas for improvements here floating around where coding time is dwarfed by required performance validation time.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
See below:
> Hi there,
>
> I'm after a little bit of advice on the shared_buffers setting (I have
> read the various docs on/linked from the performance tuning wiki page,
> some very helpful stuff there so thanks to those people).
>
> I am setting up a 64bit Linux server running Postgresql 8.3, the
> server has 64gigs of memory and Postgres is the only major application
> running on it. (This server is to go alongside some existing 8.3
> servers, we will look at 8.4/9 migration later)
>
> I'm basically wondering how the postgresql cache (ie shared_buffers)
> and the OS page_cache interact. The general advice seems to be to
> assign 1/4 of RAM to shared buffers.
>
> I don't have a good knowledge of the internals but I'm wondering if
> this will effectively mean that roughly the same amount of RAM being
> used for the OS page cache will be used for redundantly caching
> something that Postgres is caching as well?
>
> IE when Postgres reads something from disk it will go into both the OS
> page cache and the Postgresql shared_buffers and the OS page cache
> copy is unlikely to be useful for anything.
>
> If that is the case what are the downsides to having less overlap
> between the caches, IE heavily favouring one or the other, such as
> allocating shared_buffers a much larger percentage (such as 90-95%
> of expected 'free' memory).
Pg apparently does not have an option to use direct IO for reads, which some other databases have (the O_DIRECT mode). Therefore, double-buffering of read operations seems unavoidable. Counterintuitively, it may be a good idea to just rely on OS buffering and keep shared_buffers rather small, say, 512MB.
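A sketch of that approach on the 64GB box in question (numbers purely illustrative):

    shared_buffers = 512MB        # small PG cache; let the kernel do the caching
    effective_cache_size = 56GB   # tell the planner most of RAM is OS page cache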
VJ
Greg Smith <greg@2ndquadrant.com> writes:
> I'm not sure how to make progress on similar ideas about tuning closer
> to the filesystem level without having something automated that takes
> over the actual benchmark running and data recording steps; it's just
> way too time consuming to do those right now with every tool that's
> available for PostgreSQL so far. That's the problem I work on; there
> are easily a half dozen good ideas for improvements here floating
> around where coding time is dwarfed by required performance validation
> time.

I still think the best tool around currently for this kind of testing is tsung, but I've yet to have the time to put money where my mouth is, as they say. Still, I'd be happy to take some time and help you decide if it's the tool you want to base your performance testing suite on or not.

  http://tsung.erlang-projects.org/

Regards,
-- 
dim
Dimitri Fontaine wrote:
> I still think the best tool around currently for this kind of testing
> is tsung

I am happy to say that for now, pgbench is the only actual testing tool supported. Done; now I don't need tsung.

However, that doesn't actually solve any of the problems I was talking about, which is why I'm not even talking about that part. We need the glue to pull out software releases, run whatever testing tool is appropriate, and then save the run artifacts in some standardized form so they can be referenced with associated build/configuration information to track down a regression when it does show up. Building those boring bits is the real issue here; load testing tools are easy to find because those are fun to work on.

And as a general commentary on the vision here, tsung will never fit into this anyway because "something that can run on the buildfarm machines with the software they already have installed" is the primary target. I don't see anything about tsung so interesting that it trumps that priority, even though it is an interesting tool.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
Greg Smith <greg@2ndquadrant.com> writes:
> However, that doesn't actually solve any of the problems I was talking
> about, which is why I'm not even talking about that part. We need the
> glue to pull out software releases, run whatever testing tool is
> appropriate, and then save the run artifacts in some standardized form
> so they can be referenced with associated build/configuration
> information to track down a regression when it does show up.

Oh, OK. I misparsed the previous message. Tsung has a way to monitor OS level information, and I guess adding the build/configuration would be... as easy as adding it to pgbench :)

> And as a general commentary on the vision here, tsung will never fit
> into this anyway because "something that can run on the buildfarm
> machines with the software they already have installed" is the primary
> target.

I thought we might talk about a performance farm, which would be quite different, if only because to sustain a high enough client load you might need more than one injector machine targeting a given server at once. But if you target the buildfarm, introducing new dependencies does sound like a problem (one whose importance I can't evaluate).

Regards,
-- 
dim
2010/3/11 Paul McGarry <paul@paulmcgarry.com>:
> I'm basically wondering how the postgresql cache (ie shared_buffers)
> and the OS page_cache interact. The general advice seems to be to
> assign 1/4 of RAM to shared buffers.
>
> I don't have a good knowledge of the internals but I'm wondering if
> this will effectively mean that roughly the same amount of RAM being
> used for the OS page cache will be used for redundantly caching
> something that Postgres is caching as well?

I have a similar problem but I can't see an answer in this thread.

Our dedicated server has 16 GB RAM. Among other settings, shared_buffers is 2 GB and effective_cache_size is 12 GB.

Do shared_buffers duplicate contents of the OS page cache? If so, how do I know if 25% of RAM is the right value for me? Actually it would not seem to be true - the less redundancy the better.

Another question - is there a tool or built-in statistic that tells when/how often/how much a table is read from disk? I mean a physical read, not a pull from the OS cache into shared_buffers.

-- 
Konrad Garus
On May 24, 2010, at 4:25 AM, Konrad Garus wrote:
> Do shared_buffers duplicate contents of the OS page cache? If so, how
> do I know if 25% of RAM is the right value for me? Actually it would
> not seem to be true - the less redundancy the better.

You can look into the pg_buffercache contrib module.

> Another question - is there a tool or built-in statistic that tells
> when/how often/how much a table is read from disk? I mean a physical
> read, not a pull from the OS cache into shared_buffers.

Well, the pg_stat_* tables tell you how much logical IO is going on, but postgres has no way of knowing how effective the OS or disk controller caches are.
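For example, something like this shows the per-table numbers (heap_blks_read counts block requests that missed shared_buffers; they may still have been served from the OS cache rather than the disk):

    SELECT relname, heap_blks_read, heap_blks_hit,
           idx_blks_read, idx_blks_hit
    FROM pg_statio_user_tables
    ORDER BY heap_blks_read DESC
    LIMIT 10;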
On Wed, Mar 10, 2010 at 9:28 PM, Paul McGarry <paul@paulmcgarry.com> wrote:
> I'm basically wondering how the postgresql cache (ie shared_buffers)
> and the OS page_cache interact. The general advice seems to be to
> assign 1/4 of RAM to shared buffers.
> [...]
> If that is the case what are the downsides to having less overlap
> between the caches, IE heavily favouring one or the other, such as
> allocating shared_buffers a much larger percentage (such as 90-95%
> of expected 'free' memory).

I've personally heard tons of anecdotal evidence wrt the shared buffers setting. There is a bit of benchmarking info suggesting you can eke out marginal gains via the shared buffers setting, but you have to take (unfortunately) o/s, hardware, filesystem and other factors all into account.

Here is what I'm pretty confident about saying:

*) a page fault to disk is a much bigger deal than a fault to pg cache vs os cache. Many people assume that raising shared buffers decreases the chance of a disk fault. It doesn't -- at least not in the simple way you would think -- all modern o/s aggressively cache filesystem data already, so we are simply layering over the o/s cache. If your database is really big, anything that reduces disk faults is a win and anything that increases them is a loss. tps measurements according to pgbench are not as interesting to me as iops from the disk system.

*) shared buffer effects are hard to detect in the single user case. A single 'non disk bound' large query will perform pretty much the same regardless of how you set shared buffers. In other words, you will not be able to easily measure the differences in the setting outside of a real or simulated production workload.

*) shared_buffers is one of the _least_ important performance settings in postgresql.conf. Many settings, like work_mem, planner tweaks, commit settings, autovacuum settings, can dramatically impact your workload performance in spectacular ways, but tend to be 'case by case' specific. shared_buffers affects _everything_, albeit in very subtle ways, so you have to be careful.

*) I sometimes wonder if the o/s should just manage everything. We just said goodbye to the FSM (thank goodness for that!) -- what about a fully o/s managed cache? goodbye SysV IPC? Note my views here are very simplistic -- I don't have anything close to a full understanding of the cache machinery in the database.

merlin
2010/5/24 Merlin Moncure <mmoncure@gmail.com>:
> *) a page fault to disk is a much bigger deal than a fault to pg cache
> vs os cache.

That was my impression. That's why I did not touch our 2/16 GB setting right away. I guess that 2 more gigabytes in the OS cache are better than 2 more (duplicated) gigabytes in PG shared_buffers. In our case 2 GB of shared_buffers appears to be enough to avoid thrashing between the OS and PG.

> *) shared_buffers is one of the _least_ important performance settings
> in postgresql.conf. Many settings, like work_mem, planner tweaks,
> commit settings, autovacuum settings [...]

Can you recommend any sources on these parameters, especially commit settings and planner tweaks?

Thank you so much for the whole answer! Not only does it address the immediate question, but also many of the unasked ones that I had in the back of my head. It's brief and gives a broad view over all the performance concerns. It should be part of the documentation or the first page of the performance wiki. Have you copied it from somewhere?

-- 
Konrad Garus
On Tue, May 25, 2010 at 5:58 AM, Konrad Garus <konrad.garus@gmail.com> wrote:
> Thank you so much for the whole answer! [...] It should be part of the
> documentation or the first page of the performance wiki. Have you
> copied it from somewhere?

Thank you for your nice comments. This was strictly a brain dump from yours truly. There is a fairly verbose guide on the wiki (http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server). There is a lot of good info there but it's missing a few things (from_collapse_limit, for example).

I would prefer to see the annotated performance oriented .conf settings written in terms of trade-offs (too low? X. too high? Y. setting in order to get? Z.). For example, did you know that if you crank max_locks_per_transaction you also increase the duration of every query that hits pg_locks -- well, now you do :-).

merlin
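P.S. if you want to see how big your lock table actually gets under load, a quick check is something like:

    -- current lock table contents by lock type and mode
    SELECT locktype, mode, count(*)
    FROM pg_locks
    GROUP BY locktype, mode
    ORDER BY count(*) DESC;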
2010/5/24 Konrad Garus <konrad.garus@gmail.com>:
> Our dedicated server has 16 GB RAM. Among other settings,
> shared_buffers is 2 GB and effective_cache_size is 12 GB.
>
> Do shared_buffers duplicate contents of the OS page cache? If so, how
> do I know if 25% of RAM is the right value for me? Actually it would
> not seem to be true - the less redundancy the better.

At the moment a block is requested for the first time (usually 8kb from postgres, so in fact 2 blocks in the OS), you have 'duplicate' buffers. But, depending on your workload, it is not so bad because those 2 blocks should not be requested again for some time (because they are in the postgresql shared buffers) and should be evicted by the OS in favor of new block requests.

Again it depends on your workload: if you have a case where you refresh the shared buffers a lot, then you will have more blocks in the 2 caches at the same time.

You can try the pgfincore extension to grab stats from the OS cache and/or patch postgresql if you want real stats ;)

pg_buffercache is provided with postgresql and delivers very useful information:
http://www.postgresql.org/docs/8.4/interactive/pgbuffercache.html

-- 
Cédric Villemain  2ndQuadrant
http://2ndQuadrant.fr/  PostgreSQL : Expertise, Formation et Support
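P.S. for example, this variant of the query from those docs shows which relations occupy the most shared_buffers in the current database:

    -- requires the pg_buffercache contrib module
    SELECT c.relname, count(*) AS buffers
    FROM pg_buffercache b
         JOIN pg_class c ON b.relfilenode = c.relfilenode
    WHERE b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 10;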
2010/5/26 Cédric Villemain <cedric.villemain.debian@gmail.com>:
> At the moment a block is requested for the first time (usually 8kb
> from postgres, so in fact 2 blocks in the OS), you have 'duplicate'
> buffers. But, depending on your workload, it is not so bad because
> those 2 blocks should not be requested again for some time (because
> they are in the postgresql shared buffers) and should be evicted by
> the OS in favor of new block requests.

Since the PG buffer cache is 4-8 times smaller, that would seem extremely rare to me. And when PG requests a block, it also needs to evict something from shared_buffers.

> You can try the pgfincore extension to grab stats from the OS cache
> and/or patch postgresql if you want real stats ;)

Thank you! It seems to be the tool I was looking for. It could help me locate and troubleshoot the hogs in the page cache. I also find the snapshot/restore function promising. Every morning our cache is cold or filled with irrelevant data left by nightly batch jobs, severely impacting performance. This seems to be exactly what the tool is for.

How does it work? How stable is it? Can we use it in production on a daily basis?

> pg_buffercache is provided with postgresql and delivers very useful
> information:
> http://www.postgresql.org/docs/8.4/interactive/pgbuffercache.html

Thank you. I already am using it. I've already found a few hogs with it.

-- 
Konrad Garus
2010/5/27 Konrad Garus <konrad.garus@gmail.com>:
> How does it work? How stable is it? Can we use it in production on a
> daily basis?

It works thanks to mincore/posix_fadvise stuff: you need Linux. It is stable enough in my own experiments. I did use it for debugging purposes in production servers with success.

BUT:

* snapshot/restore is done via a flat file (one per segment or table/index) and *it is not removed* when you drop a table.
* there may be corner cases not yet handled (like snapshotting a database, making changes like drop table or truncate table, then restoring)

It needs some polish to be totally production ready, but the job can be done.

-- 
Cédric Villemain  2ndQuadrant
http://2ndQuadrant.fr/  PostgreSQL : Expertise, Formation et Support
2010/5/27 Cédric Villemain <cedric.villemain.debian@gmail.com>:
> It works thanks to mincore/posix_fadvise stuff: you need Linux.
> It is stable enough in my own experiments. I did use it for debugging
> purposes in production servers with success.

What impact does it have on performance?

Does it do anything, is there any interaction between it and PG/OS, when it's not executing a command explicitly invoked by me?

-- 
Konrad Garus
2010/5/27 Konrad Garus <konrad.garus@gmail.com>:
> What impact does it have on performance?

pgmincore() and pgmincore_snapshot() are both able to mmap up to 1GB. I didn't measure a performance impact, but I haven't enough benchmarks/tests yet.

> Does it do anything, is there any interaction between it and PG/OS,
> when it's not executing a command explicitly invoked by me?

pgfincore does nothing until you call one of the functions. Reducing the mmap window is feasible, and I had started something to use effective_io_concurrency in order to improve prefetch (for restore), but this part of the code is not yet finished.

-- 
Cédric Villemain  2ndQuadrant
http://2ndQuadrant.fr/  PostgreSQL : Expertise, Formation et Support
2010/5/27 Cédric Villemain <cedric.villemain.debian@gmail.com>:
> pgmincore() and pgmincore_snapshot() are both able to mmap up to 1GB.

Does it mean they can occupy 1 GB of RAM? How does it relate to the amount of page buffers mapped by the OS?

-- 
Konrad Garus
2010/5/27 Konrad Garus <konrad.garus@gmail.com>:
> Does it mean they can occupy 1 GB of RAM? How does it relate to the
> amount of page buffers mapped by the OS?

Well, that is the projection of the file in memory; only a projection, but the memory is still acquired. It is OK to rework this part and project something like 128MB and loop. (In fact the code is needed for 9.0 because a segment can be > 1GB. I didn't check what the optimum projection size is yet.)

So yes to both of your questions :)

-- 
Cédric Villemain  2ndQuadrant
http://2ndQuadrant.fr/  PostgreSQL : Expertise, Formation et Support
2010/5/27 Cédric Villemain <cedric.villemain.debian@gmail.com>:
> Well, that is the projection of the file in memory; only a projection,
> but the memory is still acquired. It is OK to rework this part and
> project something like 128MB and loop.

So when I map 12 GB, this process will consume 1 GB and the time needed to browse through the whole 12 GB buffer?

-- 
Konrad Garus
2010/5/27 Konrad Garus <konrad.garus@gmail.com>:
> So when I map 12 GB, this process will consume 1 GB and the time
> needed to browse through the whole 12 GB buffer?

Exactly. And the time to browse depends on the number of blocks already in core memory. I am interested in test results and benchmarks if you are going to do some :)

-- 
Cédric Villemain  2ndQuadrant
http://2ndQuadrant.fr/  PostgreSQL : Expertise, Formation et Support
2010/5/27 Cédric Villemain <cedric.villemain.debian@gmail.com>:
> Exactly. And the time to browse depends on the number of blocks already
> in core memory. I am interested in test results and benchmarks if you
> are going to do some :)

I am still thinking about whether I want to do it on this prod machine. Maybe on something less critical first (but still with a good amount of memory mapped by page buffers).

What system have you tested it on? Has it ever run on a few-gig system? :-)

-- 
Konrad Garus
2010/5/28 Konrad Garus <konrad.garus@gmail.com>:
> What system have you tested it on? Has it ever run on a few-gig
> system? :-)

Databases up to 300GB for the stats purpose. The snapshot/restore was done for bases around 40-50GB, but with only 16GB of RAM.

I really think some improvements are possible before using it in production, even if it should work well as it is. At least something to remove the orphan snapshot files (in case of drop table or truncate). And probably increasing the quality of the code around the prefetch (better handling of effective_io_concurrency... the prefetch is linear, but block requests are grouped).

If you are able to test/bench on a pre-production env, do it :)

-- 
Cédric Villemain  2ndQuadrant
http://2ndQuadrant.fr/  PostgreSQL : Expertise, Formation et Support
Merlin Moncure wrote:
> I would prefer to see the annotated performance oriented .conf settings
> written in terms of trade-offs (too low? X. too high? Y. setting in
> order to get? Z.). For example, did you know that if you crank
> max_locks_per_transaction you also increase the duration of every query
> that hits pg_locks -- well, now you do :-).

You can't do this without providing more context and tools for people to measure their systems. At PGCon last week, I presented a talk specifically about tuning shared_buffers and the checkpoint settings. What's come out of my research there is that you can stare at the data in pg_buffercache and pg_stat_bgwriter and classify systems, based on the distribution of usage counts in their buffer cache, by how the background writer copes with that. The right style of tuning to apply is dependent on whether someone has a high proportion of buffers with a usage count >=2.

A tuning guide that actually covered that in enough detail to be an improvement over what is in "Tuning Your PostgreSQL Server" would be overwhelmingly large, defeating the purpose of that document--providing a fairly bite-size guide.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
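P.S. for anyone who wants to stare at their own system, the sort of query I mean is along these lines:

    -- usage count distribution of the buffer cache (needs pg_buffercache)
    SELECT usagecount, count(*) AS buffers, isdirty
    FROM pg_buffercache
    GROUP BY isdirty, usagecount
    ORDER BY isdirty, usagecount;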
On Fri, May 28, 2010 at 2:57 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> You can't do this without providing more context and tools for people
> to measure their systems. At PGCon last week, I presented a talk
> specifically about tuning shared_buffers and the checkpoint settings.
> [...] A tuning guide that actually covered that in enough detail to be
> an improvement over what is in "Tuning Your PostgreSQL Server" would be
> overwhelmingly large, defeating the purpose of that document--providing
> a fairly bite-size guide.

Sure. IOW, a .conf guide should answer:

*) What are the symptoms of misconfigured shared_buffers (too low/too high)?
*) How do you confirm this?

Our documentation is completely unclear wrt these questions, other than giving some vague advice in order to get 'good performance'. I'm of the opinion (rightly or wrongly) that the prevailing opinions on how to configure shared_buffers are based on special case benchmarking information or simply made up. The dangers of setting it too high are very real (especially on linux) but this isn't mentioned; contrast that with the care put into the fsync language. This is in the face of some prevailing myths (more shared_buffers = more cache = faster) that have some grains of truth but aren't the whole story. I just helped out a friend who oversubscribed and blew up his linux box... oom killer strikes again.

I'm not complaining here, mind you; I'd just like to filter out all the anecdotal information and similar noise. shared_buffers is a bit of a bugaboo because it is fairly subtle in how it interacts with production workloads and there is so little solid information out there. I would *love* to see some formal verifiable tests showing > 20% improvements resulting from shared_buffers tweaks that could be repeated on other hardware/os installations.

Got any notes for your talk? :-)

merlin
Merlin Moncure wrote:
> I'm of the opinion (rightly or wrongly) that the prevailing opinions
> on how to configure shared_buffers are based on special case
> benchmarking information or simply made up.

Well, you're wrong, but it's OK; we'll forgive you this time. It's true that a lot of the earlier advice here wasn't very well tested across multiple systems. I have a stack of data supporting the claim that the anecdotal guidelines are in the right ballpark now though, most of which is protected by NDA. If you look at the spreadsheet at http://www.pgcon.org/2010/schedule/events/218.en.html you'll see three examples I was able to liberate for public consumption, due to some contributions by list regulars here (I'm working on a fourth right now).

The first one has shared_buffers set at 8GB on a 96GB server, at the upper limit of where it's useful, and the database is using every bit of that more effectively than had it been given to the OS to manage. (I'm starting to get access to test hardware to investigate why there's an upper-limit wall around 10GB for shared_buffers, too.) The other two are smaller systems, and they don't benefit nearly as much from giving the database memory, given their workload. Basically it comes down to two things:

1) Are you getting a lot of buffers where the usage count is >=3? If not, you can probably reduce shared_buffers and see better performance. This is not the case with the first system shown, but is true on the second and third.

2) Is the average size of the checkpoints too large? If so, you might have to reduce shared_buffers in order to pull that down. Might even need to pull down the checkpoint parameters too.

Workloads that don't like the database to have RAM certainly exist, but there are just as many that appreciate every bit of memory you dedicate to it.

With all the tuning work I've been doing the last few months, the only thing I've realized the standard guidelines (as embodied by pgtune and the wiki pages) get wrong is work_mem. You have to be much more careful with that than what pgtune in particular suggests. The rest of the rules of thumb pgtune is based on and "Tuning your PostgreSQL Server" suggests are not bad.

Accomplishing a major advance over the current state of things really needs some captures of real application load from a production system of both major types that we can play back, to give something more realistic than one of the boring benchmark loads. Dimitri Fontaine is working on some neat tools in that area, and now that we're working together more closely I'm hoping we can push that work forward further. That's the real limiting factor here now: assembling repeatable load testing that looks like an application rather than a benchmark.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
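P.S. for (2), a rough way to estimate the average checkpoint size from pg_stat_bgwriter (counters are cumulative since the last stats reset):

    -- average data written per checkpoint, assuming 8kB pages
    SELECT pg_size_pretty(buffers_checkpoint * 8192 /
               nullif(checkpoints_timed + checkpoints_req, 0))
           AS avg_checkpoint_write
    FROM pg_stat_bgwriter;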
On Mon, May 24, 2010 at 12:25 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> *) shared_buffers is one of the _least_ important performance settings
> in postgresql.conf

Yes, and no. It's usually REALLY helpful to make sure it's more than 8 or 24Megs. But it doesn't generally need to be huge to make a difference.
If, like me, you came from the Oracle world, you may be tempted to throw a ton of RAM at this. Don't. PG does not like it.
On Fri, May 28, 2010 at 4:11 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, May 24, 2010 at 12:25 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> *) shared_buffers is one of the _least_ important performance settings
>> in postgresql.conf
>
> Yes, and no. It's usually REALLY helpful to make sure it's more than
> 8 or 24Megs. But it doesn't generally need to be huge to make a
> difference.
On Fri, May 28, 2010 at 5:02 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Well, you're wrong, but it's OK; we'll forgive you this time. It's
> true that a lot of the earlier advice here wasn't very well tested
> across multiple systems. I have a stack of data supporting the claim
> that the anecdotal guidelines are in the right ballpark now though,
> most of which is protected by NDA.
> [...]
> Accomplishing a major advance over the current state of things really
> needs some captures of real application load from a production system
> of both major types that we can play back, to give something more
> realistic than one of the boring benchmark loads. That's the real
> limiting factor here now: assembling repeatable load testing that
> looks like an application rather than a benchmark.

This is great information -- exactly the kind of research I'm talking about. btw, I like being proved wrong! :-)

I need some time to process this.

merlin