Thread: shared_buffers 8GB maximum
Hi All, I've seen the shared_buffers 8GB maximum recommendation repeated many times. I have several questions in this regard. - Is this recommendation still true for recent versions of postgres? (e.g. wasn't it the case only for really old versions where the locks on shared buffers worked much less efficiently) - I'm not a huge Linux expert, but I've heard someone saying that reading from the filesystem cache requires a context switch. I suspect that such reads are slightly more expensive now after the Meltdown/Spectre patch in the kernel. Could that be a reason for increasing the value of shared_buffers? - Could shared_buffers=128GB or more on a 250 GB RAM server be a reasonable setting? What downsides could there be? PS. Some background. We had shared_buffers=8GB initially. In pg_stat_bgwriter we saw that dirty buffers were written to disk more frequently by backends than during checkpoints (buffers_clean > buffers_checkpoint, and buffers_backend > buffers_checkpoint). According to pg_buffercache extension, there was very small percentage of dirty pages in shared buffers. The percentage of pages with usagecount >= 3 was also low. Some of our more frequently used tables and indexes are more than 10 GB in size. This all suggested that probably the bigger tables and indexes, whenever scanned, are constantly flushing pages from the shared buffers area. After increasing shared_buffers to 32GB, the picture started looking healthier. There were 1GB+ of dirty pages in shared buffers (vs 40-200MB before), 30-60% of pages with usagecount >= 3 (vs 10-40% before), buffers_checkpoint started to grow faster than buffers_clean or buffers_backend. There is still not all frequently used data fits in shared_buffers, so we're considering to increase the parameter more. I wanted to have some idea about how big it could reasonably be. PPS. I know any possible answer might be just a recommendation, and parameter values should be tested for each specific case, but still wanted to hear your opinion. Thanks. Regards, Vitaliy
On Sat, 17 Feb 2018 00:36:57 +0200, Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote: >- I'm not a huge Linux expert, but I've heard someone saying that >reading from the filesystem cache requires a context switch. Yes. >I suspect >that such reads are slightly more expensive now after the >Meltdown/Spectre patch in the kernel. Not necessarily - it depends on exactly what was changed ... which unfortunately I don't know for certain. Any filesystem call is a kernel transition. That's a Meltdown issue. Meltdown can be avoided by using trampoline functions to call the (real) kernel functions and isolating each trampoline so that no other code immediately follows it. This wastes some memory but there is very little added time cost. Spectre is about snooping within the user space of a single process - it has nothing to do with kernel calls. The issues with Spectre are things like untrusted code breaking out of "sandboxes", snooping on password handling or encryption, etc. Fixing Spectre requires purposefully limiting speculative execution of code and can significantly affect performance. But the effects are situation dependent. >Could that be a reason for increasing the value of shared_buffers? > >- Could shared_buffers=128GB or more on a 250 GB RAM server be a >reasonable setting? What downsides could there be? It depends. 8GB is pretty small for such a large server, but taking 1/2 the RAM is not necessarily the right thing either. The size of shared buffers affects log size and the time to complete checkpoints. If a large(ish) percentage of your workload is writes, having a very large shared space could be bad for performance, or bad for space on the log device. Another reason may be that the server is not dedicated to PG but does other things as well. Dramatically increasing PG's memory use may negatively impact something else. >PS. Some background. We had shared_buffers=8GB initially. In >pg_stat_bgwriter we saw that dirty buffers were written to disk more >frequently by backends than during checkpoints (buffers_clean > >buffers_checkpoint, and buffers_backend > buffers_checkpoint). According >to pg_buffercache extension, there was very small percentage of dirty >pages in shared buffers. The percentage of pages with usagecount >= 3 >was also low. Some of our more frequently used tables and indexes are >more than 10 GB in size. This all suggested that probably the bigger >tables and indexes, whenever scanned, are constantly flushing pages from >the shared buffers area. After increasing shared_buffers to 32GB, the >picture started looking healthier. There were 1GB+ of dirty pages in >shared buffers (vs 40-200MB before), 30-60% of pages with usagecount >= >3 (vs 10-40% before), buffers_checkpoint started to grow faster than >buffers_clean or buffers_backend. There is still not all frequently used >data fits in shared_buffers, so we're considering to increase the >parameter more. I wanted to have some idea about how big it could >reasonably be. So now you know that 32GB is better for your workload than 8GB. But that is not necessarily a reason immediately to go crazy with it. Try increasing it gradually - e.g., adding 16GB at a time - and see if the additional shared space provides any real benefit. >PPS. I know any possible answer might be just a recommendation, and >parameter values should be tested for each specific case, but still >wanted to hear your opinion. Thanks. > >Regards, >Vitaliy George
On 02/17/2018 02:56 AM, George Neuner wrote: > On Sat, 17 Feb 2018 00:36:57 +0200, Vitaliy Garnashevich > <vgarnashevich@gmail.com> wrote: > ... > >> Could that be a reason for increasing the value of shared_buffers? >> >> - Could shared_buffers=128GB or more on a 250 GB RAM server be a >> reasonable setting? What downsides could there be? > > It depends. 8GB is pretty small for such a large server, but taking > 1/2 the RAM is not necessarily the right thing either. > I certainly wouldn't recommend using 1/2 of RAM right away. There's a good chance it would be a waste of memory - for example due to double buffering, which effectively reduces "total" cache hit ratio. Start with lower value, increment it gradually and monitor behavior of the server. > The size of shared buffers affects log size and the time to complete > checkpoints. If a large(ish) percentage of your workload is writes, > having a very large shared space could be bad for performance, or > bad for space on the log device. > The size of shared_buffers has pretty much no impact on the size of the WAL - that's flat out wrong. It also does not affect the time needed to perform a checkpoint. It may mean that the checkpoint has to write more dirty buffers, but that is actually a good thing because a checkpoint is about the most efficient way to do writes. By using smaller shared buffers you're making it more likely the database has to evict (dirty) buffers from shared buffers to make space for other buffers needed by queries/vacuum/whatever. Those evictions are performed either by backends or bgwriter, both of which are less efficient than checkpointer. Not only can checkpointer perform various optimizations (e.g. sorting buffers to make the writes more sequential), but it also writes each dirty buffer just once. With smaller shared_buffers the page may have be written multiple times. What actually *does* matter is the active part of the data set, i.e. the part of the data that is actually accessed regularly. In other words, your goal is to achieve good cache hit ratio - say, 95% or more. This also helps reducing the number of syscalls (when reading data from page cache). What is the right shared_buffers size? I have no idea, as it's very dependent on the application. It might be 1GB or 100GB, hard to say. The best thing you can do is set shared buffers to some conservative value (say, 4-8GB), let the system run for a day or two, compute the cache hit ratio using metrics in pg_stat_database, and then decide if you need to resize shared buffers. >> PS. Some background. We had shared_buffers=8GB initially. In >> pg_stat_bgwriter we saw that dirty buffers were written to disk more >> frequently by backends than during checkpoints (buffers_clean > >> buffers_checkpoint, and buffers_backend > buffers_checkpoint). According >> to pg_buffercache extension, there was very small percentage of dirty >> pages in shared buffers. The percentage of pages with usagecount >= 3 >> was also low. Some of our more frequently used tables and indexes are >> more than 10 GB in size. This all suggested that probably the bigger >> tables and indexes, whenever scanned, are constantly flushing pages from >> the shared buffers area. After increasing shared_buffers to 32GB, the >> picture started looking healthier. There were 1GB+ of dirty pages in >> shared buffers (vs 40-200MB before), 30-60% of pages with usagecount >= >> 3 (vs 10-40% before), buffers_checkpoint started to grow faster than >> buffers_clean or buffers_backend. There is still not all frequently used >> data fits in shared_buffers, so we're considering to increase the >> parameter more. I wanted to have some idea about how big it could >> reasonably be. > > So now you know that 32GB is better for your workload than 8GB. But > that is not necessarily a reason immediately to go crazy with it. Try > increasing it gradually - e.g., adding 16GB at a time - and see if the > additional shared space provides any real benefit. > Gradual increases are a good approach in general. And yes, having buffers_checkpoint > buffers_clean > buffers_backend is a good idea too. Together with the cache hit ratio it's probably a more sensible metric than looking at usagecount directly. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Feb 16, 2018 at 2:36 PM, Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
Hi All,
I've seen the shared_buffers 8GB maximum recommendation repeated many times. I have several questions in this regard.
- Is this recommendation still true for recent versions of postgres? (e.g. wasn't it the case only for really old versions where the locks on shared buffers worked much less efficiently)
There were improvements in 9.3 around things like cleaning the buffer pool when tables were dropped or truncated, particular when many were dropped or truncated in the same transaction. This reduced a major penalty for very large shared_buffers, but did not reduce it to zero. The clock-sweep method for buffer eviction was made lockless using atomics in 9.5, but I think that was more about concurrency than size of shared_buffers.
- I'm not a huge Linux expert, but I've heard someone saying that reading from the filesystem cache requires a context switch. I suspect that such reads are slightly more expensive now after the Meltdown/Spectre patch in the kernel. Could that be a reason for increasing the value of shared_buffers?
Yes. I don't know the exact reason, but reading a buffer from OS cache is quite a bit more expensive than just pinning a buffer already in the buffer_pool, about 5 times more expensive the last time I tested it, which was before Meltdown. (And just pinning a buffer which is already in the cache is already pretty expensive--about 15 times as expensive as reading the next tuple from an already-pinned buffer).
- Could shared_buffers=128GB or more on a 250 GB RAM server be a reasonable setting? What downsides could there be?
The worst side effect I have from large shared_buffers in recent versions (other than swapping to death when you don't have the RAM to support it) is a pathological use case in which someone creates a table, and then in the same transaction keeps starting COPY to insert a small number of rows and then ending the COPY. If the COPY decides to skip wal logging for that table (because it was created in the same transaction and so on a crash the table will not exist anymore) then it needs to scrub the shared_buffers for every COPY end, which is slow with large shared_buffers.
You could also worry that the OS won't have enough memory left in its own cache with which to buffer dirty buffers and re-order or combine writes for more efficient writing to disk. But in my experience, the kernel is horrible at this anyway and if this is important to you it is better to let PostgreSQL have the RAM so that it can do it.
PS. Some background. We had shared_buffers=8GB initially. In pg_stat_bgwriter we saw that dirty buffers were written to disk more frequently by backends than during checkpoints (buffers_clean > buffers_checkpoint, and buffers_backend > buffers_checkpoint).
I don't think that there is any reason to think that buffers_clean > buffers_checkpoint is a problem. In fact, you could argue that it is the way it was designed to work. Although the background writer does need to tell the checkpointer about every file it dirties, so it can be fsynced at the end of the checkpoint. The overhead of this was minimal in my testing.
But buffers_backend > buffers_checkpoint could be a problem, especially if they are also much larger than buffers_clean. But the wrinkle here is that if you do bulk inserts or bulk updates (what about vacuums?), the backends by design write their own dirty buffers. So if you do those kinds of things, buffers_backend being large doesn't indicate much. There was a patch someplace a while ago to separate the counters of backend-intentional writes from backend-no-choice writes, but it never went anywhere.
According to pg_buffercache extension, there was very small percentage of dirty pages in shared buffers. The percentage of pages with usagecount >= 3 was also low. Some of our more frequently used tables and indexes are more than 10 GB in size. This all suggested that probably the bigger tables and indexes, whenever scanned, are constantly flushing pages from the shared buffers area. After increasing shared_buffers to 32GB, the picture started looking healthier. There were 1GB+ of dirty pages in shared buffers (vs 40-200MB before), 30-60% of pages with usagecount >= 3 (vs 10-40% before), buffers_checkpoint started to grow faster than buffers_clean or buffers_backend.
It is not clear to me that this is the best way to measure health. Did your response time go down? Did your throughput go up?
There is still not all frequently used data fits in shared_buffers, so we're considering to increase the parameter more. I wanted to have some idea about how big it could reasonably be.
I've made it 95% of machine RAM in specialized cases. Checkpoints were extremely traumatic, but that was simply because increasing shared_buffers allowed the throughput to go up so much that the IO subsystem couldn't cope.
For many use cases, 50% of RAM is the pessimal size. Because much of the OS cache simply becomes a copy of the shared_buffers, cutting our effective RAM size. So I think you want shared_buffers to be either a smallish fraction of RAM, so that the OS cache is the primary cache and shared_buffers just holds the hottest pages and holds the dirty pages which can't be written without a fsync of WAL. Or a largish fraction, so that OS cache is little but a staging area to get data into and out of shared_buffers. Of these two extremes, I think the smallish fraction is the safest and more general approach.
Cheers,
Jeff
> Not necessarily - it depends on exactly what was changed ... which > unfortunately I don't know for certain. > > Any filesystem call is a kernel transition. That's a Meltdown issue. > Meltdown can be avoided by using trampoline functions to call the > (real) kernel functions and isolating each trampoline so that no other > code immediately follows it. This wastes some memory but there is > very little added time cost. > > > Spectre is about snooping within the user space of a single process - > it has nothing to do with kernel calls. The issues with Spectre are > things like untrusted code breaking out of "sandboxes", snooping on > password handling or encryption, etc. > > Fixing Spectre requires purposefully limiting speculative execution of > code and can significantly affect performance. But the effects are > situation dependent. > I don't know the details either. But one of proposed fixes was to flush CPU caches after doing system calls. That's the reason why I'm asking. > So now you know that 32GB is better for your workload than 8GB. But > that is not necessarily a reason immediately to go crazy with it. Try > increasing it gradually - e.g., adding 16GB at a time - and see if the > additional shared space provides any real benefit. That's what we're going to do. Thanks! Regards, Vitaliy
> I certainly wouldn't recommend using 1/2 of RAM right away. There's a > good chance it would be a waste of memory - for example due to double > buffering, which effectively reduces "total" cache hit ratio. Double buffering is often mentioned in context of tuning shared buffers. Is there a tool to actually measure the amount of double buffering happening in the system? > Those evictions are performed either by backends or bgwriter, both of > which are less efficient than checkpointer. Not only can checkpointer > perform various optimizations (e.g. sorting buffers to make the writes > more sequential), but it also writes each dirty buffer just once. With > smaller shared_buffers the page may have be written multiple times. In the case when shared_buffers cover most of RAM, most of writes should happen by checkpointer, and cache hit ratio should be high. So a hypothetical question: Could shared_buffers=200GB on a 250 GB RAM server ever be a reasonable setting? (assuming there are no other applications running except postgres, and 50GB is enough for allocating work_mem/maintenance_work_mem and for serving queries) > The best thing you can do is set shared buffers to some conservative > value (say, 4-8GB), let the system run for a day or two, compute the > cache hit ratio using metrics in pg_stat_database, and then decide if > you need to resize shared buffers. > > Gradual increases are a good approach in general. And yes, having > > buffers_checkpoint > buffers_clean > buffers_backend > > is a good idea too. Together with the cache hit ratio it's probably a > more sensible metric than looking at usagecount directly. Thanks! While increasing shared_buffers we'll be looking at changes in cache hit ratio too. Regards, Vitaliy
2018-02-18 14:41 GMT+01:00 Vitaliy Garnashevich <vgarnashevich@gmail.com>:
I certainly wouldn't recommend using 1/2 of RAM right away. There's a
good chance it would be a waste of memory - for example due to double
buffering, which effectively reduces "total" cache hit ratio.
Double buffering is often mentioned in context of tuning shared buffers. Is there a tool to actually measure the amount of double buffering happening in the system?Those evictions are performed either by backends or bgwriter, both of
which are less efficient than checkpointer. Not only can checkpointer
perform various optimizations (e.g. sorting buffers to make the writes
more sequential), but it also writes each dirty buffer just once. With
smaller shared_buffers the page may have be written multiple times.
In the case when shared_buffers cover most of RAM, most of writes should happen by checkpointer, and cache hit ratio should be high. So a hypothetical question: Could shared_buffers=200GB on a 250 GB RAM server ever be a reasonable setting? (assuming there are no other applications running except postgres, and 50GB is enough for allocating work_mem/maintenance_work_mem and for serving queries)The best thing you can do is set shared buffers to some conservative
value (say, 4-8GB), let the system run for a day or two, compute the
cache hit ratio using metrics in pg_stat_database, and then decide if
you need to resize shared buffers.
Gradual increases are a good approach in general. And yes, having
buffers_checkpoint > buffers_clean > buffers_backend
is a good idea too. Together with the cache hit ratio it's probably a
more sensible metric than looking at usagecount directly.
Thanks! While increasing shared_buffers we'll be looking at changes in cache hit ratio too.
When we did calculation of some analytic tasks, then increasing shared_buffers had negative impact on speed. Probably hit ration was too low after change, but the maintenance of shared buffers (searching free blocks) was slower.
So optimal size of SB depends on use case too much - note -- too big SB means small work mem what can be worse .. work_mem must be multiplied by max_connection and by some constant .. 2 or 3.
Regards
Pavel
Regards,
Vitaliy
On 02/18/2018 02:41 PM, Vitaliy Garnashevich wrote: > >> I certainly wouldn't recommend using 1/2 of RAM right away. There's >> a good chance it would be a waste of memory - for example due to >> double buffering, which effectively reduces "total" cache hit >> ratio. > > Double buffering is often mentioned in context of tuning shared > buffers. Is there a tool to actually measure the amount of double > buffering happening in the system? > I'm not aware of such tool. But I suppose it could be done by integrating information from pg_buffercache and pgfincore [1]. [1] https://github.com/klando/pgfincore >> Those evictions are performed either by backends or bgwriter, both >> of which are less efficient than checkpointer. Not only can >> checkpointer perform various optimizations (e.g. sorting buffers to >> make the writes more sequential), but it also writes each dirty >> buffer just once. With smaller shared_buffers the page may have be >> written multiple times. > > In the case when shared_buffers cover most of RAM, most of writes > should happen by checkpointer, and cache hit ratio should be high. So > a hypothetical question: Could shared_buffers=200GB on a 250 GB RAM > server ever be a reasonable setting? (assuming there are no other > applications running except postgres, and 50GB is enough for > allocating work_mem/maintenance_work_mem and for serving queries) > It depends on how large is the active part of the data set is. If it fits into 200GB but not to smaller shared buffers (say, 100GB), then using 200GB may be a win. If the active set is much larger than RAM, smaller shared_buffer values work better in my experience. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Feb 18, 2018 at 7:41 AM, Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
In the case when shared_buffers cover most of RAM, most of writes should happen by checkpointer, and cache hit ratio should be high. So a hypothetical question: Could shared_buffers=200GB on a 250 GB RAM server ever be a reasonable setting? (assuming there are no other applications running except postgres, and 50GB is enough for allocating work_mem/maintenance_work_mem and for serving queries)
That amount of shared buffers is not sensible. I found on a 256 GB box that anything over about 50-100GB was counter productive. That was a FreeBSD system where I ran the database on top of ZFS, so there was a lot of data also stored in the ARC (memory cache). There is a setting in postgres to tell it how much RAM your system is using for the disk cache, so set that to a fair estimate of how much your system will use. I set mine to 50% of RAM. I did not limit the cache at the OS level since it is good about giving up that memory for the needs of the running processes.
Yes. I don't know the exact reason, but reading a buffer from OS cache is quite a bit more expensive than just pinning a buffer already in the buffer_pool, about 5 times more expensive the last time I tested it, which was before Meltdown. (And just pinning a buffer which is already in the cache is already pretty expensive--about 15 times as expensive as reading the next tuple from an already-pinned buffer).
Thanks for the numbers. Just out of curiosity, do you happen to know how much more expensive compared to that a read from disk is? And also, how much the pinning can be slowed down, when having to iterate using the clock-sweep method over large shared_buffers?
I don't think that there is any reason to think that buffers_clean > buffers_checkpoint is a problem. In fact, you could argue that it is the way it was designed to work. Although the background writer does need to tell the checkpointer about every file it dirties, so it can be fsynced at the end of the checkpoint. The overhead of this was minimal in my testing.
The reason why I mentioned buffers_clean is because I was assuming that under "healthy" conditions, most writes should be done by checkpointer, because, as it was already mentioned, that's the most efficient way of writing (no duplicate writes of the same buffer, write optimizations etc.). I was thinking about bgwriter as a way of reducing latency by avoiding the case when a backend has to write buffers by itself. So that would mean that big numbers in buffers_clean and buffers_backend compared to buffers_checkpoint, would mean that a lot of writes are done not by checkpointer, and thus probably less efficiently than they could be. That might have resulted in IO writes being more random, and more IO writes done in general, because same buffer can be written multiple times between checkpoints.
But buffers_backend > buffers_checkpoint could be a problem, especially if they are also much larger than buffers_clean. But the wrinkle here is that if you do bulk inserts or bulk updates (what about vacuums?), the backends by design write their own dirty buffers. So if you do those kinds of things, buffers_backend being large doesn't indicate much. There was a patch someplace a while ago to separate the counters of backend-intentional writes from backend-no-choice writes, but it never went anywhere.
We do daily manual vacuuming. Knowing what part of total writes is accounted for them indeed would be nice.
When looking at buffers_checkpoint/buffers_clean/buffers_backend, I was saving the numbers with several hours interval, knowing that there are no vacuums running at that time, and calculated the difference.
It is not clear to me that this is the best way to measure health. Did your response time go down? Did your throughput go up?
We have mixed type of DB usage. There is OLTP-like part with many small read/write transactions. Predictable latency does not matter in that case, but throughput does, because that is basically a background data loading job. Then there is an OLAP-like part when heavier report queries are being run. Then there are more background jobs which are a combination of both, which at first run long queries and then do lots of small inserts, thus pre-calculating some data for bigger reports.
After increasing shared_buffers 8GB -> 64GB, there was 7% improvement in run time of the background pre-calculating job (measured by running several times in a row, and caches are hot).
When we configured hugepages for the bigger shared_buffers, the additional improvement was around 3%.
Regards,
Vitaliy
When we did calculation of some analytic tasks, then increasing shared_buffers had negative impact on speed. Probably hit ration was too low after change, but the maintenance of shared buffers (searching free blocks) was slower.
What was the size of shared buffers when slowdown happened (approximately)?
Regards,
Vitaliy