Thread: Swappiness setting on a linux pg server
I just came to think about /proc/sys/vm/swappiness ...

When this one is set to a high number (say, 100 - which is the maximum), the
kernel will aggressively swap out all memory that is not being accessed, to
allow more memory for caches. For a postgres server, OS caches are good,
because postgres relies on the OS to cache indices, etc. On the other hand,
for any production server it's very bad to experience bursts of iowait
when/if the swapped-out memory becomes needed - particularly if the server
is used for interactive queries, like serving a web application.

I know there is much religion on this topic in general; I'm just curious
whether anyone has given the swappiness setting serious thought or (even
better!) experimented with it on a loaded postgres server.

I would assume that the default setting (60) is pretty OK and sane, and that
modifying it would have an insignificant effect. My religious belief is
that, however insignificant, a higher setting would have been better :-)

We're running Linux kernel 2.6.17.7 (Debian) on the postgres server, and our
memory stats look like this:

                       total    used    free  shared  buffers  cached
  Mem:                 6083M   5846M    236M       0      31M   5448M
  -/+ buffers/cache:            366M   5716M
  Swap:                2643M      2M   2640M

In addition to the postgres server we're running a few cron scripts and
miscellaneous things on it - nothing significant, though.
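For reference, the knob can be inspected and changed at runtime roughly like
this (just a sketch - the value 80 below is an example, not a recommendation,
and the change is lost on reboot unless it is also put into /etc/sysctl.conf):

  # show the current value (the kernel default is 60)
  cat /proc/sys/vm/swappiness

  # try another value at runtime (as root)
  sysctl -w vm.swappiness=80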
On Thu, Oct 19, 2006 at 03:54:28PM +0200, Tobias Brox wrote:
> I just came to think about /proc/sys/vm/swappiness ...
>
> When this one is set to a high number (say, 100 - which is the maximum),
> the kernel will aggressively swap out all memory that is not being
> accessed, to allow more memory for caches. For a postgres server, OS
> caches are good, because postgres relies on the OS to cache indices, etc.
> On the other hand, for any production server it's very bad to experience
> bursts of iowait when/if the swapped-out memory becomes needed -
> particularly if the server is used for interactive queries, like serving
> a web application.
>
> I know there is much religion on this topic in general; I'm just curious
> whether anyone has given the swappiness setting serious thought or (even
> better!) experimented with it on a loaded postgres server.

I think it'd be much better to experiment with using much larger
shared_buffers settings. The conventional wisdom there is from 7.x days,
when you really didn't want a large buffer, but that doesn't really apply
with the new buffer management we got in 8.0. I know of one site that
doubled their performance by setting shared_buffers to 50% of memory.

Something else to consider is that many people will put pg_xlog on the same
drives as the OS (and swap). It's pretty important that those drives not
have much activity other than pg_xlog, so any swap activity would have an
even larger than normal impact on performance.

> I would assume that the default setting (60) is pretty OK and sane, and
> that modifying it would have an insignificant effect. My religious belief
> is that, however insignificant, a higher setting would have been better :-)
>
> We're running Linux kernel 2.6.17.7 (Debian) on the postgres server, and
> our memory stats look like this:
>
>                        total    used    free  shared  buffers  cached
>   Mem:                 6083M   5846M    236M       0      31M   5448M
>   -/+ buffers/cache:            366M   5716M
>   Swap:                2643M      2M   2640M
>
> In addition to the postgres server we're running a few cron scripts and
> miscellaneous things on it - nothing significant, though.
--
Jim Nasby                    jim@nasby.net
EnterpriseDB                 http://enterprisedb.com    512.569.9461 (cell)
[Jim C. Nasby - Thu at 10:28:31AM -0500]
> I think it'd be much better to experiment with using much larger
> shared_buffers settings. The conventional wisdom there is from 7.x days,
> when you really didn't want a large buffer, but that doesn't really apply
> with the new buffer management we got in 8.0. I know of one site that
> doubled their performance by setting shared_buffers to 50% of memory.

Oh, that's interesting. I will give it a shot. Our config is "inherited"
from the 7.x days, so we have a fairly low setting compared to the available
memory. The logic back then was something like "a lot of careful thought
has been given to the design of the OS cache/buffer subsystem; we don't
really want to reinvent the wheel".

Sadly it's not easy to measure the overall performance impact of such
tunings in a production environment, so a setting like this tends to be
tuned by religion rather than science :-)

> Something else to consider is that many people will put pg_xlog on the
> same drives as the OS (and swap). It's pretty important that those drives
> not have much activity other than pg_xlog, so any swap activity would
> have an even larger than normal impact on performance.

Hm ... that's actually our current setup; we placed the postgres database
itself on a separate disk, not the xlog. So we should have done it the
other way around? No wonder the performance is badly affected by backups
etc ... (if we do swap them around, I guess it's the usual symlink trick -
see the sketch at the end of this mail)

Anyway, I gave the swappiness a second thought, compared to our actual
memory usage statistics ... turning the swappiness down would have no
significant effect, since we're only using 2M of swap (hardly significant),
and our total memory usage by applications (including the pg shared
buffers) is less than 400M out of 6G. Maybe we could have moved some 50M of
this to swap, but that's not really significant compared to our 6G of
memory.
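For the xlog move, I imagine it would be something roughly like this (the
paths are guesses for our Debian install, and of course the server has to
be stopped, and a fresh backup taken, first):

  # 1. stop the server (via the init script or pg_ctl)
  # 2. move the WAL directory to the dedicated spindles and leave a symlink:
  mv /var/lib/postgresql/8.1/main/pg_xlog /wal-disk/pg_xlog
  ln -s /wal-disk/pg_xlog /var/lib/postgresql/8.1/main/pg_xlog
  # 3. start the server again and check that it comes up cleanly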
On Thu, Oct 19, 2006 at 06:00:54PM +0200, Tobias Brox wrote:
> [Jim C. Nasby - Thu at 10:28:31AM -0500]
> > I think it'd be much better to experiment with using much larger
> > shared_buffers settings. The conventional wisdom there is from 7.x
> > days, when you really didn't want a large buffer, but that doesn't
> > really apply with the new buffer management we got in 8.0. I know of
> > one site that doubled their performance by setting shared_buffers to
> > 50% of memory.
>
> Oh, that's interesting. I will give it a shot. Our config is "inherited"
> from the 7.x days, so we have a fairly low setting compared to the
> available memory. The logic back then was something like "a lot of
> careful thought has been given to the design of the OS cache/buffer
> subsystem; we don't really want to reinvent the wheel".

Yeah, test setups are a good thing to have...

> Sadly it's not easy to measure the overall performance impact of such
> tunings in a production environment, so a setting like this tends to be
> tuned by religion rather than science :-)
>
> > Something else to consider is that many people will put pg_xlog on the
> > same drives as the OS (and swap). It's pretty important that those
> > drives not have much activity other than pg_xlog, so any swap activity
> > would have an even larger than normal impact on performance.
>
> Hm ... that's actually our current setup; we placed the postgres database
> itself on a separate disk, not the xlog. So we should have done it the
> other way around? No wonder the performance is badly affected by backups
> etc ...

Well, typically I see setups where people dedicate say 6 drives to the data
and 2 drives for the OS and pg_xlog. The issue with pg_xlog is you don't
need bandwidth... you need super-low latency. The best way to accomplish
that is to get a battery-backed RAID controller that you can enable write
caching on. In fact, if the controller is good enough, you can
theoretically get away with just building one big RAID10 and letting the
controller provide the low-latency fsyncs that pg_xlog depends on.
--
Jim Nasby                    jim@nasby.net
EnterpriseDB                 http://enterprisedb.com    512.569.9461 (cell)
[Jim C. Nasby - Thu at 10:28:31AM -0500]
> I think it'd be much better to experiment with using much larger
> shared_buffers settings. The conventional wisdom there is from 7.x days,
> when you really didn't want a large buffer, but that doesn't really apply
> with the new buffer management we got in 8.0. I know of one site that
> doubled their performance by setting shared_buffers to 50% of memory.

I've upped it a bit, but it requires a server restart for the new setting
to take effect, and that is relatively "expensive" for us. Does anyone else
share Nasby's viewpoint, and does anyone have a recommendation for a good
value? Our previous value was 200M, and I don't want to go to the extremes
just yet. We have 6 GB of memory in total.
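For the record, the change itself is trivial - roughly something like this
(the value is just what I happened to try, not a recommendation; note that
releases before 8.2 take a count of 8 kB buffers rather than a size):

  # what the running server actually uses right now
  psql -c "SHOW shared_buffers;"

  # in postgresql.conf (Debian typically keeps it under
  # /etc/postgresql/<version>/main/), e.g.:
  #   shared_buffers = 65536    # 65536 x 8 kB buffers = 512 MB on pre-8.2;
  #                             # 8.2+ also accepts values like '512MB'

  # then restart the server (the setting cannot be changed on the fly)
  # and re-check with SHOW shared_buffers afterwards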
[Jim C. Nasby - Thu at 11:31:26AM -0500]
> Yeah, test setups are a good thing to have...

We would need to replicate the production traffic as well to do reliable
tests. Well, we'll get to that one day ...

> The issue with pg_xlog is you don't need bandwidth... you need super-low
> latency. The best way to accomplish that is to get a battery-backed RAID
> controller that you can enable write caching on.

Sounds a bit risky to me :-)
On Thu, Oct 19, 2006 at 06:39:22PM +0200, Tobias Brox wrote:
> [Jim C. Nasby - Thu at 11:31:26AM -0500]
> > Yeah, test setups are a good thing to have...
>
> We would need to replicate the production traffic as well to do reliable
> tests. Well, we'll get to that one day ...

Marginally reliable tests are usually better than none at all. :)

> > The issue with pg_xlog is you don't need bandwidth... you need
> > super-low latency. The best way to accomplish that is to get a
> > battery-backed RAID controller that you can enable write caching on.
>
> Sounds a bit risky to me :-)

Well, you do need to understand what happens if the machine does lose
power... namely you have a limited amount of time to get power back to the
machine so that the controller can flush that data out. Other than that,
it's not very risky.

As for shared_buffers, conventional wisdom has been to use between 10% and
25% of memory, tending towards the lower end as you get into larger
quantities of memory. So in your case, 600M wouldn't be pushing things much
at all. Even 1G wouldn't be that out of the ordinary. Also remember that
the more memory for shared_buffers, the less for sorting/hashes/etc.
(work_mem)
--
Jim Nasby                    jim@nasby.net
EnterpriseDB                 http://enterprisedb.com    512.569.9461 (cell)
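(For a 6 GB box like the one in this thread, that rule of thumb works out
roughly as below - MB figures plus, for pre-8.2 releases, the equivalent
count of 8 kB buffers; purely illustrative:)

  echo $(( 6144 * 10 / 100 ))   # 614  -> ~10% of 6144 MB, about  78600 8 kB buffers
  echo $(( 6144 * 25 / 100 ))   # 1536 ->  25% of 6144 MB, exactly 196608 8 kB buffers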
[Jim C. Nasby - Thu at 11:45:32AM -0500]
> > > The issue with pg_xlog is you don't need bandwidth... you need
> > > super-low latency. The best way to accomplish that is to get a
> > > battery-backed RAID controller that you can enable write caching on.
> >
> > Sounds a bit risky to me :-)
>
> Well, you do need to understand what happens if the machine does lose
> power... namely you have a limited amount of time to get power back to
> the machine so that the controller can flush that data out. Other than
> that, it's not very risky.

We have burned ourselves more than once due to unreliable RAID
controllers ...

> quantities of memory. So in your case, 600M wouldn't be pushing things
> much at all. Even 1G wouldn't be that out of the ordinary. Also remember
> that the more memory for shared_buffers, the less for
> sorting/hashes/etc. (work_mem)

What do you mean - does a high value for shared_buffers imply that I
can/should lower the work_mem value? Or just that I should remember to
leave more than enough memory for work_mem, shared_buffers and the OS
caches? What is a sane value for work_mem? It's currently set to 8M.
On Thu, Oct 19, 2006 at 06:53:49PM +0200, Tobias Brox wrote:
> [Jim C. Nasby - Thu at 11:45:32AM -0500]
> > > > The issue with pg_xlog is you don't need bandwidth... you need
> > > > super-low latency. The best way to accomplish that is to get a
> > > > battery-backed RAID controller that you can enable write caching on.
> > >
> > > Sounds a bit risky to me :-)
> >
> > Well, you do need to understand what happens if the machine does lose
> > power... namely you have a limited amount of time to get power back to
> > the machine so that the controller can flush that data out. Other than
> > that, it's not very risky.
>
> We have burned ourselves more than once due to unreliable RAID
> controllers ...

Well, if you're buying unreliable hardware, there's not much you can do...
you're setting yourself up for problems.

> > quantities of memory. So in your case, 600M wouldn't be pushing things
> > much at all. Even 1G wouldn't be that out of the ordinary. Also
> > remember that the more memory for shared_buffers, the less for
> > sorting/hashes/etc. (work_mem)
>
> What do you mean - does a high value for shared_buffers imply that I
> can/should lower the work_mem value? Or just that I should remember to
> leave more than enough memory for work_mem, shared_buffers and the OS
> caches? What is a sane value for work_mem? It's currently set to 8M.

The key is that there's enough memory for shared_buffers and work_mem
without going into swap. If you're consuming that much work_mem I wouldn't
worry at all about OS caching.

What's reasonable for work_mem depends on your workload. If you've got some
reporting queries that you know aren't run very concurrently, they might
benefit from large values of work_mem. For stats.distributed.net, I set
work_mem to something like 2MB in the config file, but the nightly batch
routines manually set it up to 256M or more, because I know that those only
run one at a time, and having that extra memory means a lot of stuff that
would otherwise have to spill to disk now stays in memory.
--
Jim Nasby                    jim@nasby.net
EnterpriseDB                 http://enterprisedb.com    512.569.9461 (cell)
[Jim C. Nasby - Thu at 12:00:39PM -0500]
> What's reasonable for work_mem depends on your workload. If you've got
> some reporting queries that you know aren't run very concurrently, they
> might benefit from large values of work_mem. For stats.distributed.net,
> I set work_mem to something like 2MB in the config file, but the nightly
> batch routines manually set it up to 256M or more, because I know that
> those only run one at a time, and having that extra memory means a lot
> of stuff that would otherwise have to spill to disk now stays in memory.

That sounds like a good idea; indeed we do have a few heavy reporting
queries, and they are not run very concurrently (the application checks the
pg_stat_activity view and will refuse to run reports if there is too much
activity there). We would probably benefit from raising work_mem just for
those reports and keeping it lower for the rest of the connections.
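I.e. keep the modest default in postgresql.conf and let the report
connection bump it for itself - roughly something like this (the 256 MB
figure is just an example, the database name is made up, and the query is a
self-contained stand-in for a real report; pre-8.2 takes the value in kB):

  psql ourdb <<'SQL'
  SET work_mem = 262144;   -- 262144 kB = 256 MB; 8.2+ also accepts '256MB'
  -- stand-in for the real reporting query (a big sort/aggregate):
  SELECT g % 1000 AS bucket, count(*)
  FROM generate_series(1, 1000000) AS g
  GROUP BY g % 1000
  ORDER BY 1;
  RESET work_mem;          -- optional; the setting only lives for this session anyway
  SQL

Another option might be a dedicated reporting role with
ALTER USER reportuser SET work_mem = 262144, so the report scripts don't
have to remember to do the SET themselves.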
[Jim C. Nasby - Thu at 12:00:39PM -0500]
> Well, if you're buying unreliable hardware, there's not much you can
> do... you're setting yourself up for problems.

I'm luckily not responsible for the hardware, but my general experience
tells me that you never really know anything about hardware reliability
until the hardware actually breaks :-) More expensive equipment doesn't
always mean better equipment.
At 12:35 PM 10/19/2006, Tobias Brox wrote:
> [Jim C. Nasby - Thu at 10:28:31AM -0500]
> > I think it'd be much better to experiment with using much larger
> > shared_buffers settings. The conventional wisdom there is from 7.x
> > days, when you really didn't want a large buffer, but that doesn't
> > really apply with the new buffer management we got in 8.0. I know of
> > one site that doubled their performance by setting shared_buffers to
> > 50% of memory.
>
> I've upped it a bit, but it requires a server restart for the new setting
> to take effect, and that is relatively "expensive" for us. Does anyone
> else share Nasby's viewpoint, and does anyone have a recommendation for a
> good value? Our previous value was 200M, and I don't want to go to the
> extremes just yet. We have 6 GB of memory in total.

Jim is correct that traditional 7.x folklore regarding shared buffer size
is nowhere near as valid for 8.x. Jim tends to know what he is talking
about when speaking about pg operational issues.

Nonetheless, "YMMV". The only sure way to know what is best for your SW
running on your HW under your load conditions is to test, test, test.

A= Find out how much RAM your OS image needs.
   Usually 1/3 to 2/3 of a GB is plenty.

B= Find out how much RAM pg tasks need during typical peak usage and how
   much each of those tasks is using. This will tell you what work_mem
   should be. Note that you may well find out that you have not been using
   the best size for work_mem for some tasks when you investigate this.

(Total RAM) - A - B - (small amount for error margin) = 1st pass at
shared_buffers setting.

If this results in better performance than your current settings, either
declare victory and stop, or cut the number in half and see what it does to
performance. Then you can either set it to what experiment has shown to be
best so far, or use binary search to change the size of shared_buffers and
do experiments to your heart's content.

Ron
On 10/19/06, Ron <rjpeace@earthlink.net> wrote:
> Nonetheless, "YMMV". The only sure way to know what is best for your
> SW running on your HW under your load conditions is to test, test, test.

Anybody have/know of some data on shared_buffers settings on 8.1+?

merlin
[Ron - Thu at 03:10:35PM -0400]
> Jim is correct that traditional 7.x folklore regarding shared buffer size
> is nowhere near as valid for 8.x. Jim tends to know what he is talking
> about when speaking about pg operational issues.

I would not doubt it, but it's always better to hear it from more people :-)

> Nonetheless, "YMMV". The only sure way to know what is best for your SW
> running on your HW under your load conditions is to test, test, test.

Certainly. My time and possibilities for testing are not that great at the
moment, and in any case I believe small adjustments won't have really
significant effects. Our database server is not on fire at the moment, and
people are not angry about slow reports at the moment (actually, I started
this thread out of nothing but curiosity ... triggered by somebody
complaining about his desktop Windows computer swapping too much :-). So
for this round of tuning I'm more than satisfied with relying on helpful
rules of thumb.

> A= Find out how much RAM your OS image needs.
>    Usually 1/3 to 2/3 of a GB is plenty.

A quick glance at "free" already revealed that we are using less than
400 MB out of 6 GB in total (with the 7.x mindset that the OS should take
care of caching), and according to our previous settings the shared buffers
were eating 200 MB of this - so most of our memory is free.

> B= Find out how much RAM pg tasks need during typical peak usage and how
>    much each of those tasks is using.

I believe we have quite good control of the queries ... there are
safeguards to prevent most of the heavy queries from running concurrently,
and the lighter queries shouldn't use much memory, so it should be safe for
us to bump up the setting a bit.

In any case, I guess the OS is not that bad at handling the memory issue.
Unused memory will be used relatively intelligently (e.g. buffering the
temp files used by sorts), and overuse of memory will cause some swapping
(which is probably much worse than using temp files directly, but a little
bit of swapping is most probably not a disaster).
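Just to put numbers on Ron's formula for our box (all figures in MB, and A
and B below are rough guesses rather than measurements):

  TOTAL=6144; A=512; B=512; MARGIN=256   # generous guesses for OS image and pg tasks
  echo $(( TOTAL - A - B - MARGIN ))     # => 4864, a first pass at shared_buffers

That is far more than I would dare to jump to in one go, so per Ron's
advice the sensible thing would be to compare against half that value and
work from there.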
> I just came to think about /proc/sys/vm/swappiness ...
>
> When this one is set to a high number (say, 100 - which is the maximum),
> the kernel will aggressively swap out all memory that is not being
> accessed, to allow more memory for caches. For a postgres server, OS
> caches are good, because postgres relies on the OS to cache indices, etc.
> On the other hand, for any production server it's very bad to experience
> bursts of iowait when/if the swapped-out memory becomes needed -
> particularly if the server is used for interactive queries, like serving
> a web application.

This is very useful on smaller systems where memory is a scarce commodity.
I have a Xen virtual server with 128MB RAM. I noticed a big improvement in
query performance when I upped swappiness to 80. It gave just enough more
memory to the fs buffers that my common queries ran in memory.

Yes, throwing more RAM at it is usually the better solution, but it's nice
that Linux gives you that knob to turn when adding RAM isn't an option, at
least for me.
--
Kevin
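(For what it's worth, to make a tweak like that survive a reboot, the usual
place on a Debian-ish system is /etc/sysctl.conf - for example, as root:)

  sysctl -w vm.swappiness=80                      # apply immediately
  echo 'vm.swappiness = 80' >> /etc/sysctl.conf   # re-applied at every boot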
[Jim C. Nasby - Thu at 11:31:26AM -0500]
> The issue with pg_xlog is you don't need bandwidth... you need super-low
> latency. The best way to accomplish that is to get a battery-backed RAID
> controller that you can enable write caching on. In fact, if the
> controller is good enough, you can theoretically get away with just
> building one big RAID10 and letting the controller provide the
> low-latency fsyncs that pg_xlog depends on.

I was talking a bit with our system administrator. We're running 4 disks in
RAID 1+0 for the database and 2 disks in RAID 1 for the WAL and the OS. He
wasn't really sure whether we have write caching enabled on the RAID
controller or not. He pasted me some lines from dmesg:

  sda: asking for cache data failed
  sda: assuming drive cache: write through

(the "failed" line is apparently expected with these controllers) plus this
line identifying the controller:

  0000:02:0e.0 RAID bus controller: Dell PowerEdge Expandable RAID
  controller 4 (rev 06)

I think we're going to move the system logs, temporary files and backup
files from the WAL disks to the DB disks. Since our database isn't on fire
at the moment, I suppose we'll wait with moving the rest of the OS :-)
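Before we shuffle things around I'll probably just double-check which
filesystem each directory actually lives on and how busy the WAL disks get
during a backup - roughly like this (the paths are guesses from memory, and
iostat comes from Debian's sysstat package):

  df -P /var/log /var/lib/postgresql /var/backups   # which device backs which directory
  iostat -x 5                                       # per-device utilisation, 5-second samples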