Thread: Linux memory zone reclaim
Newer Linux systems with lots of cores have a problem I've been running into
a lot more lately that I wanted to share initial notes on. My "newer" means
running the 2.6.32 kernel or later, since I mostly track "enterprise" Linux
distributions like RHEL6 and Debian Squeeze. The issue is around Linux's
zone_reclaim feature. When it pops up, turning that feature off helps a lot.
Details on what I understand of the problem are below, and as always things
may have changed already in even newer kernels.

zone_reclaim tries to optimize memory speed on NUMA systems with more than
one CPU socket. Some banks of memory can be "closer" to a particular socket,
as measured by transfer rate, because of how the memory is routed to the
various cores on each socket. There is no true default for this setting.
Linux checks the hardware and turns this on/off based on what transfer rate
it sees between NUMA nodes, when there is more than one node and its test
shows some distance between them. You can tell if this is turned on like
this:

cat /proc/sys/vm/zone_reclaim_mode

Where 1 means it's enabled. Install the numactl utility and you can see why
it's made that decision:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 73718 MB
node 0 free: 419 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 73728 MB
node 1 free: 30 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Note how the "distance" for a transfer from node 0->0 or 1->1 is 10 units,
while 0->1 or 1->0 is 21. That's what's tested at boot time, where the
benchmarked speed is turned into this abstract distance number. And if there
is a large difference in cross-zone timing, then zone reclaim is enabled.

Scott Marlowe has been griping about this on the mailing lists here for a
while now, and it's increasingly been a problem on systems I've been seeing
lately too. This is a well known problem with MySQL:
http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
and NUMA issues have impacted Oracle too. On PostgreSQL shared_buffers isn't
normally set as high as MySQL's buffer cache, making it a bit less vulnerable
to this class of problem. But it's surely still a big problem for PostgreSQL
on some systems.

I've taken to disabling /proc/sys/vm/zone_reclaim_mode on any Linux
system where it's turned on now. I'm still working through whether
it also makes sense in all cases to use the more complicated memory
interleaving suggestions that MySQL users have implemented,
something most people would need to push into their PostgreSQL
server startup scripts in /etc/init.d (that will be a fun
rpm/deb packaging issue to deal with if this becomes more
widespread). Suggestions on whether that is necessary, or if just
disabling zone_reclaim is enough, are welcome from anyone who wants
to try to benchmark it.

Note that this is all tricky to test because some of the bad behavior only
happens when the server runs this zone reclaim method, which isn't a trivial
situation to create at will. Servers that have this problem tend to have it
pop up intermittently: you'll see one incredibly slow query periodically
while most are fast. It all depends on exactly what core is executing, where
the memory it needs is at, and whether the server wants to reclaim memory
(and just what that means is its own complicated topic) as part of that.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
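For reference, turning the feature off is a quick change. A minimal sketch,
assuming root access and the stock sysctl configuration file location:

# check the current setting; 1 means zone reclaim is on
cat /proc/sys/vm/zone_reclaim_mode

# turn it off immediately, no restart needed
echo 0 > /proc/sys/vm/zone_reclaim_mode

# make it persist across reboots
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf
sysctl -p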
On Tue, Jul 17, 2012 at 7:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Newer Linux systems with lots of cores have a problem I've been running into
> a lot more lately that I wanted to share initial notes on. My "newer" means
> running the 2.6.32 kernel or later, since I mostly track "enterprise" Linux
> distributions like RHEL6 and Debian Squeeze. The issue is around Linux's
> zone_reclaim feature. When it pops up, turning that feature off helps a lot.
> Details on what I understand of the problem are below, and as always things
> may have changed already in even newer kernels.

SNIP

> Scott Marlowe has been griping about this on the mailing lists here for a
> while now, and it's increasingly been a problem on systems I've been seeing
> lately too. This is a well known problem with MySQL:
> http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

Thanks for the link, I'll read up on it. I do have access to large
(24 to 40 core) NUMA machines so I might try some benchmarking on them
to see how they work.
On the larger, cellular Itanium systems with multiple motherboards (rx6600 to Superdome) Oracle has done a lot of tuning with the HP-UX kernel calls to optimize for NUMA issues. Will be interesting to see what they bring to Linux.
On Tue, Jul 17, 2012 at 11:00 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
> Thanks for the link, I'll read up on it. I do have access to large
> (24 to 40 core) NUMA machines so I might try some benchmarking on them
> to see how they work.

It must have been said already, but I'll repeat it just in case: I think
postgres has an easy solution. Spawn the postmaster with "interleave", to
allocate shared memory, and then switch to "local" on the backends.
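To make the first half of that concrete, a rough sketch of the wrapper
approach (the pg_ctl invocation and data directory path are only examples;
run it as the postgres user):

#!/bin/sh
# Start the cluster with allocations interleaved across all NUMA
# nodes, so shared_buffers gets spread evenly instead of filling up
# whichever node the postmaster happens to start on.
numactl --interleave=all pg_ctl -D /var/lib/postgresql/9.1/main start

The catch is that every backend forked by the postmaster inherits the
interleave policy too; putting the backends back on the default "local"
policy takes a set_mempolicy() call after fork, inside the server itself,
which a wrapper script can't reach.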
On Tue, Jul 18, 2012 at 2:38 AM, Claudio Freire wrote:
> It must have been said already, but I'll repeat it just in case: I think
> postgres has an easy solution. Spawn the postmaster with "interleave", to
> allocate shared memory, and then switch to "local" on the backends.

Do you have a suggestion about how to do that? I'm running Ubuntu 12.04 and
PG 9.1. I've modified pg_ctlcluster to make pg_ctl use a wrapper script
which starts the postmaster under numactl, but all subsequent client
processes are started with interleaving enabled as well. Any ideas how to
make just the postmaster process start with interleaving?

Thanks
On Tue, Jul 24, 2012 at 3:36 PM, John Lister <john.lister@kickstone.com> wrote:
> Do you have a suggestion about how to do that? I'm running Ubuntu 12.04 and
> PG 9.1. I've modified pg_ctlcluster to make pg_ctl use a wrapper script
> which starts the postmaster under numactl, but all subsequent client
> processes are started with interleaving enabled as well. Any ideas how to
> make just the postmaster process start with interleaving?

The postmaster should call set_mempolicy() right after forking:
http://linux.die.net/man/2/set_mempolicy
On Tue, Jul 24, 2012 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Jul 24, 2012 at 3:36 PM, John Lister <john.lister@kickstone.com> wrote:
>> Do you have a suggestion about how to do that? I'm running Ubuntu 12.04 and
>> PG 9.1. I've modified pg_ctlcluster to make pg_ctl use a wrapper script
>> which starts the postmaster under numactl, but all subsequent client
>> processes are started with interleaving enabled as well. Any ideas how to
>> make just the postmaster process start with interleaving?
>
> The postmaster should call set_mempolicy() right after forking:
> http://linux.die.net/man/2/set_mempolicy

Something like the attached patch (untested)
On Tue, Jul 24, 2012 at 5:12 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Something like the attached patch (untested)

Sorry, on that patch, MPOL_INTERLEAVE should be MPOL_DEFAULT
On 24/07/2012 21:12, Claudio Freire wrote:
> On Tue, Jul 24, 2012 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The postmaster should call set_mempolicy() right after forking:
>> http://linux.die.net/man/2/set_mempolicy
>
> Something like the attached patch (untested)

Cheers, I'll give it a go. I wonder if this is likely to be integrated into
the main code? As has been mentioned here before, postgresql isn't as badly
affected as mysql for example, but I'm wondering if the trend to larger
memory and more cores/nodes means it should be offered as an option?
Although, saying that, I've read that 10GB of shared_buffers may be enough
even on big machines with 128+GB of RAM.

Thoughts?

John
On Tue, Jul 24, 2012 at 6:23 PM, John Lister <john.lister@kickstone.com> wrote:
> Cheers, I'll give it a go. I wonder if this is likely to be integrated into
> the main code? As has been mentioned here before, postgresql isn't as badly
> affected as mysql for example, but I'm wondering if the trend to larger
> memory and more cores/nodes means it should be offered as an option?
> Although, saying that, I've read that 10GB of shared_buffers may be enough
> even on big machines with 128+GB of RAM.

Remember to change MPOL_INTERLEAVE to MPOL_DEFAULT ;-)

I'm trying to test it myself
My experience is that disabling swap and turning off zone_reclaim_mode gets
rid of any real problem for a large memory postgresql database server. While
it would be great to have a NUMA aware pgsql, I question the solidity and
reliability of the current linux kernel implementation in a NUMA
environment, especially given the poor behaviour of the linux kernel as
regards swap.

On Tue, Jul 24, 2012 at 3:23 PM, John Lister <john.lister@kickstone.com> wrote:
> Cheers, I'll give it a go. I wonder if this is likely to be integrated into
> the main code? As has been mentioned here before, postgresql isn't as badly
> affected as mysql for example, but I'm wondering if the trend to larger
> memory and more cores/nodes means it should be offered as an option?
> Although, saying that, I've read that 10GB of shared_buffers may be enough
> even on big machines with 128+GB of RAM.

--
To understand recursion, one must first understand recursion.
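The swap half of that is also quick to do. A minimal sketch, assuming you
want swap fully off rather than merely discouraged:

# turn off all active swap devices immediately
swapoff -a

# to keep swap off across reboots, also comment out the swap
# entries in /etc/fstab

# a softer alternative: leave swap configured but tell the kernel
# to avoid it as much as possible
echo "vm.swappiness = 0" >> /etc/sysctl.conf
sysctl -p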
On Tue, Jul 24, 2012 at 6:23 PM, John Lister <john.lister@kickstone.com> wrote:
> Cheers, I'll give it a go. I wonder if this is likely to be integrated into
> the main code? As has been mentioned here before, postgresql isn't as badly
> affected as mysql for example, but I'm wondering if the trend to larger
> memory and more cores/nodes means it should be offered as an option?
> Although, saying that, I've read that 10GB of shared_buffers may be enough
> even on big machines with 128+GB of RAM.
>
> Thoughts?

The attached (better) patch builds and doesn't crash at least. Which is
always good. Configure with --with-numa
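One rough way to sanity-check a patch like this from the outside, assuming
an example data directory path: the policy of each mapping shows up in
/proc/<pid>/numa_maps, so you can confirm the shared memory segment stays
interleaved while a backend's private memory reverts to the default policy.

# the postmaster's pid is the first line of postmaster.pid
PGDATA=/var/lib/postgresql/9.1/main    # example path
PM_PID=$(head -1 "$PGDATA/postmaster.pid")

# mappings spread across nodes are tagged "interleave"; everything
# else shows "default"
grep -c interleave /proc/$PM_PID/numa_maps

# repeat with a backend's pid (e.g. from pg_stat_activity): with the
# patch working, its private mappings should all show "default"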
On Tue, Jul 17, 2012 at 09:52:11PM -0400, Greg Smith wrote:
> I've taken to disabling /proc/sys/vm/zone_reclaim_mode on any Linux
> system where it's turned on now. I'm still working through whether
> it also makes sense in all cases to use the more complicated memory
> interleaving suggestions that MySQL users have implemented,
> something most people would need to push into their PostgreSQL
> server startup scripts in /etc/init.d (that will be a fun
> rpm/deb packaging issue to deal with if this becomes more
> widespread). Suggestions on whether that is necessary, or if just
> disabling zone_reclaim is enough, are welcome from anyone who wants
> to try to benchmark it.

Should I be turning it off on my server too? It is enabled on my system.

--
Bruce Momjian  <bruce@momjian.us>  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
Greg Smith <greg@2ndQuadrant.com> wrote:

> You can tell if this is turned on like this:
>
> cat /proc/sys/vm/zone_reclaim_mode

As a data point, the machine where I benchmarked some of the 9.2
scalability features does not appear to have this turned on:

# cat /proc/sys/vm/zone_reclaim_mode
0

Our Linux version:

Linux version 2.6.32.46-0.3-default (geeko@buildhost) (gcc version 4.3.4
[gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP 2011-09-29 17:49:31
+0200

This has 32 cores (64 "threads" with HT) on 4 Xeon X7560 CPUs:

Intel(R) Xeon(R) CPU X7560 @ 2.27GHz

It has 256GB RAM on 4GB DIMMs, with each core controlling 2 DIMMs and each
core able to directly talk to every other core. So, it is non-uniform, but
with this arrangement it is more a matter that there is an 8GB set of memory
that is "fast" for each core, and the other 97% of RAM is all accessible at
the same speed. There were some other options for what to install on this
system, or how to install it, which wouldn't have kept things this tight.

> Install the numactl utility and you can see why it's made that
> decision:

We get this:

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 65519 MB
node 0 free: 283 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 65536 MB
node 1 free: 25 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 65536 MB
node 2 free: 26 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 65536 MB
node 3 free: 25 MB
node distances:
node   0   1   2   3
  0:  10  11  11  11
  1:  11  10  11  11
  2:  11  11  10  11
  3:  11  11  11  10

When considering a hardware purchase, it might be wise to pay close
attention to how "far" a core may need to go to get to the most "distant"
RAM.

-Kevin
On Mon, Jul 30, 2012 at 10:43 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> node distances:
> node   0   1   2   3
>   0:  10  11  11  11
>   1:  11  10  11  11
>   2:  11  11  10  11
>   3:  11  11  11  10
>
> When considering a hardware purchase, it might be wise to pay close
> attention to how "far" a core may need to go to get to the most "distant"
> RAM.

I think zone_reclaim gets turned on with a high ratio. If the inter-node
costs were the same, and the intra-node costs dropped in half, zone reclaim
would likely get turned on at boot time. I had something similar on a 48
core system, but if I recall correctly the matrix was 8x8 and the cost
differential was much higher.

The symptoms I saw were that a very hard working db, on a 128G machine with
about 95G as OS / kernel cache, would slow to a crawl with kswapd working
very hard (I think it was kswapd) after a period of 1 to 3 weeks. Note that
actual swap in and out wasn't all that great by vmstat. The same performance
hit happened on a similar machine used as a file server after a similar
warm-up period.

The real danger here is that the misbehavior can take a long time to show
up. From what I read at the time, the performance gain from zone_reclaim = 1
was minimal for a file or db server, and more relevant for a large virtual
machine farm, with a lot of processes chopped into sections small enough to
fit in one node's memory and not need a lot of access from another node.
Anything that relies on the OS to cache is likely not served by
zone_reclaim = 1.
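For anyone who wants to watch for this pattern on their own box, a rough
sketch of what to look at, using only standard tools:

# the telltale combination described above: kswapd spinning while
# the si/so columns stay near zero
vmstat 1
top -b -n 1 | grep kswapd

# per-node counters: numa_miss and numa_foreign climbing means
# allocations are being pushed off their preferred node
numastat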
> node distances:
> node   0   1   2   3
>   0:  10  11  11  11
>   1:  11  10  11  11
>   2:  11  11  10  11
>   3:  11  11  11  10
>
> When considering a hardware purchase, it might be wise to pay close
> attention to how "far" a core may need to go to get to the most "distant"
> RAM.

Yikes, my server is certainly asymmetric:

node distances:
node   0   1
  0:  10  21
  1:  21  10

and my Debian Squeeze certainly knows that:

$ cat < /proc/sys/vm/zone_reclaim_mode
1

Server specs: http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012

I have 12 2GB DDR3 DIMMs. Of course, my home server is ridiculously idle
too. :-)

--
Bruce Momjian  <bruce@momjian.us>  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ It's impossible for everything to be true. +
On 7/30/12 10:09 AM, Scott Marlowe wrote:
> I think zone_reclaim gets turned on with a high ratio. If the inter-node
> costs were the same, and the intra-node costs dropped in half, zone reclaim
> would likely get turned on at boot time.

We've been seeing a major problem with zone_reclaim and Linux, in that
Linux won't use the FS cache on the "distant" RAM *at all* if it thinks
that RAM is distant enough. Thus, you get instances of seeing only half
of RAM used for FS cache, even though the database is 5X larger than RAM.

This is poor design on Linux's part, since even the distant RAM is
faster than disk. For now, we've been disabling zone_reclaim entirely.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
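A rough way to check whether a box is hitting this (the sysfs paths are the
standard ones on 2.6.32-era kernels): look at how much page cache each node
is actually holding.

# page cache held per NUMA node; one node pinned near zero while
# the others are full matches the behavior described above
grep FilePages /sys/devices/system/node/node*/meminfo

# free memory per node, for comparison
numactl --hardware | grep free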
On Fri, Aug 3, 2012 at 4:30 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 7/30/12 10:09 AM, Scott Marlowe wrote:
>> I think zone_reclaim gets turned on with a high ratio. If the inter-node
>> costs were the same, and the intra-node costs dropped in half, zone
>> reclaim would likely get turned on at boot time.
>
> We've been seeing a major problem with zone_reclaim and Linux, in that
> Linux won't use the FS cache on the "distant" RAM *at all* if it thinks
> that RAM is distant enough. Thus, you get instances of seeing only half
> of RAM used for FS cache, even though the database is 5X larger than RAM.
>
> This is poor design on Linux's part, since even the distant RAM is
> faster than disk. For now, we've been disabling zone_reclaim entirely.

I haven't run into this, but we were running ubuntu 10.04 LTS. What
kernel were you running when this happened? I'd love to see a test
case on this, as it seems like a major regression if it's on newer
kernels, and we're looking at running 12.04 LTS soon on one of our
bigger machines.
>> This is poor design on Linux's part, since even the distant RAM is
>> faster than disk. For now, we've been disabling zone_reclaim entirely.
>
> I haven't run into this, but we were running ubuntu 10.04 LTS. What
> kernel were you running when this happened? I'd love to see a test
> case on this, as it seems like a major regression if it's on newer
> kernels, and we're looking at running 12.04 LTS soon on one of our
> bigger machines.

Jeff Frost will have a blog up about it later; we're still collecting data.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com