Thread: Linux memory zone reclaim

Linux memory zone reclaim

From
Greg Smith
Date:
Newer Linux systems with lots of cores have a problem that I've been
running into a lot more lately, and I wanted to share some initial notes
on it.  By "newer" I mean running the 2.6.32 kernel or later, since I
mostly track "enterprise" Linux distributions like RHEL6 and Debian
Squeeze.  The issue is around Linux's zone_reclaim feature.  When it
pops up, turning that feature off helps a lot.  Details on what I
understand of the problem are below, and as always things may have
changed already in even newer kernels.

zone_reclaim tries to optimize memory speed on NUMA systems with more
than one CPU socket.  There are some banks of memory that can be
"closer" to a particular socket, as measured by transfer rate, because
of how the memory is routed to the various cores on each socket.  There
is no true default for this setting.  Linux checks the hardware and
turns this on/off based on what transfer rate it sees between NUMA
nodes, where there is more than one and its test shows some distance
between them.  You can tell if this is turned on like this:

cat /proc/sys/vm/zone_reclaim_mode

Where 1 means it's enabled.  Install the numactl utility and you can see
why it's made that decision:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 73718 MB
node 0 free: 419 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 73728 MB
node 1 free: 30 MB
node distances:
node   0   1
   0:  10  21
   1:  21  10

Note how the "distance" for a transfer from node 0->0 or 1->1 is 10
units, while 0->1 or 1->0 is 21.  That's what's tested at boot time,
where the benchmarked speed is turned into this abstract distance
number.  And if there is a large difference in cross-zone timing, then
zone reclaim is enabled.
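
As a rough illustration of that decision, the sketch below pulls the
largest value out of the "node distances" matrix printed by
numactl --hardware and compares it with a cutoff.  The function names
are made up for this example, and the cutoff of 20 is an assumption
about kernels of this era rather than something stated in this thread,
so treat it as a tunable.

```shell
#!/bin/sh
# Print the largest node distance found in `numactl --hardware` output
# read from stdin. max_node_distance is a hypothetical helper name.
max_node_distance() {
    awk '
        BEGIN { max = 0 }
        /^node distances:/ { in_matrix = 1; next }
        # Matrix rows look like "  0:  10  21"; the header row has no colon.
        in_matrix && $1 ~ /^[0-9]+:$/ {
            for (i = 2; i <= NF; i++) if ($i + 0 > max) max = $i + 0
        }
        END { print max }
    '
}

# Assumed cutoff above which the kernel is expected to enable zone reclaim.
# This value is an assumption; override it to match your kernel.
RECLAIM_THRESHOLD=${RECLAIM_THRESHOLD:-20}

# Read `numactl --hardware` output on stdin and report the likely setting.
expects_zone_reclaim() {
    dist=$(max_node_distance)
    if [ "$dist" -gt "$RECLAIM_THRESHOLD" ]; then
        echo "max distance $dist > $RECLAIM_THRESHOLD: zone reclaim likely on"
    else
        echo "max distance $dist <= $RECLAIM_THRESHOLD: zone reclaim likely off"
    fi
}
```

Typical use would be `numactl --hardware | expects_zone_reclaim`, then
comparing the guess against the real value in
/proc/sys/vm/zone_reclaim_mode.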

Scott Marlowe has been griping about this on the mailing lists here for
a while now, and it's been increasingly troublesome on systems I've been
seeing lately too.  This is a well known problem with MySQL:
http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
and NUMA issues have impacted Oracle too.  On PostgreSQL shared_buffers
isn't normally set as high as MySQL's buffer cache, making it a bit less
vulnerable to this class of problem.  But it's surely still a big
problem for PostgreSQL on some systems.

I've taken to disabling /proc/sys/vm/zone_reclaim_mode on any Linux
system where it's turned on now.  I'm still working through whether it
also makes sense in all cases to use the more complicated memory
interleaving suggestions that MySQL users have implemented, something
most people would need to push into their PostgreSQL server startup
scripts in /etc/init.d.  (That will be a fun rpm/deb packaging issue to
deal with if this becomes more widespread.)  Suggestions on whether that
is necessary, or if just disabling zone_reclaim is enough, are welcome
from anyone who wants to try and benchmark it.
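
For anyone who wants to try the simple fix, a minimal sketch of
disabling the setting follows.  The helper name is hypothetical, and
/etc/sysctl.conf as the place to persist the value is an assumption
about the distribution; some systems use files under /etc/sysctl.d
instead.

```shell
#!/bin/sh
# Append "vm.zone_reclaim_mode = 0" to a sysctl configuration file so the
# setting survives a reboot. If the file already mentions the key, nothing
# is written; check the existing value by hand in that case.
# persist_zone_reclaim_off is a hypothetical helper name, and the default
# path is an assumption about where the distribution reads sysctl settings.
persist_zone_reclaim_off() {
    conf=${1:-/etc/sysctl.conf}
    if grep -q '^[[:space:]]*vm\.zone_reclaim_mode' "$conf" 2>/dev/null; then
        return 0
    fi
    echo 'vm.zone_reclaim_mode = 0' >> "$conf"
}
```

As root, `sysctl -w vm.zone_reclaim_mode=0` changes the running kernel
immediately, and calling `persist_zone_reclaim_off` afterwards keeps the
setting across reboots.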

Note that this is all tricky to test because some of the bad behavior
only happens when the server runs this zone reclaim method, which isn't
a trivial situation to create at will.  Servers that have this problem
tend to have it pop up intermittently: you'll see one incredibly slow
query periodically while most are fast.  It all depends on exactly which
core is executing, where the memory it needs is, and whether the
server wants to reclaim memory (and just what that means is its own
complicated topic) as part of that.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


Re: Linux memory zone reclaim

From
Scott Marlowe
Date:
On Tue, Jul 17, 2012 at 7:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Newer Linux systems with lots of cores have a problem I've been running into
> a lot more lately I wanted to share initial notes on.  My "newer" means
> running the 2.6.32 kernel or later, since I mostly track "enterprise" Linux
> distributions like RHEL6 and Debian Squeeze.  The issue is around Linux's
> zone_reclaim feature.  When it pops up, turning that feature off help a lot.
> Details on what I understand of the problem are below, and as always things
> may have changed already in even newer kernels.

SNIP

> Scott Marlowe has been griping about this on the mailing lists here for a
> while now, and it's increasingly trouble for systems I've been seeing lately
> too.  This is a well known problem with MySQL:
> http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

Thanks for the link, I'll read up on it.  I do have access to large
(24 to 40 core) NUMA machines so I might try some benchmarking on them
to see how they work.

Re: Linux memory zone reclaim

From
Dave Crooke
Date:

On the larger, cellular Itanium systems with multiple motherboards (rx6600 to Superdome) Oracle has done a lot of tuning with the HP-UX kernel calls to optimize for NUMA issues. Will be interesting to see what they bring to Linux.


Re: Linux memory zone reclaim

From
Claudio Freire
Date:
On Tue, Jul 17, 2012 at 11:00 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>
> Thanks for the link, I'll read up on it.  I do have access to large
> (24 to 40 core) NUMA machines so I might try some benchmarking on them
> to see how they work.

It must have been said already, but I'll repeat it just in case:

I think postgres has an easy solution. Spawn the postmaster with
"interleave", to allocate shared memory, and then switch to "local" on
the backends.
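
The first half of that idea can be sketched with the numactl command,
which has an --interleave option.  The helper name and the NUMACTL
override variable below are made up for illustration.  Memory policy is
inherited by child processes, so this alone leaves the backends
interleaved as well; switching them back to "local" would need a
set_mempolicy call inside each backend, which a wrapper script cannot
do.

```shell
#!/bin/sh
# Command used to set the memory policy. It can be overridden, for example
# with NUMACTL=echo, to preview the command line instead of running it.
NUMACTL=${NUMACTL:-numactl}

# Start the postmaster under an interleaved memory policy.
# start_postmaster_interleaved is a hypothetical helper; $1 is the
# PostgreSQL data directory to pass to pg_ctl.
start_postmaster_interleaved() {
    "$NUMACTL" --interleave=all pg_ctl -D "$1" start
}
```

Running it with NUMACTL=echo prints the command that would be executed,
which is a cheap way to check an init script change before restarting
the server.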

Re: Linux memory zone reclaim

From
John Lister
Date:
On Tue, Jul 18, 2012 at 2:38 AM, Claudio Freire wrote:
 >It must have been said already, but I'll repeat it just in case:

 >I think postgres has an easy solution. Spawn the postmaster with
 >"interleave", to allocate shared memory, and then switch to "local" on
 >the backends.

Do you have a suggestion about how to do that?  I'm running Ubuntu 12.04
and PG 9.1.  I've modified pg_ctlcluster so that pg_ctl starts the
postmaster through a numactl wrapper script, but all subsequent client
processes are started with interleaving enabled as well.  Any ideas how
to make just the postmaster process start with interleaving?

Thanks


Re: Linux memory zone reclaim

From
Claudio Freire
Date:
On Tue, Jul 24, 2012 at 3:36 PM, John Lister <john.lister@kickstone.com> wrote:
> Do you have a suggestion about how to do that? I'm running Ubuntu 12.04 and
> PG 9.1, I've modified pg_ctlcluster to cause pg_ctl to use a wrapper script
> which starts the postmaster using a numactl wrapper, but all subsequent
> client processes are started with interleaving enabled as well. Any ideas
> how to make just the postmaster process start with interleaving?

postmaster should call set_mempolicy (from the numactl library) right after forking:
http://linux.die.net/man/2/set_mempolicy

Re: Linux memory zone reclaim

From
Claudio Freire
Date:
On Tue, Jul 24, 2012 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Tue, Jul 24, 2012 at 3:36 PM, John Lister <john.lister@kickstone.com> wrote:
>> Do you have a suggestion about how to do that? I'm running Ubuntu 12.04 and
>> PG 9.1, I've modified pg_ctlcluster to cause pg_ctl to use a wrapper script
>> which starts the postmaster using a numactl wrapper, but all subsequent
>> client processes are started with interleaving enabled as well. Any ideas
>> how to make just the postmaster process start with interleaving?
>
> postmaster should call numactl right after forking:
> http://linux.die.net/man/2/set_mempolicy

Something like the attached patch (untested)

Attachment

Re: Linux memory zone reclaim

From
Claudio Freire
Date:
On Tue, Jul 24, 2012 at 5:12 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Something like the attached patch (untested)

Sorry, on that patch, MPOL_INTERLEAVE should be MPOL_DEFAULT

Re: Linux memory zone reclaim

From
John Lister
Date:
On 24/07/2012 21:12, Claudio Freire wrote:
> On Tue, Jul 24, 2012 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Tue, Jul 24, 2012 at 3:36 PM, John Lister <john.lister@kickstone.com> wrote:
>>> Do you have a suggestion about how to do that? I'm running Ubuntu 12.04 and
>>> PG 9.1, I've modified pg_ctlcluster to cause pg_ctl to use a wrapper script
>>> which starts the postmaster using a numactl wrapper, but all subsequent
>>> client processes are started with interleaving enabled as well. Any ideas
>>> how to make just the postmaster process start with interleaving?
>> postmaster should call numactl right after forking:
>> http://linux.die.net/man/2/set_mempolicy
> Something like the attached patch (untested)
Cheers, I'll give it a go.  I wonder if this is likely to be integrated
into the main code?  As has been mentioned here before, PostgreSQL isn't
as badly affected as MySQL, for example, but I'm wondering if the trend
to larger memory and more cores/nodes means it should be offered as an
option?  That said, I've read that 10GB of shared buffers may be enough
even on big machines with 128+GB of RAM.

Thoughts?

John



Re: Linux memory zone reclaim

From
Claudio Freire
Date:
On Tue, Jul 24, 2012 at 6:23 PM, John Lister <john.lister@kickstone.com> wrote:
> Cheers, I'll give it a go, I wonder if this is likely to be integrated into
> the main code? As has been mentioned here before, postgresql isn't as badly
> affected as mysql for example, but I'm wondering if the trend to larger
> memory and more cores/nodes means it should be offered as an option?
> Although saying that I've read that 10Gb of shared buffers may be enough
> even in big machines 128+Gb ram..

Remember to change MPOL_INTERLEAVE to MPOL_DEFAULT ;-)

I'm trying to test it myself

Re: Linux memory zone reclaim

From
Scott Marlowe
Date:
My experience is that disabling swap and turning off zone_reclaim_mode
gets rid of any real problem for a large memory postgresql database
server.  While it would be great to have a NUMA aware pgsql, I
question the solidity and reliability of the current linux kernel
implementation in a NUMA environment, especially given the poor
swap behaviour of the linux kernel.

On Tue, Jul 24, 2012 at 3:23 PM, John Lister <john.lister@kickstone.com> wrote:
> On 24/07/2012 21:12, Claudio Freire wrote:
>>
>> On Tue, Jul 24, 2012 at 3:41 PM, Claudio Freire <klaussfreire@gmail.com>
>> wrote:
>>>
>>> On Tue, Jul 24, 2012 at 3:36 PM, John Lister <john.lister@kickstone.com>
>>> wrote:
>>>>
>>>> Do you have a suggestion about how to do that? I'm running Ubuntu 12.04
>>>> and
>>>> PG 9.1, I've modified pg_ctlcluster to cause pg_ctl to use a wrapper
>>>> script
>>>> which starts the postmaster using a numactl wrapper, but all subsequent
>>>> client processes are started with interleaving enabled as well. Any
>>>> ideas
>>>> how to make just the postmaster process start with interleaving?
>>>
>>> postmaster should call numactl right after forking:
>>> http://linux.die.net/man/2/set_mempolicy
>>
>> Something like the attached patch (untested)
>
> Cheers, I'll give it a go, I wonder if this is likely to be integrated into
> the main code? As has been mentioned here before, postgresql isn't as badly
> affected as mysql for example, but I'm wondering if the trend to larger
> memory and more cores/nodes means it should be offered as an option?
> Although saying that I've read that 10Gb of shared buffers may be enough
> even in big machines 128+Gb ram..
>
> Thoughts?
>
> John



--
To understand recursion, one must first understand recursion.

Re: Linux memory zone reclaim

From
Claudio Freire
Date:
On Tue, Jul 24, 2012 at 6:23 PM, John Lister <john.lister@kickstone.com> wrote:
> Cheers, I'll give it a go, I wonder if this is likely to be integrated into
> the main code? As has been mentioned here before, postgresql isn't as badly
> affected as mysql for example, but I'm wondering if the trend to larger
> memory and more cores/nodes means it should be offered as an option?
> Although saying that I've read that 10Gb of shared buffers may be enough
> even in big machines 128+Gb ram..
>
> Thoughts?

The attached (better) patch builds and doesn't crash at least.
Which is always good.

Configure with --with-numa

Attachment

Re: Linux memory zone reclaim

From
Bruce Momjian
Date:
On Tue, Jul 17, 2012 at 09:52:11PM -0400, Greg Smith wrote:
> I've taken to disabling /proc/sys/vm/zone_reclaim_mode on any Linux
> system where it's turned on now.  I'm still working through whether
  --------------------------------
> it also makes sense in all cases to use the more complicated memory
> interleaving suggestions that MySQL users have implemented,
> something most people would need to push into their PostgreSQL
> server started up scripts in /etc/init.d  (That will be a fun
> rpm/deb packaging issue to deal with if this becomes more
> wide-spread)  Suggestions on whether that is necessary, or if just
> disabling zone_reclaim is enough, are welcome from anyone who wants
> to try and benchmark it.

Should I be turning it off on my server too?  It is enabled on my
system.


--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Linux memory zone reclaim

From
"Kevin Grittner"
Date:
Greg Smith <greg@2ndQuadrant.com> wrote:

> You can tell if this is turned on like this:
>
> cat /proc/sys/vm/zone_reclaim_mode

As a data point, the machine I used to benchmark some of the 9.2
scalability features does not appear to have this turned on:

# cat /proc/sys/vm/zone_reclaim_mode
0

Our Linux version:

Linux version 2.6.32.46-0.3-default (geeko@buildhost) (gcc version
4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP
2011-09-29 17:49:31 +0200

This has 32 cores (64 "threads" with HT) on 4 Xeon X7560 CPUs.

Intel(R) Xeon(R) CPU           X7560  @ 2.27GHz

It has 256GB RAM on 4GB DIMMs, with each core controlling 2 DIMMs
and each core able to directly talk to every other core.  So, it
is non-uniform, but with this arrangement it is more a matter that
there is an 8GB set of memory that is "fast" for each core and the
other 97% of RAM is all accessible at the same speed.  There were
some other options for what to install on this system or how to
install it which wouldn't have kept things this tight.

> Install the numactl utility and you can see why it's made that
> decision:

We get this:

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 65519 MB
node 0 free: 283 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 65536 MB
node 1 free: 25 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 65536 MB
node 2 free: 26 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 65536 MB
node 3 free: 25 MB
node distances:
node   0   1   2   3
  0:  10  11  11  11
  1:  11  10  11  11
  2:  11  11  10  11
  3:  11  11  11  10

When considering a hardware purchase, it might be wise to pay close
attention to how "far" a core may need to go to get to the most
"distant" RAM.

-Kevin

Re: Linux memory zone reclaim

From
Scott Marlowe
Date:
On Mon, Jul 30, 2012 at 10:43 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> node distances:
> node   0   1   2   3
>   0:  10  11  11  11
>   1:  11  10  11  11
>   2:  11  11  10  11
>   3:  11  11  11  10
>
> When considering a hardware purchase, it might be wise to pay close
> attention to how "far" a core may need to go to get to the most
> "distant" RAM.

I think the zone_reclaim gets turned on with a high ratio.  If the
inter node costs were the same, and the intranode costs dropped in
half, zone reclaim would likely get turned on at boot time.

I had something similar in a 48 core system but if I recall correctly
the matrix was 8x8 and the cost differential was much higher.

The symptom I saw was that a very hard working db, on a 128G machine
with about 95G as OS / kernel cache, would slow to a crawl with kswapd
working very hard (I think it was kswapd) after a period of 1 to 3
weeks.  Note that actual swap in and out wasn't all that great by
vmstat.  The same performance hit happened on a similar machine used
as a file server after a similar period of warm up.

The real danger here is that the misbehavior can take a long time to
show up, and from what I read at the time, the performance gain for
any zone reclaim = 1 was minimal for a file or db server, and more in
line for a large virtual machine farm, with a lot of processes chopped
into sections small enough to fit in one node's memory and not need a
lot of access from another node.  Anything that relies on the OS to
cache is likely not served by zone reclaim = 1.

Re: Linux memory zone reclaim

From
Bruce Momjian
Date:
> node distances:
> node   0   1   2   3
>   0:  10  11  11  11
>   1:  11  10  11  11
>   2:  11  11  10  11
>   3:  11  11  11  10
>
> When considering a hardware purchase, it might be wise to pay close
> attention to how "far" a core may need to go to get to the most
> "distant" RAM.

Yikes, my server is certainly asymmetric:

    node distances:
    node   0   1
      0:  10  21
      1:  21  10

and my Debian Squeeze certainly knows that:

    $ cat <  /proc/sys/vm/zone_reclaim_mode
    1

Server specs:

    http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012

I have 12 2GB DDR3 DIMMs.

Of course, my home server is ridiculously idle too.  :-)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Linux memory zone reclaim

From
Josh Berkus
Date:
On 7/30/12 10:09 AM, Scott Marlowe wrote:
> I think the zone_reclaim gets turned on with a high ratio.  If the
> inter node costs were the same, and the intranode costs dropped in
> half, zone reclaim would likely get turned on at boot time.

We've been seeing a major problem with zone_reclaim and Linux, in that
Linux won't use the FS cache on the "distant" RAM *at all* if it thinks
that RAM is distant enough.  Thus, you get instances of seeing only half
of RAM used for FS cache, even though the database is 5X larger than RAM.

This is poor design on Linux's part, since even the distant RAM is
faster than disk.  For now, we've been disabling zone_reclaim entirely.
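
One way to look for this symptom is to compare how much page cache each
NUMA node holds.  The sketch below assumes the per-node meminfo files
under /sys/devices/system/node report a FilePages line in kB, and the
function name is made up for this example.

```shell
#!/bin/sh
# Read per-node meminfo text on stdin and print the page cache held by
# each node. Expects lines like "Node 0 FilePages:   123456 kB".
# node_file_cache is a hypothetical helper name.
node_file_cache() {
    awk '$1 == "Node" && $3 == "FilePages:" {
        # Values are reported in kB; convert to whole MB.
        printf "node %s: %d MB page cache\n", $2, $4 / 1024
    }'
}
```

It would be fed with something like
`cat /sys/devices/system/node/node*/meminfo | node_file_cache`.  A node
that stays near zero while the others are full would match the
behaviour described above.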

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Linux memory zone reclaim

From
Scott Marlowe
Date:
On Fri, Aug 3, 2012 at 4:30 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 7/30/12 10:09 AM, Scott Marlowe wrote:
>> I think the zone_reclaim gets turned on with a high ratio.  If the
>> inter node costs were the same, and the intranode costs dropped in
>> half, zone reclaim would likely get turned on at boot time.
>
> We've been seeing a major problem with zone_reclaim and Linux, in that
> Linux won't use the FS cache on the "distant" RAM *at all* if it thinks
> that RAM is distant enough.  Thus, you get instances of seeing only half
> of RAM used for FS cache, even though the database is 5X larger than RAM.
>
> This is poor design on Linux's part, since even the distant RAM is
> faster than disk.  For now, we've been disabling zone_reclaim entirely.

I haven't run into this, but we were running ubuntu 10.04 LTS.  What
kernel were you running when this happened?  I'd love to see a test
case on this, as it seems like a major regression if it's on newer
kernels, and we're looking at running 12.04 LTS soon on one of our
bigger machines.

Re: Linux memory zone reclaim

From
Josh Berkus
Date:
>> This is poor design on Linux's part, since even the distant RAM is
>> faster than disk.  For now, we've been disabling zone_reclaim entirely.
>
> I haven't run into this, but we were running ubuntu 10.04 LTS.  What
> kernel were you running when this happened?  I'd love to see a test
> case on this, as it seems like a major regression if it's on newer
> kernels, and we're looking at running 12.04 LTS soon on one of our
> bigger machines.

Jeff Frost will have a blog up about it later; we're still collecting data.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com