Thread: Tons of free RAM. Can't make it go away.
Hey everyone!

This is pretty embarrassing, but I've never seen this before. This is our system's current memory allocation from 'free -m':

             total       used       free    buffers     cached
Mem:         72485      58473      14012          3      34020
-/+ buffers/cache:      24449      48036

So, I've got 14GB of RAM that the OS is just refusing to use for disk or page cache. Does anyone know what might cause that?

Our uname -sir, for reference:

Linux 3.2.0-31-generic x86_64

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com
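P.S. For anyone squinting at the free output above, the "-/+ buffers/cache" row is just the Mem row with buffers and cache folded back in, give or take MiB rounding:

    used: 58473 - 3 - 34020 = 24450   (free -m reports 24449)
    free: 14012 + 3 + 34020 = 48035   (free -m reports 48036)

So nearly 48GB is nominally available to applications, but only about 34GB of it is actually holding cached data.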
On Mon, Oct 22, 2012 at 2:35 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> So, I've got 14GB of RAM that the OS is just refusing to use for disk or
> page cache. Does anyone know what might cause that?

Maybe there's just nothing to put inside? How big is your database? How much of it gets accessed?
On 10/22/2012 12:44 PM, Claudio Freire wrote:
> Maybe there's just nothing to put inside?
> How big is your database? How much of it gets accessed?

Trust me, there's plenty. We have a DB that's 6x larger than RAM that's currently experiencing 6000TPS, and according to iostat, anywhere from 20-60% disk utilization that's mostly reads.

It's pretty aggressively keeping that 14GB free, and it's driving me nuts. :)

--
Shaun Thomas
sthomas@optionshouse.com
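(For what it's worth, that utilization figure comes from watching something along the lines of

    iostat -xm 5

during market hours; the %util column is the number I'm quoting.)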
On Mon, 22 Oct 2012 12:35:32 -0500
Shaun Thomas <sthomas@optionshouse.com> wrote:

> Hey everyone!
>
> This is pretty embarrassing, but I've never seen this before. This is
> our system's current memory allocation from 'free -m':
>
>              total       used       free    buffers     cached
> Mem:         72485      58473      14012          3      34020
> -/+ buffers/cache:      24449      48036
>
> So, I've got 14GB of RAM that the OS is just refusing to use for disk
> or page cache. Does anyone know what might cause that?

Maybe it's not needed? What makes you think the OS should allocate all the memory?

--
Frank Lanitz <frank@frank.uvena.de>
On Mon, Oct 22, 2012 at 2:49 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>> Maybe there's just nothing to put inside?
>> How big is your database? How much of it gets accessed?
>
> Trust me, there's plenty. We have a DB that's 6x larger than RAM that's
> currently experiencing 6000TPS, and according to iostat, anywhere from
> 20-60% disk utilization that's mostly reads.
>
> It's pretty aggressively keeping that 14GB free, and it's driving me nuts.
> :)

Did you check the kernel's zone_reclaim_mode?
On Mon, Oct 22, 2012 at 12:49:49PM -0500, Shaun Thomas wrote:
> Trust me, there's plenty. We have a DB that's 6x larger than RAM
> that's currently experiencing 6000TPS, and according to iostat,
> anywhere from 20-60% disk utilization that's mostly reads.

Could it be related to zone_reclaim_mode? What is vm.zone_reclaim_mode set to?

/marcus
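If you haven't looked yet, it's quick to check and to change at runtime; this is just the stock procfs/sysctl interface, so it should apply on your 3.2 kernel:

    # check the current value (0 means zone reclaim is off)
    cat /proc/sys/vm/zone_reclaim_mode
    sysctl vm.zone_reclaim_mode

    # turn it off at runtime
    sysctl -w vm.zone_reclaim_mode=0

    # and make it persistent across reboots
    echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf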
On 10/22/2012 12:53 PM, Claudio Freire wrote:
> Did you check the kernel's zone_reclaim_mode?

It's currently set to 0, which as I'm led to believe, is the setting I want there. But here's something interesting:

numactl --hardware

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 36853 MB
node 0 free: 13816 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 36863 MB
node 1 free: 751 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Looks like CPU 0 is hoarding memory. :(

--
Shaun Thomas
sthomas@optionshouse.com
On Mon, Oct 22, 2012 at 3:01 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>> Did you check the kernel's zone_reclaim_mode?
>
> It's currently set to 0, which as I'm led to believe, is the setting I want
> there.

Yep

> But here's something interesting:
>
> numactl --hardware
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
> node 0 size: 36853 MB
> node 0 free: 13816 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
> node 1 size: 36863 MB
> node 1 free: 751 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> Looks like CPU 0 is hoarding memory. :(

You may want to try setting the numa policy before launching postgres:

numactl --interleave=all pg_ctl start

or

numactl --preferred=+0 pg_ctl start
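In practice that just means wrapping whatever your init script runs; the data directory and log path below are only placeholders:

    numactl --interleave=all pg_ctl -D /path/to/pgdata -l /path/to/logfile start

You can also sanity-check what policy a command will inherit:

    numactl --show                              # policy of the current shell
    numactl --interleave=all numactl --show     # should report an interleave policy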
This is a good general discussion of the problem - looks like you could replace "MySQL" with "PostgreSQL" everywhere without loss of generality:

http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

Dan
On 10/22/2012 01:20 PM, Franklin, Dan (FEN) wrote:
> http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/

Yeah, I remember reading that a while back. While interesting, it doesn't really apply to PG, in that unlike MySQL, we don't dedicate a huge chunk of RAM to one big, directly allocated memory segment. With MySQL, it's not uncommon to dedicate over 50% of RAM to the MySQL process itself, but I don't often see PG systems with more than 8GB in shared_buffers. All the rest should be available for general, on-demand allocation. At least, in theory.

--
Shaun Thomas
sthomas@optionshouse.com
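For the record, ours is sized right at that line; the quick way to check on any box is just:

    psql -c "SHOW shared_buffers;"    # ours reports 8GB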
On 10/22/2012 01:14 PM, Claudio Freire wrote:
> You may want to try setting the numa policy before launching postgres:
>
> numactl --interleave=all pg_ctl start

I thought about that. I'd try it on one of our stage nodes, but both of them show an even memory split. I'm not sure why our prod node is acting this way. We've used bcfg2, so every server has the exact same configuration, including kernel parameters, startup settings, and so on.

I can only conclude that there's something about the activity itself that's causing it. I'll have to take another look after the market closes to see if the unallocated chunk shrinks.

--
Shaun Thomas
sthomas@optionshouse.com
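I'll probably just leave something like this running in a screen session so I can see whether node 0's free figure actually moves after the close:

    watch -n 60 'numactl --hardware | grep free'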
On Mon, Oct 22, 2012 at 3:24 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>> http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/
>
> Yeah, I remember reading that a while back. While interesting, it doesn't
> really apply to PG, in that unlike MySQL, we don't dedicate a huge chunk
> of RAM to one big, directly allocated memory segment. With MySQL, it's
> not uncommon to dedicate over 50% of RAM to the MySQL process itself, but
> I don't often see PG systems with more than 8GB in shared_buffers.

Actually, one problem that creeps up in PG is that shared buffers tend to be allocated all within one node (the postmaster's), stealing a lot from workers.

I had written a patch that sets the policy to interleave in the master while launching (and setting up shared buffers), and then back to preferring local when forking a worker. I never had a chance to test it. I only have one numa system, and it's in production, so I can't really test much there.

I think, unless it gives you trouble with the page cache, numactl --preferred=+0 should work nicely for postgres overall. Failing that, numactl --interleave=all would, IMO, be better than the system default.
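If you want to see where the shared segment actually landed, the kernel will tell you per node. Something along these lines should work (the data directory is just an example, and -p needs a reasonably recent numastat):

    PGPID=$(head -n 1 /path/to/pgdata/postmaster.pid)
    numastat -p "$PGPID"                       # per-node memory for the postmaster
    grep 'N[0-9]*=' /proc/$PGPID/numa_maps     # raw per-node page counts per mapping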
On 10/22/2012 01:44 PM, Claudio Freire wrote:
> I think, unless it gives you trouble with the page cache, numactl
> --preferred=+0 should work nicely for postgres overall. Failing that,
> numactl --interleave=all would, IMO, be better than the system
> default.

Thanks, I'll consider that.

FWIW, our current stage cluster node is *not* doing this at all. In fact, here's a numastat from stage:

                           node0           node1
numa_hit              1623243097      1558610594
numa_miss              257459057       310098727
numa_foreign           310098727       257459057
interleave_hit          25822175        26010606
local_node            1616379287      1545600377
other_node             264322867       323108944

Then from prod:

                           node0           node1
numa_hit              4987625178      3695967931
numa_miss             1678204346       418284176
numa_foreign           418284176      1678204370
interleave_hit             27578           27720
local_node            4988131216      3696305260
other_node            1677698308       417946847

Note how ridiculously uneven node0 and node1 are in comparison to what we're seeing in stage. I'm willing to bet something is just plain wrong with our current production node. So I'm working with our NOC team to schedule a failover to the alternate node. If that resolves it, I'll see if I can't get some kind of answer from our infrastructure guys to share in case someone else encounters this.

Yes, even if that answer is "reboot." :)

Thanks again!

--
Shaun Thomas
sthomas@optionshouse.com
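P.S. To put a rough number on "ridiculously uneven": node0's miss ratio on prod is nearly double what stage sees, and prod shows essentially no interleave activity at all (~27K interleave hits versus ~26M on stage), so the two boxes are clearly placing memory very differently.

    # miss ratio = numa_miss / (numa_hit + numa_miss), node0
    echo 'scale=3; 1678204346 / (4987625178 + 1678204346)' | bc    # prod  -> ~.25
    echo 'scale=3; 257459057  / (1623243097 + 257459057)'  | bc    # stage -> ~.14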
Sorry for the late response, but maybe you are still struggling.

It can be that some query (or queries) uses a lot of work memory, either because of a high work_mem setting or because of a planner error. In that case, the moment the query runs it will need memory that will later be returned and become free. Usually this can be seen as an active-memory spike with a lot of free memory afterwards.

2012/10/22 Shaun Thomas <sthomas@optionshouse.com>:
> Hey everyone!
>
> This is pretty embarrassing, but I've never seen this before. This is our
> system's current memory allocation from 'free -m':
>
>              total       used       free    buffers     cached
> Mem:         72485      58473      14012          3      34020
> -/+ buffers/cache:      24449      48036
>
> So, I've got 14GB of RAM that the OS is just refusing to use for disk or
> page cache. Does anyone know what might cause that?
>
> Our uname -sir, for reference:
>
> Linux 3.2.0-31-generic x86_64

--
Best regards,
Vitalii Tymchyshyn
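If you want to rule this out quickly, check what queries are allowed to grab and watch for the spike itself, e.g.:

    psql -c "SHOW work_mem;"
    psql -c "SHOW maintenance_work_mem;"
    vmstat 5    # a transient dip in "free" that comes right back would fit this pattern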
On 10/27/2012 10:49 PM, Віталій Тимчишин wrote:
> It can be that some query (or queries) uses a lot of work memory, either
> because of a high work_mem setting or because of a planner error. In that
> case, the moment the query runs it will need memory that will later be
> returned and become free. Usually this can be seen as an active-memory
> spike with a lot of free memory afterwards.

Yeah, I had briefly considered that. But our work_mem is only 16MB, and even a giant query would have trouble allocating 10+GB with that size of work_mem buckets.

That's why I later listed the numa info. In our case, processor 0 is heavily unbalanced with its memory accesses compared to processor 1. I think the theory is that, because we didn't start with interleave, our 8GB shared_buffers segment landed entirely on processor 0, which unbalanced a lot of other stuff.

Of course, that leaves 4-6GB unaccounted for. And numactl still shows a heavy preference for freeing memory from proc 0. It seems to only do it on this node, so we're going to switch nodes soon and see if the problem reappears. We may have to perform a node hardware audit if this persists.

Thanks for your input, though. :)

--
Shaun Thomas
sthomas@optionshouse.com
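P.S. Back-of-the-envelope on why I don't buy the work_mem theory: even a pathological plan (or a pile of them) running, say, a hundred sort or hash nodes at the same time only gets to about

    100 * 16MB = 1600MB

which is nowhere near the 10+GB that would have to be in flight at once to explain the gap.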