Thread: Out of shared mem on new box with more mem, 9.1.5 -> 9.1.6
We've run into a perplexing issue with a customer database. He moved
from a 9.1.5 to a 9.1.6 and upgraded from an EC2 m1.medium (3.75GB
RAM, 1.3 GB shmmax), to an m2.xlarge (17GB RAM, 5.7 GB shmmax), and is
now regularly getting constant errors regarding running out of shared
memory (there were none on the old system in the recent couple of
days' logs from before the upgrade):

ERROR: out of shared memory
HINT: You might need to increase max_pred_locks_per_transaction.

The query causing this has structurally identical plans on both systems:

old: http://explain.depesz.com/s/Epzq
new: http://explain.depesz.com/s/WZo

The settings ( "select name, setting from pg_settings where source <>
'default' and name not like 'log%' and name not like 'ssl%' and name
not like 'syslog%'" ) are almost identical
(max_pred_locks_per_transaction itself is at the default):

17c17
< effective_cache_size | 1530000
---
> effective_cache_size | 337500
38c38
< shared_buffers | 424960
---
> shared_buffers | 93696

The kernels are both 2.6.32. The workload has not changed
significantly. Could something in 9.1.6 be to blame here? Looking at
the changelog, this seems vanishingly unlikely. Any ideas?
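For anyone poking at this, a rough way to see how many predicate (SIRead) locks the server is actually holding, and at what granularity, is to query pg_locks directly. This is only a sketch; it assumes you can connect to the affected database while the errors are occurring:

  -- Count currently-held predicate locks by granularity; for SIRead locks
  -- the locktype column will be 'relation', 'page', or 'tuple'.
  SELECT locktype, count(*)
  FROM pg_locks
  WHERE mode = 'SIReadLock'
  GROUP BY locktype
  ORDER BY count(*) DESC;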
>
> We've run into a perplexing issue with a customer database. He moved
> from a 9.1.5 to a 9.1.6 and upgraded from an EC2 m1.medium (3.75GB
> RAM, 1.3 GB shmmax), to an m2.xlarge (17GB RAM, 5.7 GB shmmax), and is
> now regularly getting constant errors regarding running out of shared
> memory (there were none on the old system in the recent couple of
> days' logs from before the upgrade):
>
> ERROR: out of shared memory
> HINT: You might need to increase max_pred_locks_per_transaction.
>
> The query causing this has structurally identical plans on both systems:
>
> old: http://explain.depesz.com/s/Epzq
> new: http://explain.depesz.com/s/WZo
>
> The settings ( "select name, setting from pg_settings where source <>
> 'default' and name not like 'log%' and name not like 'ssl%' and name
> not like 'syslog%'" ) are almost identical
> (max_pred_locks_per_transaction itself is at the default):
>
> 17c17
> < effective_cache_size | 1530000
> ---
> > effective_cache_size | 337500
> 38c38
> < shared_buffers | 424960
> ---
> > shared_buffers | 93696
>
> The kernels are both 2.6.32. The workload has not changed
> significantly. Could something in 9.1.6 be to blame here? Looking at
> the changelog, this seems vanishingly unlikely. Any ideas?
>
What are the settings for:
work_mem
maintenance_work_mem
How many concurrent connections are there?
Have you run explain analyze on the query that doesn't crash (i.e. the old box) to get the exact execution plan?
Has the DB been vacuum analyzed?
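For reference, a minimal way to pull those numbers together on a 9.1 box (a sketch, assuming direct psql access; 9.1's pg_stat_activity still exposes current_query rather than state):

  -- The settings asked about above, in one pass.
  SELECT name, setting, unit
  FROM pg_settings
  WHERE name IN ('work_mem', 'maintenance_work_mem',
                 'max_connections', 'max_pred_locks_per_transaction');

  -- Rough count of connections currently running something.
  SELECT count(*) AS active_connections
  FROM pg_stat_activity
  WHERE current_query <> '<IDLE>';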
Cheers
=============================================
Romax Technology Limited
www.romaxtech.com
=============================================
On Wed, Oct 17, 2012 at 1:53 AM, Martin French <Martin.French@romaxtech.com> wrote:

Thanks for your response.

> What are the settings for:
> work_mem
100MB

> maintenance_work_mem
64MB

> How many concurrent connections are there?
~20

> Have you run explain analyze on the query that doesn't crash (i.e. the old
> box) to get the exact execution plan?

I can try that in the morning, but I didn't think this was relevant. I
know cost estimates can be off, but can the plan actually change
between a vanilla explain and an explain analyze?

> Has the DB been vacuum analyzed?

Not outside of autovacuum, no, but it's actually a former replica of
the first database (sorry I neglected to mention this earlier).
> On Wed, Oct 17, 2012 at 1:53 AM, Martin French
> <Martin.French@romaxtech.com> wrote:
>
> Thanks for your response.
>
> > What are the settings for:
> > work_mem
> 100MB
This is a little higher than I would ordinarily set; I tend to cap it at about 64MB.
>
> > maintenance_work_mem
> 64MB
In contrast, this is a little low for me, but I guess table size is a big factor here.
>
> > How many concurrent connections are there?
> ~20
>
> > Have you run explain analyze on the query that doesn't crash (i.e. the old
> > box) to get the exact execution plan?
>
> I can try that in the morning, but I didn't think this was relevant. I
> know cost estimates can be off, but can the plan actually change
> between a vanilla explain and an explain analyze?
>
Explain analyze actually executes the query, so it gives the real row counts and timings alongside the planner's estimates rather than estimates alone.
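As an illustration (not from the original exchange), on 9.1 this would look something like the following, substituting the customer's real query for the sample statement; because EXPLAIN ANALYZE executes the statement, anything that writes should be wrapped in a transaction and rolled back:

  BEGIN;
  -- Replace the sample SELECT with the actual problem query.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*) FROM pg_class;
  ROLLBACK;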
>
> > Has the DB been vacuum analyzed?
>
> Not outside of autovacuum, no, but it's actually a former replica of
> the first database (sorry I neglected to mention this earlier).
>
This may be worthwhile. Even with autovacuum on, I still run VACUUM ANALYZE manually during quiet periods; whether it's actually necessary or not, I figure it's belt and braces.
Looking at the explain output, I'd say the tables aren't very large, so I can't really see why there'd be a problem, notwithstanding the fact that your shared_buffers is relatively small.
Are there no other messages in the log files regarding out of memory? There should be a dump that will show you where the memory usage is occurring.
Other than that, you may want to consider increasing shared_buffers and see if that has any effect. Alternatively, you may want to increase max_pred_locks_per_transaction beyond the default of 64, although this is not a parameter I've yet had to adjust.
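As a sketch of that last suggestion: on 9.1, max_pred_locks_per_transaction sizes a shared-memory table, so it can only be changed in postgresql.conf and takes effect after a restart. The value below is purely illustrative, not a recommendation:

  # postgresql.conf
  max_pred_locks_per_transaction = 256    # default is 64; requires a server restart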
Cheers
=============================================
Romax Technology Limited
www.romaxtech.com
=============================================
Maciek Sakrejda <m.sakrejda@gmail.com> writes:
> We've run into a perplexing issue with a customer database. He moved
> from a 9.1.5 to a 9.1.6 and upgraded from an EC2 m1.medium (3.75GB
> RAM, 1.3 GB shmmax), to an m2.xlarge (17GB RAM, 5.7 GB shmmax), and is
> now regularly getting constant errors regarding running out of shared
> memory (there were none on the old system in the recent couple of
> days' logs from before the upgrade):

> ERROR: out of shared memory
> HINT: You might need to increase max_pred_locks_per_transaction.

This has nothing to do with work_mem nor maintenance_work_mem; rather,
it means you're running out of space in the database-wide lock table.
You need to take the hint's advice.

> The query causing this has structurally identical plans on both systems:
> old: http://explain.depesz.com/s/Epzq
> new: http://explain.depesz.com/s/WZo

The query in itself doesn't seem very exceptional. I wonder whether
you recently switched your application to use serializable mode? But
anyway, a query's demand for predicate locks can depend on a lot of
not-very-visible factors, such as how many physical pages the tuples
it accesses are spread across. I don't find it too hard to credit
that yesterday you were just under the limit and today you're just
over even though "nothing changed".

			regards, tom lane
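For context (my reading of the documentation, not part of the message above): the shared predicate-lock table can track roughly max_pred_locks_per_transaction * (max_connections + max_prepared_transactions) objects in total, so a single page-spread-heavy query under serializable isolation can plausibly push a default-configured server over the edge. A quick way to compute that capacity on a given box:

  SELECT (SELECT setting::int FROM pg_settings
          WHERE name = 'max_pred_locks_per_transaction')
       * ((SELECT setting::int FROM pg_settings
           WHERE name = 'max_connections')
        + (SELECT setting::int FROM pg_settings
           WHERE name = 'max_prepared_transactions')) AS approx_predicate_lock_slots;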
On Wed, Oct 17, 2012 at 7:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ERROR: out of shared memory
>> HINT: You might need to increase max_pred_locks_per_transaction.
>
> This has nothing to do with work_mem nor maintenance_work_mem; rather,
> it means you're running out of space in the database-wide lock table.
> You need to take the hint's advice.

Sure, just trying to understand why this happened in the first place.

> The query in itself doesn't seem very exceptional. I wonder whether
> you recently switched your application to use serializable mode?

The change (for some transactions) was relatively recent, but predated
the switch to the replica by several days. Before the switch,
everything was running fine.

> But anyway, a query's demand for predicate locks can depend on a lot of
> not-very-visible factors, such as how many physical pages the tuples
> it accesses are spread across. I don't find it too hard to credit
> that yesterday you were just under the limit and today you're just
> over even though "nothing changed".

Interesting, thanks for the input. So it could be just a coincidence
that the errors occurred in lock-step with the promotion? Or does a
replica have a different (or different enough) physical layout that
this could have been a factor (my understanding of replication is
relatively high level--read: vague)?
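One way to compare the physical layout on the two boxes would be to run the same size query on each and diff the results (a sketch; relpages and reltuples are the planner's last-analyzed figures, pg_relation_size is the current on-disk size):

  SELECT relname, relpages, reltuples,
         pg_size_pretty(pg_relation_size(pg_class.oid)) AS on_disk
  FROM pg_class
  WHERE relkind = 'r'
  ORDER BY relpages DESC
  LIMIT 20;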