Thread: Out of shared mem on new box with more mem, 9.1.5 -> 9.1.6
We've run into a perplexing issue with a customer database. He moved
from a 9.1.5 to a 9.1.6 and upgraded from an EC2 m1.medium (3.75GB
RAM, 1.3 GB shmmax), to an m2.xlarge (17GB RAM, 5.7 GB shmmax), and is
now regularly getting constant errors regarding running out of shared
memory (there were none on the old system in the recent couple of
days' logs from before the upgrade):

ERROR: out of shared memory
HINT: You might need to increase max_pred_locks_per_transaction.

The query causing this has structurally identical plans on both systems:

old: http://explain.depesz.com/s/Epzq
new: http://explain.depesz.com/s/WZo

The settings ( "select name, setting from pg_settings where source <>
'default' and name not like 'log%' and name not like 'ssl%' and name
not like 'syslog%'" ) are almost identical
(max_pred_locks_per_transaction itself is at the default):

17c17
< effective_cache_size | 1530000
---
> effective_cache_size | 337500
38c38
< shared_buffers | 424960
---
> shared_buffers | 93696

The kernels are both 2.6.32. The workload has not changed
significantly. Could something in 9.1.6 be to blame here? Looking at
the changelog, this seems vanishingly unlikely. Any ideas?
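For anyone poking at this, a rough way to see how many predicate (SIRead) locks the server is actually holding, and at what granularity, is to query pg_locks directly. This is only a sketch; it assumes you can connect to the affected database while the errors are occurring:

  -- Count currently-held predicate locks by granularity; for SIRead locks
  -- the locktype column will be 'relation', 'page', or 'tuple'.
  SELECT locktype, count(*)
  FROM pg_locks
  WHERE mode = 'SIReadLock'
  GROUP BY locktype
  ORDER BY count(*) DESC;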
>
> We've run into a perplexing issue with a customer database. He moved
> from a 9.1.5 to a 9.1.6 and upgraded from an EC2 m1.medium (3.75GB
> RAM, 1.3 GB shmmax), to an m2.xlarge (17GB RAM, 5.7 GB shmmax), and is
> now regularly getting constant errors regarding running out of shared
> memory (there were none on the old system in the recent couple of
> days' logs from before the upgrade):
>
> ERROR: out of shared memory
> HINT: You might need to increase max_pred_locks_per_transaction.
>
> The query causing this has structurally identical plans on both systems:
>
> old: http://explain.depesz.com/s/Epzq
> new: http://explain.depesz.com/s/WZo
>
> The settings ( "select name, setting from pg_settings where source <>
> 'default' and name not like 'log%' and name not like 'ssl%' and name
> not like 'syslog%'" ) are almost identical
> (max_pred_locks_per_transaction itself is at the default):
>
> 17c17
> < effective_cache_size | 1530000
> ---
> > effective_cache_size | 337500
> 38c38
> < shared_buffers | 424960
> ---
> > shared_buffers | 93696
>
> The kernels are both 2.6.32. The workload has not changed
> significantly. Could something in 9.1.6 be to blame here? Looking at
> the changelog, this seems vanishingly unlikely. Any ideas?
>
What are the settings for:
work_mem
maintenance_work_mem
How many concurrent connections are there?
Have you run explain analyze on the query that doesn't crash (i.e. the old box) to get the exact execution plan?
Has the DB been vacuum analyzed?
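For reference, a minimal way to pull those numbers together on a 9.1 box (a sketch, assuming direct psql access; 9.1's pg_stat_activity still exposes current_query rather than state):

  -- The settings asked about above, in one pass.
  SELECT name, setting, unit
  FROM pg_settings
  WHERE name IN ('work_mem', 'maintenance_work_mem',
                 'max_connections', 'max_pred_locks_per_transaction');

  -- Rough count of connections currently running something.
  SELECT count(*) AS active_connections
  FROM pg_stat_activity
  WHERE current_query <> '<IDLE>';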
Cheers
=============================================
Romax Technology Limited
www.romaxtech.com
=============================================
On Wed, Oct 17, 2012 at 1:53 AM, Martin French <Martin.French@romaxtech.com> wrote:

Thanks for your response.

> What are the settings for:
> work_mem
100MB

> maintenance_work_mem
64MB

> How many concurrent connections are there?
~20

> Have you run explain analyze on the query that doesn't crash (i.e. the old
> box) to get the exact execution plan?

I can try that in the morning, but I didn't think this was relevant. I
know cost estimates can be off, but can the plan actually change
between a vanilla explain and an explain analyze?

> Has the DB been vacuum analyzed?

Not outside of autovacuum, no, but it's actually a former replica of
the first database (sorry I neglected to mention this earlier).
> On Wed, Oct 17, 2012 at 1:53 AM, Martin French
> <Martin.French@romaxtech.com> wrote:
>
> Thanks for your response.
>
> > What are the settings for:
> > work_mem
> 100MB
This is a little higher than I would ordinarily set; I tend to cap it at about 64MB.
>
> > maintenance_work_mem
> 64MB
In contrast, this is a little low for me, but I guess table size is a big factor here.
>
> > How many concurrent connections are there?
> ~20
>
> > Have you run explain analyze on the query that doesn't crash (i.e. the old
> > box) to get the exact execution plan?
>
> I can try that in the morning, but I didn't think this was relevant. I
> know cost estimates can be off, but can the plan actually change
> between a vanilla explain and an explain analyze?
>
Explain analyze actually executes the query, so it gives the real row counts and timings alongside the planner's estimates rather than estimates alone.
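As an illustration (not from the original exchange), on 9.1 this would look something like the following, substituting the customer's real query for the sample statement; because EXPLAIN ANALYZE executes the statement, anything that writes should be wrapped in a transaction and rolled back:

  BEGIN;
  -- Replace the sample SELECT with the actual problem query.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*) FROM pg_class;
  ROLLBACK;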
>
> > Has the DB been vacuum analyzed?
>
> Not outside of autovacuum, no, but it's actually a former replica of
> the first database (sorry I neglected to mention this earlier).
>
This may be worthwhile. Even with autovacuum on, I still run VACUUM ANALYZE manually during quiet periods; whether it's actually necessary or not, I figure it's belt and braces.
Looking at the explain output, I'd say the tables aren't very large, so I can't really see why there'd be a problem, notwithstanding the fact that your shared_buffers is relatively small.
Are there no other messages in the log files regarding out of memory? There should be a dump that will show you where the memory usage is occurring.
Other than that, you may want to consider increasing shared_buffers and see if that has any effect. Alternatively, you may want to increase max_pred_locks_per_transaction beyond the default of 64, although this is not a parameter I've yet had to adjust.
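As a sketch of that last suggestion: on 9.1, max_pred_locks_per_transaction sizes a shared-memory table, so it can only be changed in postgresql.conf and takes effect after a restart. The value below is purely illustrative, not a recommendation:

  # postgresql.conf
  max_pred_locks_per_transaction = 256    # default is 64; requires a server restart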
Cheers
=============================================
Romax Technology Limited
www.romaxtech.com
=============================================
Maciek Sakrejda <m.sakrejda@gmail.com> writes:
> We've run into a perplexing issue with a customer database. He moved
> from a 9.1.5 to a 9.1.6 and upgraded from an EC2 m1.medium (3.75GB
> RAM, 1.3 GB shmmax), to an m2.xlarge (17GB RAM, 5.7 GB shmmax), and is
> now regularly getting constant errors regarding running out of shared
> memory (there were none on the old system in the recent couple of
> days' logs from before the upgrade):

> ERROR: out of shared memory
> HINT: You might need to increase max_pred_locks_per_transaction.

This has nothing to do with work_mem nor maintenance_work_mem; rather,
it means you're running out of space in the database-wide lock table.
You need to take the hint's advice.

> The query causing this has structurally identical plans on both systems:
> old: http://explain.depesz.com/s/Epzq
> new: http://explain.depesz.com/s/WZo

The query in itself doesn't seem very exceptional. I wonder whether
you recently switched your application to use serializable mode? But
anyway, a query's demand for predicate locks can depend on a lot of
not-very-visible factors, such as how many physical pages the tuples
it accesses are spread across. I don't find it too hard to credit
that yesterday you were just under the limit and today you're just
over even though "nothing changed".

			regards, tom lane
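For context (my reading of the documentation, not part of the message above): the shared predicate-lock table can track roughly max_pred_locks_per_transaction * (max_connections + max_prepared_transactions) objects in total, so a single page-spread-heavy query under serializable isolation can plausibly push a default-configured server over the edge. A quick way to compute that capacity on a given box:

  SELECT (SELECT setting::int FROM pg_settings
          WHERE name = 'max_pred_locks_per_transaction')
       * ((SELECT setting::int FROM pg_settings
           WHERE name = 'max_connections')
        + (SELECT setting::int FROM pg_settings
           WHERE name = 'max_prepared_transactions')) AS approx_predicate_lock_slots;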
On Wed, Oct 17, 2012 at 7:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ERROR: out of shared memory
>> HINT: You might need to increase max_pred_locks_per_transaction.
>
> This has nothing to do with work_mem nor maintenance_work_mem; rather,
> it means you're running out of space in the database-wide lock table.
> You need to take the hint's advice.

Sure, just trying to understand why this happened in the first place.

> The query in itself doesn't seem very exceptional. I wonder whether
> you recently switched your application to use serializable mode?

The change (for some transactions) was relatively recent, but predated
the switch to the replica by several days. Before the switch,
everything was running fine.

> But anyway, a query's demand for predicate locks can depend on a lot of
> not-very-visible factors, such as how many physical pages the tuples
> it accesses are spread across. I don't find it too hard to credit
> that yesterday you were just under the limit and today you're just
> over even though "nothing changed".

Interesting, thanks for the input. So it could be just a coincidence
that the errors occurred in lock-step with the promotion? Or does a
replica have a different (or different enough) physical layout that
this could have been a factor (my understanding of replication is
relatively high level--read: vague)?
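One way to compare the physical layout on the two boxes would be to run the same size query on each and diff the results (a sketch; relpages and reltuples are the planner's last-analyzed figures, pg_relation_size is the current on-disk size):

  SELECT relname, relpages, reltuples,
         pg_size_pretty(pg_relation_size(pg_class.oid)) AS on_disk
  FROM pg_class
  WHERE relkind = 'r'
  ORDER BY relpages DESC
  LIMIT 20;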