Thread: Possible Corrupted shared memory

Possible Corrupted shared memory

From: James Sebastian
Dear all,

I am a novice in the Postgres admin world.

I am using PostgreSQL 9.1.11 on Ubuntu 12.04, backing an application server running on Tomcat.

The application slowed down considerably, and postgresql-9.1-main.log reported 'pgstat wait timeout' warnings as below.

2015-07-30 09:38:20 IST WARNING:  pgstat wait timeout
2015-07-30 09:50:20 IST WARNING:  pgstat wait timeout
2015-07-30 09:50:26 IST WARNING:  pgstat wait timeout
2015-07-30 10:00:55 IST WARNING:  pgstat wait timeout
2015-07-30 10:04:51 IST WARNING:  pgstat wait timeout
2015-07-30 10:06:51 IST WARNING:  pgstat wait timeout
2015-07-30 10:10:21 IST WARNING:  pgstat wait timeout
2015-07-30 10:12:21 IST WARNING:  pgstat wait timeout

I stopped the application and shut down the database using the init scripts, then attempted a start.
The database failed to restart, with the following errors in the log.

2015-07-30 14:14:17 IST LOG:  received immediate shutdown request
2015-07-30 14:14:17 IST WARNING:  terminating connection because of crash of another server process
2015-07-30 14:14:17 IST DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-07-30 14:14:17 IST HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2015-07-30 14:14:17 IST LOG:  could not send data to client: Broken pipe
2015-07-30 14:14:17 IST WARNING:  terminating connection because of crash of another server process
2015-07-30 14:14:17 IST DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-07-30 14:14:17 IST HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2015-07-30 14:14:17 IST LOG:  could not send data to client: Broken pipe

Recovery took more than 2.5 hours before the database was accepting connections again. This caused considerable disruption to normal service.

Hardware support, after analysis, reports no hardware faults related to the network or I/O (disks). They also could not find any network errors such as IP conflicts or IP changes.

I would like to troubleshoot this further to figure out what caused the problem.
Any help is appreciated, as are any precautions I can take to better understand such problems if they occur in future.

Regards,
James

Re: Possible Corrupted shared memory

From: Tom Lane
James Sebastian <james.sebastian@gmail.com> writes:
> I stopped the application and shut down the database using the init
> scripts, then attempted a start.
> The database failed to restart, with the following errors in the log.

> 2015-07-30 14:14:17 IST LOG:  received immediate shutdown request
> ...

> Recovery took more than 2.5 hours before the database was accepting
> connections again. This caused considerable disruption to normal
> service.

If you use "immediate" mode shutdown, then yes, crash recovery is what
will ensue.  Think of that as being like the large red panic button that
dumps the fuel rods out of a nuclear reactor: you push that button only
when disaster is otherwise certain, because recovering will be slow and
expensive.

For an ordinary clean shutdown, you use "fast" mode, or "smart" mode
if you're not expecting any long-lived client sessions.

(If you're using a wrapper script that defaults to "immediate" mode,
I'd advise changing that default.)
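
For reference, with pg_ctl the shutdown mode is selected with -m; a
quick sketch (binary and data directory paths are the stock
Debian/Ubuntu ones, adjust to your install):

    PGBIN=/usr/lib/postgresql/9.1/bin
    PGDATA=/var/lib/postgresql/9.1/main

    # ordinary clean shutdown: disconnect sessions, write a shutdown checkpoint
    sudo -u postgres $PGBIN/pg_ctl stop -D $PGDATA -m fast

    # wait for clients to disconnect on their own first
    sudo -u postgres $PGBIN/pg_ctl stop -D $PGDATA -m smart

    # the panic button: no shutdown checkpoint, crash recovery on next start
    sudo -u postgres $PGBIN/pg_ctl stop -D $PGDATA -m immediate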

Also, if it took that long to recover, you might have raised the
checkpoint interval settings too high.

            regards, tom lane


Re: Possible Corrupted shared memory

From: James Sebastian


On 1 August 2015 at 19:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Sebastian <james.sebastian@gmail.com> writes:
> I stopped the application and shut down the database using the init
> scripts, then attempted a start.
> The database failed to restart, with the following errors in the log.

> 2015-07-30 14:14:17 IST LOG:  received immediate shutdown request
> ...

> Recovery took more than 2.5 hours before the database was accepting
> connections again. This caused considerable disruption to normal
> service.

If you use "immediate" mode shutdown, then yes, crash recovery is what
will ensue.  Think of that as being like the large red panic button that
dumps the fuel rods out of a nuclear reactor: you push that button only
when disaster is otherwise certain, because recovering will be slow and
expensive.

For an ordinary clean shutdown, you use "fast" mode, or "smart" mode
if you're not expecting any long-lived client sessions.

(If you're using a wrapper script that defaults to "immediate" mode,
I'd advise changing that default.)


I am using the default /etc/init.d/postgresql start|stop scripts.
I will check whether they perform an ordinary clean shutdown.
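Something like the following should tell me which mode is in use (log
path is the stock Ubuntu one):

    # see how the init script actually stops the cluster
    grep -nE 'pg_ctlcluster|pg_ctl' /etc/init.d/postgresql

    # the server logs 'received smart/fast/immediate shutdown request',
    # which shows which mode the last shutdown used
    grep 'shutdown request' /var/log/postgresql/postgresql-9.1-main.log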

Also, if it took that long to recover, you might have raised the
checkpoint interval settings too high.

I am using the following parameters

checkpoint_segments = 10 (from OS default 3)
checkpoint_completion_target = 0.8 (from OS default 0.5)
archive_mode=on
archive_timeout=600

Does any of the above look too high?
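
For reference, the values the server is actually running with can be
checked like this (pg_settings is a standard view; run as the postgres
OS user):

    sudo -u postgres psql -c "
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('checkpoint_segments',
                   'checkpoint_completion_target',
                   'archive_mode', 'archive_timeout');"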

Regards,
James

Re: Possible Corrupted shared memory

From: Tom Lane
James Sebastian <james.sebastian@gmail.com> writes:
> On 1 August 2015 at 19:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, if it took that long to recover, you might have raised the
>> checkpoint interval settings too high.

> I am using the following parameters

> checkpoint_segments = 10 (from OS default 3)
> checkpoint_completion_target = 0.8 (from OS default 0.5)
> archive_mode=on
> archive_timeout=600

[ scratches head... ] It should certainly not have taken very long to
replay 10 WAL segments worth of data.  I surmise that the problems
you were having before the shutdown were worse than you thought, ie
checkpoints were failing to complete, probably due to a persistent
I/O error, so that there was a whole lot more than normal to replay
after the last successful checkpoint.  Is there any evidence of such
distress in the postmaster log?
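
A quick way to look, assuming the stock Ubuntu log location:

    grep -E 'checkpoint|FATAL|PANIC|could not' \
        /var/log/postgresql/postgresql-9.1-main.log | tail -n 50

Setting log_checkpoints = on would also make future checkpoints report
their timings in the log.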

            regards, tom lane


Re: Possible Corrupted shared memory

From: James Sebastian


On 1 August 2015 at 20:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Sebastian <james.sebastian@gmail.com> writes:
> On 1 August 2015 at 19:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, if it took that long to recover, you might have raised the
>> checkpoint interval settings too high.

> I am using the following parameters

> checkpoint_segments = 10 (from OS default 3)
> checkpoint_completion_target = 0.8 (from OS default 0.5)
> archive_mode=on
> archive_timeout=600

[ scratches head... ] It should certainly not have taken very long to
replay 10 WAL segments worth of data.  I surmise that the problems
you were having before the shutdown were worse than you thought, ie
checkpoints were failing to complete, probably due to a persistent
I/O error, so that there was a whole lot more than normal to replay
after the last successful checkpoint.  Is there any evidence of such
distress in the postmaster log?

[....sigh....Thanks.....]

We had very slow application performance and many hanging threads according to pgAdmin -> Server Status.
The logs also had the following, which (per Google search results) probably indicates high I/O:

2015-07-30 10:10:21 IST WARNING:  pgstat wait timeout
2015-07-30 10:12:21 IST WARNING:  pgstat wait timeout

I had the hardware analysed, and according to them there were no disk problems.
Load on the application was usual... and that brings me to this mailing list.

Thanks for all the help so far. I am learning a lot and, coming from a pure OS admin background, becoming a little more comfortable with Postgres administration.
 

Re: Possible Corrupted shared memory

From: Tom Lane
James Sebastian <james.sebastian@gmail.com> writes:
> On 1 August 2015 at 20:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [ scratches head... ] It should certainly not have taken very long to
>> replay 10 WAL segments worth of data.  I surmise that the problems
>> you were having before the shutdown were worse than you thought, ie
>> checkpoints were failing to complete, probably due to a persistent
>> I/O error, so that there was a whole lot more than normal to replay
>> after the last successful checkpoint.  Is there any evidence of such
>> distress in the postmaster log?

> We had very slow application performance and many hanging threads
> according to pgAdmin -> Server Status.
> The logs also had the following, which (per Google search results)
> probably indicates high I/O:

> 2015-07-30 10:10:21 IST WARNING:  pgstat wait timeout
> 2015-07-30 10:12:21 IST WARNING:  pgstat wait timeout

Well, those might mean problems with the stats collector subprocess, but
that's pretty noncritical; it would certainly not have prevented
checkpoints from completing.  No other unexplained log entries?

            regards, tom lane


Re: Possible Corrupted shared memory

From: James Sebastian


>> [ scratches head... ] It should certainly not have taken very long to
>> replay 10 WAL segments worth of data.  I surmise that the problems
>> you were having before the shutdown were worse than you thought, ie
>> checkpoints were failing to complete, probably due to a persistent
>> I/O error, so that there was a whole lot more than normal to replay
>> after the last successful checkpoint.  Is there any evidence of such
>> distress in the postmaster log?

> We had very slow application performance and many hanging threads
> according to pgAdmin -> Server Status.
> The logs also had the following, which (per Google search results)
> probably indicates high I/O:

> 2015-07-30 10:10:21 IST WARNING:  pgstat wait timeout
> 2015-07-30 10:12:21 IST WARNING:  pgstat wait timeout

Well, those might mean problems with the stats collector subprocess, but
that's pretty noncritical; it would certainly not have prevented
checkpoints from completing.  No other unexplained log entries?
  
None. I have the default log configuration as shipped by Ubuntu. I probably need to raise the logging detail; I will do some reading on the settings.
As of now, logging_collector and the other log-related settings are commented out, so the defaults are in effect and everything goes to postgresql-9.1-main.log.

If you have some suggestions, please let me know.
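
From my reading so far, these look like the relevant knobs (standard
9.1 parameters as far as I can tell; values are only illustrative), set
in postgresql.conf:

    log_checkpoints = on                  # report each checkpoint and its timings
    log_lock_waits = on                   # log waits longer than deadlock_timeout
    log_temp_files = 0                    # log every temporary file created
    log_min_duration_statement = 1000     # log statements taking over 1 second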

One area of concern that I am only realising now, and which might have contributed to this, is below.

I am using Postgres as the backend DB of a Java-based application on Tomcat. The JVM is allocated a 2048 MB heap by Tomcat.
But shared_buffers in Postgres was at the default of 32 MB. I increased it to 144 MB and raised shmmax to 320 MB. This machine has 8 GB of RAM and serves as both the database and the Tomcat/application server. I can allocate more memory, but my application-side colleagues want to know why we would keep it high, and how to work out the right value instead of blindly allocating a lot of memory.

I am not even sure how to work out what my buffers should be.
I have now set:
effective_cache_size = 256MB
max_connections = 100
work_mem = 10MB
and all the others are the defaults provided by Postgres 9.1 on Ubuntu 12.04.

Let me know if any of them look bad.
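
One check I came across for judging shared_buffers is the buffer cache
hit ratio (pg_stat_database is a standard view; treating anything well
below 99% on a read-heavy workload as a hint to grow shared_buffers is
only a rule of thumb):

    sudo -u postgres psql -c "
    SELECT datname,
           round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
             AS cache_hit_pct
    FROM pg_stat_database;"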

Regards
James

Re: Possible Corrupted shared memory

From: Wei Shan
Hi James,

Your values seem too low. You allocate 2 GB to the application server but only 144 MB to the DB server. Doesn't that look odd to you?

Perhaps you should consider bumping it up to at least 2 GB.
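
Note that on 9.1, which uses SysV shared memory, the kernel limits have
to be raised to match; a rough sketch with illustrative values:

    # postgresql.conf
    shared_buffers = 2GB

    # kernel limits must exceed the server's total shared memory request
    # (shared_buffers plus overhead); persist these in /etc/sysctl.conf
    sudo sysctl -w kernel.shmmax=4294967296   # max segment size: 4 GB, in bytes
    sudo sysctl -w kernel.shmall=1048576      # total: 4 GB at 4 kB per page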

Cheers.


--
Regards,
Ang Wei Shan

Re: Possible Corrupted shared memory

From: James Sebastian
The total database size is very small, but I will probably increase shared_buffers further. Initially it was at the default of 32 MB; I later increased it six-fold to 144 MB.
Thanks for the suggestion.

On 3 August 2015 at 12:47, Wei Shan <weishan.ang@gmail.com> wrote:
Hi James,

Your values seem too low. You allocate 2 GB to the application server but only 144 MB to the DB server. Doesn't that look odd to you?

Perhaps you should consider bumping it up to at least 2 GB.

Cheers.


--
Regards,
Ang Wei Shan