Re: performance tuning: shared_buffers, sort_mem; swap - Mailing list pgsql-admin

From Tom Lane
Subject Re: performance tuning: shared_buffers, sort_mem; swap
Msg-id 1806.1029277058@sss.pgh.pa.us
In response to Re: performance tuning: shared_buffers, sort_mem; swap  (Thomas O'Connell <tfo@monsterlabs.com>)
List pgsql-admin
"Thomas O'Connell" <tfo@monsterlabs.com> writes:
>> If it happens to select
>> a database backend to kill, the postmaster will interpret the backend's
>> unexpected exit as a crash, and will force a database restart.

> I guess this is what we're seeing, then. Right before the IPC error,
> there are usually several of these:

> "NOTICE:  Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend
> died abnormally and possibly corrupted shared memory.

> is this the forced database restart you mention above?

Yup.  The actual sequence of events is:

1. Some backend dies (which in Unix terms means the postmaster sees a
nonzero wait status for it).  We're hypothesizing that the kernel kills
it with the equivalent of a "kill -9", but core dumps and other untimely
ends would also produce a nonzero status.  Zero status means a normal
exit(0) call.

2. The postmaster gets a report that one of its child processes quit.
Seeing the nonzero status, it assumes the worst and begins the recovery
fire drill.  The first step is for it to send SIGQUIT to all its other
children and wait for them to exit.

3. The other backends receive SIGQUIT, spit out the "The Postmaster has
informed me ..." message to their connected clients, and immediately
exit().

4. When the postmaster has received exit reports for all of its
children, it releases its existing shared memory segment and then begins
the same procedure it would normally use at startup --- of which one of
the first steps is to try to acquire a shmem segment with shmget().
What you are seeing is that shmget() call failing.
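
Steps 1 through 3 can be imitated with ordinary Unix process calls.  The
following is only a sketch and not postmaster code: spawn_backend,
describe_exit and simulate_crash_cycle are made-up names, the stand-in
"backends" do nothing but sleep, exit code 1 on SIGQUIT is an arbitrary
choice, and the "exited with exit code" wording is my own (only the
"was terminated by signal" wording comes from the log in this message).
Step 4 (the shmget() call) is not modeled.

```python
import os
import signal
import time


def describe_exit(status):
    """Turn a raw wait status into a short description."""
    if os.WIFSIGNALED(status):
        return "was terminated by signal %d" % os.WTERMSIG(status)
    if os.WIFEXITED(status):
        return "exited with exit code %d" % os.WEXITSTATUS(status)
    return "had unrecognized wait status %d" % status


def spawn_backend():
    """Fork a stand-in backend that exits on SIGQUIT; return its pid."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(ready_r)
            # Step 3: on SIGQUIT, exit at once (code 1 is an assumption).
            signal.signal(signal.SIGQUIT, lambda signum, frame: os._exit(1))
            os.write(ready_w, b"x")  # tell the parent the handler is set
            os.close(ready_w)
            while True:
                time.sleep(0.1)
        finally:
            os._exit(1)
    os.close(ready_w)
    os.read(ready_r, 1)  # block until the child is ready for signals
    os.close(ready_r)
    return pid


def simulate_crash_cycle(n_bystanders=2):
    """Kill one child with SIGKILL, then SIGQUIT and reap the rest."""
    bystanders = [spawn_backend() for _ in range(n_bystanders)]
    victim = spawn_backend()

    # Step 1: an innocent backend gets the equivalent of "kill -9".
    os.kill(victim, signal.SIGKILL)

    # Step 2: the parent collects the child-death report.
    pid, status = os.waitpid(victim, 0)
    events = ["server process (pid %d) %s" % (pid, describe_exit(status))]

    clean = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    if not clean:
        # Assume the worst: SIGQUIT every other child, then wait for all.
        for other in bystanders:
            os.kill(other, signal.SIGQUIT)
        for other in bystanders:
            pid, status = os.waitpid(other, 0)
            events.append(
                "server process (pid %d) %s" % (pid, describe_exit(status))
            )
    return events


if __name__ == "__main__":
    for line in simulate_crash_cycle():
        print(line)
```

The point to notice is that the parent learns about the kill only through
the wait status, and that the bystanders' exits are a consequence of the
first death rather than a cause of it.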

Therefore, the "postmaster has informed me" messages are also post-crash
noise.  What you want to look for is what happened to trigger this whole
sequence.  What I'd expect to see is something like the attached log,
which I made by issuing a manual kill -9 against a perfectly innocent
backend:

<<normal operation up to here>>
LOG:  server process (pid 1700) was terminated by signal 9
LOG:  terminating any other active server processes
WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
LOG:  all server processes terminated; reinitializing shared memory and semaphores
LOG:  database system was interrupted at 2002-08-13 16:43:33 EDT
LOG:  checkpoint record is at 0/1C129B8
LOG:  redo record is at 0/1C129B8; undo record is at 0/0; shutdown FALSE
LOG:  next transaction id: 5641; next oid: 156131
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  ReadRecord: record with zero length at 0/1C129F8
LOG:  redo is not required
LOG:  database system is ready
<<normal operation resumes>>

(This is with development sources, which label the messages a little
differently than prior releases, but you should see pretty much the same
text in your postmaster log.)  The "terminated by signal 9" part,
which the postmaster prints out when it gets a child-death report with a
nonzero exit status, is the actually useful information in this series.
I have only one "postmaster informed me" message because there was only
one other live backend, but in general you might see a bunch of 'em.
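
If the postmaster log is long, a small script can pick out the
triggering line.  This is a minimal sketch assuming the message wording
shown in the log above; prior releases label the messages a little
differently, so the pattern would need adjusting for them, and it only
covers the "terminated by signal" case.  The name find_crash_triggers is
made up for illustration.

```python
import re

# Wording taken from the sample log in this message; case is ignored in
# case a release spells "pid" differently.
TRIGGER = re.compile(
    r"server process \(pid (\d+)\) was terminated by signal (\d+)",
    re.IGNORECASE,
)


def find_crash_triggers(log_lines):
    """Return (line_number, pid, signal) for each signal-death report."""
    hits = []
    for lineno, line in enumerate(log_lines, start=1):
        match = TRIGGER.search(line)
        if match:
            hits.append((lineno, int(match.group(1)), int(match.group(2))))
    return hits


# A few lines copied from the log excerpt above.
SAMPLE_LOG = [
    "LOG:  server process (pid 1700) was terminated by signal 9",
    "LOG:  terminating any other active server processes",
    "WARNING:  Message from PostgreSQL backend:",
    "LOG:  all server processes terminated; reinitializing shared memory and semaphores",
]

if __name__ == "__main__":
    print(find_crash_triggers(SAMPLE_LOG))  # [(1, 1700, 9)]
```

Everything after the first hit, up to "database system is ready", is the
recovery sequence described earlier and can be skipped when hunting for
the cause.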

So, what are you seeing to start the avalanche?

            regards, tom lane
