Re: performance tuning: shared_buffers, sort_mem; swap - Mailing list pgsql-admin
From | Tom Lane |
---|---|
Subject | Re: performance tuning: shared_buffers, sort_mem; swap |
Date | |
Msg-id | 1806.1029277058@sss.pgh.pa.us |
In response to | Re: performance tuning: shared_buffers, sort_mem; swap (Thomas O'Connell <tfo@monsterlabs.com>) |
List | pgsql-admin |
"Thomas O'Connell" <tfo@monsterlabs.com> writes: >> If it happens to select >> a database backend to kill, the postmaster will interpret the backend's >> unexpected exit as a crash, and will force a database restart. > I guess this is what we're seeing, then. Right before the IPC error, > there are usually several of these: > "NOTICE: Message from PostgreSQL backend: > The Postmaster has informed me that some other backend > died abnormally and possibly corrupted shared memory. > is this the forced database restart you mention above? Yup. The actual sequence of events is: 1. Some backend dies (which in Unix terms is "exits with nonzero status"). We're hypothesizing that the kernel kills it with the equivalent of a "kill -9", but coredumps and other untimely ends would also produce nonzero exit status. Zero status means a normal exit(0) call. 2. The postmaster gets a report that one of its child processes quit. Seeing the nonzero status, it assumes the worst and begins the recovery fire drill. The first step is for it to send SIGQUIT to all its other children and wait for them to exit. 3. The other backends receive SIGQUIT, spit out the "The Postmaster has informed me ..." message to their connected clients, and immediately exit(). 4. When the postmaster has received exit reports for all of its children, it releases its existing shared memory segment and then begins the same procedure it would normally use at startup --- of which one of the first steps is to try to acquire a shmem segment with shmget(). What you are seeing is that that fails. Therefore, the "postmaster has informed me" messages are also post-crash noise. What you want to look for is what happened to trigger this whole sequence. What I'd expect to see is something like the attached log, which I made by issuing a manual kill -9 against a perfectly innocent backend: <<normal operation up to here>> LOG: server process (pid 1700) was terminated by signal 9 LOG: terminating any other active server processes WARNING: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. LOG: all server processes terminated; reinitializing shared memory and semaphores LOG: database system was interrupted at 2002-08-13 16:43:33 EDT LOG: checkpoint record is at 0/1C129B8 LOG: redo record is at 0/1C129B8; undo record is at 0/0; shutdown FALSE LOG: next transaction id: 5641; next oid: 156131 LOG: database system was not properly shut down; automatic recovery in progress LOG: ReadRecord: record with zero length at 0/1C129F8 LOG: redo is not required LOG: database system is ready <<normal operation resumes>> (This is with development sources, which label the messages a little differently than prior releases, but you should see pretty much the same text in your postmaster log.) The "terminated by signal 9" part, which the postmaster prints out when it gets a child-death report with a nonzero exit status, is the actually useful information in this series. I have only one "postmaster informed me" message because there was only one other live backend, but in general you might see a bunch of 'em. So, what are you seeing to start the avalanche? regards, tom lane