Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array - Mailing list pgsql-bugs

From Daniel Hahler
Subject Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
Msg-id 53319E20.9030006@thequod.de
In response to Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On 25.03.2014 15:36, Alvaro Herrera wrote:
> Tom Lane wrote:
>> postgresql@thequod.de writes:
>>> PostgreSQL just failed to start up after a reboot (which was forced via
>>> a remote Ctrl-Alt-Delete on the host running the PostgreSQL container):
>>
>>> 2014-03-24 13:32:47 CET LOG:  could not receive data from client: Connection reset by peer
>>> 2014-03-25 12:32:17 CET FATAL:  no free slots in PMChildFlags array
>>> 2014-03-25 12:32:17 CET LOG:  process 9975 releasing ProcSignal slot 108, but it contains 0
>>> 2014-03-25 12:32:17 CET LOG:  process 9974 releasing ProcSignal slot 109, but it contains 0
>>> 2014-03-25 12:32:17 CET LOG:  process 9976 releasing ProcSignal slot 110, but it contains 0
>>
>> That's odd (and as you say, unexpected) but this log extract doesn't give
>> much clue as to how we got into this state.  What was going on before
>> this?  In particular, it's hard to call this "failure to start up" when
>> you evidently had a hundred or so postmaster child processes already.
>> Could there have been some unexpected surge in the number of connection
>> attempts just after the database came up?  Also, this extract doesn't look
>> like anything that would've caused the postmaster to decide to shut down
>> again, so what happened after that?  Or in short, I want to see the rest
>> of the log not just this part.

That was the whole log.

The previous (rotated) log contains only:
2014-03-22 03:51:37 CET LOG:  could not receive data from client: Connection reset by peer
2014-03-22 03:52:25 CET LOG:  could not receive data from client: Connection reset by peer
2014-03-22 03:59:31 CET LOG:  could not receive data from client: Connection reset by peer
2014-03-22 04:00:18 CET LOG:  could not receive data from client: Connection reset by peer
2014-03-22 06:03:06 CET LOG:  could not receive data from client: Connection reset by peer

Should I increase the logging verbosity, in case this happens again?
If so, to what? (I have not configured logging yet, so it still uses the defaults from your Debian package.)

> Here's my guess --- this is a virtualized system that somehow dumped
> some state to disk to hibernate while the host was being rebooted; and
> then, when the host was up again, it tried to resurrect the virtual
> machine and found things to be all inconsistent.

Yes, the container was frozen during reboot:

From the host:
Mar 25 11:54:48 HN kernel: [   76.237452] CT: 144: started
Mar 25 11:55:03 HN kernel: [   91.201145] CT: 144: restored

OpenVZ uses "suspend" by default to stop containers on host reboots.
I will change this to "stop" for the PostgreSQL container, but this still seems like something PostgreSQL should handle better.

FWIW, I have just suspended and started the container manually, and PostgreSQL kept running (it has been upgraded to 9.3.4 in the meantime).

Maybe it's a bug in OpenVZ and how it restores some resources after the host reboots?

Please also note that the PostgreSQL error happened half an hour after the reboot/resuming of the container.


Thanks,
Daniel.

--
http://daniel.hahler.de/
