Thread: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

postgresql@thequod.de

Date:

25 March 2014, 14:17:23

The following bug has been logged on the website:

Bug reference:      9721
Logged by:          Daniel Hahler
Email address:      postgresql@thequod.de
PostgreSQL version: 9.3.3
Operating system:   Debian
Description:

PostgreSQL just failed to startup after a reboot (which was forced via
remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

2014-03-24 13:32:47 CET LOG:  could not receive data from client: Connection
reset by peer
    2014-03-25 12:32:17 CET FATAL:  no free slots in PMChildFlags array
2014-03-25 12:32:17 CET LOG:  process 9975 releasing ProcSignal slot 108,
but it contains 0
2014-03-25 12:32:17 CET LOG:  process 9974 releasing ProcSignal slot 109,
but it contains 0
2014-03-25 12:32:17 CET LOG:  process 9976 releasing ProcSignal slot 110,
but it contains 0

I have found that the source code says that this should never happen:

    /* Out of slots ... should never happen, else postmaster.c messed up */
    elog(FATAL, "no free slots in PMChildFlags array");

Another attempt at starting it worked:

2014-03-25 12:38:43 CET LOG:  database system was interrupted; last known up
at 2014-03-24 18:22:38 CET
2014-03-25 12:38:43 CET LOG:  database system was not properly shut down;
automatic recovery in progress
2014-03-25 12:38:43 CET LOG:  incomplete startup packet
2014-03-25 12:38:44 CET LOG:  record with zero length at 0/7843020
2014-03-25 12:38:44 CET LOG:  redo is not required
2014-03-25 12:38:44 CET LOG:  autovacuum launcher started
2014-03-25 12:38:44 CET LOG:  database system is ready to accept
connections


I am using 9.3.3-1.pgdg70+1 from http://apt.postgresql.org/pub/repos/apt/ on
a Debian Wheezy system, running in an OpenVZ container.

I have just now noticed that 9.3.4 has been released, and upgraded to it. I
am sorry if that's something that has been fixed in that update already.


Thanks!

Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

Tom Lane

Date:

25 March 2014, 14:26:59

postgresql@thequod.de writes:
> PostgreSQL just failed to startup after a reboot (which was forced via
> remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

> 2014-03-24 13:32:47 CET LOG:  could not receive data from client: Connection
> reset by peer
>     2014-03-25 12:32:17 CET FATAL:  no free slots in PMChildFlags array
> 2014-03-25 12:32:17 CET LOG:  process 9975 releasing ProcSignal slot 108,
> but it contains 0
> 2014-03-25 12:32:17 CET LOG:  process 9974 releasing ProcSignal slot 109,
> but it contains 0
> 2014-03-25 12:32:17 CET LOG:  process 9976 releasing ProcSignal slot 110,
> but it contains 0

That's odd (and as you say, unexpected) but this log extract doesn't give
much clue as to how we got into this state.  What was going on before
this?  In particular, it's hard to call this "failure to start up" when
you evidently had a hundred or so postmaster child processes already.
Could there have been some unexpected surge in the number of connection
attempts just after the database came up?  Also, this extract doesn't look
like anything that would've caused the postmaster to decide to shut down
again, so what happened after that?  Or in short, I want to see the rest
of the log not just this part.

            regards, tom lane

Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

Alvaro Herrera

Date:

25 March 2014, 14:37:00

Tom Lane wrote:
> postgresql@thequod.de writes:
> > PostgreSQL just failed to startup after a reboot (which was forced via
> > remote Ctrl-Alt-Delete on the PostgreSQL's containers host):
>
> > 2014-03-24 13:32:47 CET LOG:  could not receive data from client: Connection
> > reset by peer
> >     2014-03-25 12:32:17 CET FATAL:  no free slots in PMChildFlags array
> > 2014-03-25 12:32:17 CET LOG:  process 9975 releasing ProcSignal slot 108,
> > but it contains 0
> > 2014-03-25 12:32:17 CET LOG:  process 9974 releasing ProcSignal slot 109,
> > but it contains 0
> > 2014-03-25 12:32:17 CET LOG:  process 9976 releasing ProcSignal slot 110,
> > but it contains 0
>
> That's odd (and as you say, unexpected) but this log extract doesn't give
> much clue as to how we got into this state.  What was going on before
> this?  In particular, it's hard to call this "failure to start up" when
> you evidently had a hundred or so postmaster child processes already.
> Could there have been some unexpected surge in the number of connection
> attempts just after the database came up?  Also, this extract doesn't look
> like anything that would've caused the postmaster to decide to shut down
> again, so what happened after that?  Or in short, I want to see the rest
> of the log not just this part.

Here's my guess --- this is a virtualized system that somehow dumped
some state to disk to hibernate while the host was being rebooted; and
then, when the host was up again, it tried to resurrect the virtual
machine and found things to be all inconsistent.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

Alvaro Herrera

Date:

25 March 2014, 15:26:15

Daniel Hahler wrote:
> On 25.03.2014 15:36, Alvaro Herrera wrote:

> > Here's my guess --- this is a virtualized system that somehow dumped
> > some state to disk to hibernate while the host was being rebooted; and
> > then, when the host was up again, it tried to resurrect the virtual
> > machine and found things to be all inconsistent.
>
> Yes, the container was frozen during reboot:
>
> From the host:
> Mar 25 11:54:48 HN kernel: [   76.237452] CT: 144: started
> Mar 25 11:55:03 HN kernel: [   91.201145] CT: 144: restored
>
> OpenVZ uses "suspend" by default to stop containers on host reboots.
> I will change this to "stop" for the PostgreSQL container, but still this seems like something PostgreSQL should
handlebetter. 

Of course.  We will accept your patch as soon as you discover the bug.
Thanks for the offer.

> FWIW, I have just suspended and started the container manually, and
> PostgreSQL kept running (upgraded to 9.3.4 in the meantime).
>
> Maybe it's a bug with OpenVZ and how it restores some resources after rebooting the host?

Maybe it forgot to restore the shared memory state.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

Daniel Hahler

Date:

25 March 2014, 15:38:59

On 25.03.2014 15:36, Alvaro Herrera wrote:
> Tom Lane wrote:
>> postgresql@thequod.de writes:
>>> PostgreSQL just failed to startup after a reboot (which was forced vi=
a
>>> remote Ctrl-Alt-Delete on the PostgreSQL's containers host):
>>
>>> 2014-03-24 13:32:47 CET LOG:  could not receive data from client: Con=
nection
>>> reset by peer
>>>     2014-03-25 12:32:17 CET FATAL:  no free slots in PMChildFlags array
>>> 2014-03-25 12:32:17 CET LOG:  process 9975 releasing ProcSignal slot =
108,
>>> but it contains 0
>>> 2014-03-25 12:32:17 CET LOG:  process 9974 releasing ProcSignal slot =
109,
>>> but it contains 0
>>> 2014-03-25 12:32:17 CET LOG:  process 9976 releasing ProcSignal slot =
110,
>>> but it contains 0
>>
>> That's odd (and as you say, unexpected) but this log extract doesn't g=
ive
>> much clue as to how we got into this state.  What was going on before
>> this?  In particular, it's hard to call this "failure to start up" whe=
n
>> you evidently had a hundred or so postmaster child processes already.
>> Could there have been some unexpected surge in the number of connectio=
n
>> attempts just after the database came up?  Also, this extract doesn't =
look
>> like anything that would've caused the postmaster to decide to shut do=
wn
>> again, so what happened after that?  Or in short, I want to see the re=
st
>> of the log not just this part.

That was the whole log.

The rotated one before has only:
2014-03-22 03:51:37 CET LOG:  could not receive data from client: Connect=
ion reset by peer
2014-03-22 03:52:25 CET LOG:  could not receive data from client: Connect=
ion reset by peer
2014-03-22 03:59:31 CET LOG:  could not receive data from client: Connect=
ion reset by peer
2014-03-22 04:00:18 CET LOG:  could not receive data from client: Connect=
ion reset by peer
2014-03-22 06:03:06 CET LOG:  could not receive data from client: Connect=
ion reset by peer

Should I increase the logging verbosity, in case this happens again?
If so, to what? (I have not configured logging yet, so it has the default=
s from your Debian package).

> Here's my guess --- this is a virtualized system that somehow dumped
> some state to disk to hibernate while the host was being rebooted; and
> then, when the host was up again, it tried to resurrect the virtual
> machine and found things to be all inconsistent.

Yes, the container was frozen during reboot:

=46rom the host:
Mar 25 11:54:48 HN kernel: [   76.237452] CT: 144: started
Mar 25 11:55:03 HN kernel: [   91.201145] CT: 144: restored

OpenVZ uses "suspend" by default to stop containers on host reboots.
I will change this to "stop" for the PostgreSQL container, but still this=
 seems like something PostgreSQL should handle better.

FWIW, I have just suspended and started the container manually, and Postg=
reSQL kept running (upgraded to 9.3.4 in the meantime).

Maybe it's a bug with OpenVZ and how it restores some resources after reb=
ooting the host?

Please also note that the PostgreSQL error happened half an hour after th=
e reboot/resuming of the container.


Thanks,
Daniel.

--=20
http://daniel.hahler.de/

Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

Tom Lane

Date:

25 March 2014, 16:03:03

Daniel Hahler <postgresql@thequod.de> writes:
> On 25.03.2014 15:36, Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> postgresql@thequod.de writes:
>>>> PostgreSQL just failed to startup after a reboot (which was forced via
>>>> remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

>>> ... in short, I want to see the rest of the log not just this part.

> That was the whole log.

Oh.  So actually, "failed to startup" was a complete misdescription, and
what you should've said was "it appeared to be running normally, but I
saw this burst of weird messages in the postmaster log, so I decided to
restart it"?

I tend to concur with Alvaro's theory that OpenVZ must've failed to
restore the contents of shared memory precisely.  If there were no other
symptoms than these, it seems like the misrestore must have affected just
a few small areas of the shared memory segment.  Which seems odd, but then
again I know nothing of the internals of OpenVZ, so maybe it's a plausible
failure mode.

            regards, tom lane

Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

Tom Lane

Date:

25 March 2014, 16:27:15

Daniel Hahler <postgresql@thequod.de> writes:
> On 25.03.2014 17:02, Tom Lane wrote:
>> Oh.  So actually, "failed to startup" was a complete misdescription, and
>> what you should've said was "it appeared to be running normally, but I
>> saw this burst of weird messages in the postmaster log, so I decided to
>> restart it"?

> Sorry, "failed to startup" was a misdescription, because I did not remember that OpenVZ suspends containers on
reboots.
> I should have said "crashed half an hour after being resumed".

> I have not restarted PostgreSQL though, but started it when noticing that it wasn't running (anymore).

Well, your log doesn't show any evidence to support the claim that
Postgres wasn't running.  We see some child processes having minor trouble
while exiting, and one that failed while trying to start up; the latter
being strong evidence that the postmaster was still working at least.

So in short, you haven't given us a complete description of what happened.
What led you to decide you needed to restart Postgres?

            regards, tom lane

Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

Daniel Hahler

Date:

25 March 2014, 20:05:11

On 25.03.2014 16:26, Alvaro Herrera wrote:

>> FWIW, I have just suspended and started the container manually, and
>> PostgreSQL kept running (upgraded to 9.3.4 in the meantime).
>>
>> Maybe it's a bug with OpenVZ and how it restores some resources after =
rebooting the host?
>=20
> Maybe it forgot to restore the shared memory state.

Shouldn't that have caused PostgreSQL to error out much earlier - and not=
 after half an hour?

Anyway, maybe it won't affect 9.3.4 anymore, and it's good to know that i=
t might just be related to rebooting with the "suspend" option.

Regards and thanks,
Daniel.

--=20
http://daniel.hahler.de/

Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

Daniel Hahler

Date:

25 March 2014, 20:05:19

On 25.03.2014 17:26, Tom Lane wrote:
> Daniel Hahler <postgresql@thequod.de> writes:
>> On 25.03.2014 17:02, Tom Lane wrote:
>>> Oh.  So actually, "failed to startup" was a complete misdescription, =
and
>>> what you should've said was "it appeared to be running normally, but =
I
>>> saw this burst of weird messages in the postmaster log, so I decided =
to
>>> restart it"?
>=20
>> Sorry, "failed to startup" was a misdescription, because I did not rem=
ember that OpenVZ suspends containers on reboots.
>> I should have said "crashed half an hour after being resumed".
>=20
>> I have not restarted PostgreSQL though, but started it when noticing t=
hat it wasn't running (anymore).
>=20
> Well, your log doesn't show any evidence to support the claim that
> Postgres wasn't running.  We see some child processes having minor trou=
ble
> while exiting, and one that failed while trying to start up; the latter=

> being strong evidence that the postmaster was still working at least.

I've received the following error on 11:32:19 from a client app trying to=
 connect:
    OperationalError: could not connect to server: Connection refused
    Is the server running on host "10.122.x.x" and accepting
    TCP/IP connections on port x?

> So in short, you haven't given us a complete description of what happen=
ed.
> What led you to decide you needed to restart Postgres?

There was no Postgres process running in 'ps aux', and I only had to star=
t, not restart it.

Relevant entries from the shell history (llog is for syslog/messages):

   47  12:32  ps aux
   48  12:33  llog
   52  12:33  less ~l/daemon.log
   55  12:34  less ~l/postgresql/postgresql-9.3-main.log
   56  12:34  ps aux
   64  12:38  service postgresql start


Regards,
Daniel.

--=20
http://daniel.hahler.de/

Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array

From

Daniel Hahler

Date:

25 March 2014, 20:05:32

On 25.03.2014 17:02, Tom Lane wrote:
> Daniel Hahler <postgresql@thequod.de> writes:
>> On 25.03.2014 15:36, Alvaro Herrera wrote:
>>> Tom Lane wrote:
>>>> postgresql@thequod.de writes:
>>>>> PostgreSQL just failed to startup after a reboot (which was forced =
via
>>>>> remote Ctrl-Alt-Delete on the PostgreSQL's containers host):
>=20
>>>> ... in short, I want to see the rest of the log not just this part.
>=20
>> That was the whole log.
>=20
> Oh.  So actually, "failed to startup" was a complete misdescription, an=
d
> what you should've said was "it appeared to be running normally, but I
> saw this burst of weird messages in the postmaster log, so I decided to=

> restart it"?

Sorry, "failed to startup" was a misdescription, because I did not rememb=
er that OpenVZ suspends containers on reboots.
I should have said "crashed half an hour after being resumed".

I have not restarted PostgreSQL though, but started it when noticing that=
 it wasn't running (anymore).

=46rom the logs (again):
2014-03-24 13:32:47 CET LOG:  could not receive data from client: Connect=
ion reset by peer
2014-03-25 12:32:17 CET FATAL:  no free slots in PMChildFlags array
2014-03-25 12:32:17 CET LOG:  process 9975 releasing ProcSignal slot 108,=
 but it contains 0
2014-03-25 12:32:17 CET LOG:  process 9974 releasing ProcSignal slot 109,=
 but it contains 0
2014-03-25 12:32:17 CET LOG:  process 9976 releasing ProcSignal slot 110,=
 but it contains 0

I have noticed it, and started it:

2014-03-25 12:38:43 CET LOG:  database system was interrupted; last known=
 up at 2014-03-24 18:22:38 CET

If you expect more information from the logs (in case this happens again)=
, I would appreciate some hint about what would be important to enable in=
 the configuration.


Regards,
Daniel.

--=20
http://daniel.hahler.de/