Thread: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
postgresql@thequod.de
Date:
The following bug has been logged on the website:

Bug reference:      9721
Logged by:          Daniel Hahler
Email address:      postgresql@thequod.de
PostgreSQL version: 9.3.3
Operating system:   Debian
Description:

PostgreSQL just failed to startup after a reboot (which was forced via
remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection reset by peer
2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108, but it contains 0
2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109, but it contains 0
2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110, but it contains 0

I have found that the source code says that this should never happen:

/* Out of slots ... should never happen, else postmaster.c messed up */
elog(FATAL, "no free slots in PMChildFlags array");

Another attempt at starting it worked:

2014-03-25 12:38:43 CET LOG: database system was interrupted; last known up at 2014-03-24 18:22:38 CET
2014-03-25 12:38:43 CET LOG: database system was not properly shut down; automatic recovery in progress
2014-03-25 12:38:43 CET LOG: incomplete startup packet
2014-03-25 12:38:44 CET LOG: record with zero length at 0/7843020
2014-03-25 12:38:44 CET LOG: redo is not required
2014-03-25 12:38:44 CET LOG: autovacuum launcher started
2014-03-25 12:38:44 CET LOG: database system is ready to accept connections

I am using 9.3.3-1.pgdg70+1 from http://apt.postgresql.org/pub/repos/apt/
on a Debian Wheezy system, running in an OpenVZ container.

I have just now noticed that 9.3.4 has been released, and upgraded to it.
I am sorry if that's something that has been fixed in that update already.

Thanks!
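[For context on where these two messages originate: both come from PostgreSQL's shared-memory bookkeeping for postmaster children (approximately src/backend/storage/ipc/pmsignal.c and procsignal.c). The following is a simplified, self-contained sketch of that bookkeeping, not the actual PostgreSQL source; the names, slot count, and calling context are illustrative. It shows how a slot array that is stale (entries still marked "in use" from before a suspend/restore) trips the FATAL path, which is the scenario discussed later in the thread.]

/*
 * Simplified sketch of the two shared-memory checks behind the messages
 * quoted above.  Illustration only; names approximate PostgreSQL's
 * pmsignal.c / procsignal.c but this is not the real source.
 */
#include <stdio.h>
#include <stdlib.h>

#define MAX_CHILD_SLOTS 128             /* stand-in for the real per-child slot limit */

static int PMChildFlags[MAX_CHILD_SLOTS];      /* 0 = unused, 1 = assigned; lives in shared memory */
static int ProcSignalSlotPid[MAX_CHILD_SLOTS]; /* PID owning each ProcSignal slot, 0 = free */

/* Called when a new child process is being set up. */
static int
assign_child_slot(void)
{
    for (int slot = 0; slot < MAX_CHILD_SLOTS; slot++)
    {
        if (PMChildFlags[slot] == 0)
        {
            PMChildFlags[slot] = 1;
            return slot;
        }
    }
    /* Out of slots: only reachable if the shared array is stale or corrupt. */
    fprintf(stderr, "FATAL: no free slots in PMChildFlags array\n");
    exit(1);
}

/* Called by a child on exit; mirrors the sanity check behind "but it contains 0". */
static void
release_procsignal_slot(int slot, int my_pid)
{
    if (ProcSignalSlotPid[slot] != my_pid)
    {
        /* The slot does not record our PID: shared memory no longer matches reality. */
        fprintf(stderr, "LOG: process %d releasing ProcSignal slot %d, but it contains %d\n",
                my_pid, slot, ProcSignalSlotPid[slot]);
        return;
    }
    ProcSignalSlotPid[slot] = 0;
}

int
main(void)
{
    /* If a suspend/restore left every PMChildFlags entry marked "in use" ... */
    for (int slot = 0; slot < MAX_CHILD_SLOTS; slot++)
        PMChildFlags[slot] = 1;

    /* ... an exiting child logs the mismatch, and the next slot assignment is FATAL. */
    release_procsignal_slot(108, 9975);
    assign_child_slot();
    return 0;
}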
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
Tom Lane
Date:
postgresql@thequod.de writes:
> PostgreSQL just failed to startup after a reboot (which was forced via
> remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

> 2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection reset by peer
> 2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
> 2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108, but it contains 0
> 2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109, but it contains 0
> 2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110, but it contains 0

That's odd (and as you say, unexpected) but this log extract doesn't give
much clue as to how we got into this state.  What was going on before
this?  In particular, it's hard to call this "failure to start up" when
you evidently had a hundred or so postmaster child processes already.
Could there have been some unexpected surge in the number of connection
attempts just after the database came up?  Also, this extract doesn't look
like anything that would've caused the postmaster to decide to shut down
again, so what happened after that?  Or in short, I want to see the rest
of the log, not just this part.

			regards, tom lane
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
Alvaro Herrera
Date:
Tom Lane wrote:
> postgresql@thequod.de writes:
> > PostgreSQL just failed to startup after a reboot (which was forced via
> > remote Ctrl-Alt-Delete on the PostgreSQL's containers host):
>
> > 2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection reset by peer
> > 2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
> > 2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108, but it contains 0
> > 2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109, but it contains 0
> > 2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110, but it contains 0
>
> That's odd (and as you say, unexpected) but this log extract doesn't give
> much clue as to how we got into this state.  What was going on before
> this?  In particular, it's hard to call this "failure to start up" when
> you evidently had a hundred or so postmaster child processes already.
> Could there have been some unexpected surge in the number of connection
> attempts just after the database came up?  Also, this extract doesn't look
> like anything that would've caused the postmaster to decide to shut down
> again, so what happened after that?  Or in short, I want to see the rest
> of the log, not just this part.

Here's my guess --- this is a virtualized system that somehow dumped
some state to disk to hibernate while the host was being rebooted; and
then, when the host was up again, it tried to resurrect the virtual
machine and found things to be all inconsistent.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
Alvaro Herrera
Date:
Daniel Hahler wrote:
> On 25.03.2014 15:36, Alvaro Herrera wrote:
> > Here's my guess --- this is a virtualized system that somehow dumped
> > some state to disk to hibernate while the host was being rebooted; and
> > then, when the host was up again, it tried to resurrect the virtual
> > machine and found things to be all inconsistent.
>
> Yes, the container was frozen during reboot:
>
> From the host:
> Mar 25 11:54:48 HN kernel: [   76.237452] CT: 144: started
> Mar 25 11:55:03 HN kernel: [   91.201145] CT: 144: restored
>
> OpenVZ uses "suspend" by default to stop containers on host reboots.
> I will change this to "stop" for the PostgreSQL container, but still this
> seems like something PostgreSQL should handle better.

Of course.  We will accept your patch as soon as you discover the bug.
Thanks for the offer.

> FWIW, I have just suspended and started the container manually, and
> PostgreSQL kept running (upgraded to 9.3.4 in the meantime).
>
> Maybe it's a bug with OpenVZ and how it restores some resources after
> rebooting the host?

Maybe it forgot to restore the shared memory state.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
Daniel Hahler
Date:
On 25.03.2014 15:36, Alvaro Herrera wrote:
> Tom Lane wrote:
>> postgresql@thequod.de writes:
>>> PostgreSQL just failed to startup after a reboot (which was forced via
>>> remote Ctrl-Alt-Delete on the PostgreSQL's containers host):
>>
>>> 2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection reset by peer
>>> 2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
>>> 2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108, but it contains 0
>>> 2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109, but it contains 0
>>> 2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110, but it contains 0
>>
>> That's odd (and as you say, unexpected) but this log extract doesn't give
>> much clue as to how we got into this state.  What was going on before
>> this?  In particular, it's hard to call this "failure to start up" when
>> you evidently had a hundred or so postmaster child processes already.
>> Could there have been some unexpected surge in the number of connection
>> attempts just after the database came up?  Also, this extract doesn't look
>> like anything that would've caused the postmaster to decide to shut down
>> again, so what happened after that?  Or in short, I want to see the rest
>> of the log, not just this part.

That was the whole log. The rotated one before has only:

2014-03-22 03:51:37 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 03:52:25 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 03:59:31 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 04:00:18 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 06:03:06 CET LOG: could not receive data from client: Connection reset by peer

Should I increase the logging verbosity, in case this happens again?
If so, to what? (I have not configured logging yet, so it has the defaults
from your Debian package.)

> Here's my guess --- this is a virtualized system that somehow dumped
> some state to disk to hibernate while the host was being rebooted; and
> then, when the host was up again, it tried to resurrect the virtual
> machine and found things to be all inconsistent.

Yes, the container was frozen during reboot:

From the host:
Mar 25 11:54:48 HN kernel: [   76.237452] CT: 144: started
Mar 25 11:55:03 HN kernel: [   91.201145] CT: 144: restored

OpenVZ uses "suspend" by default to stop containers on host reboots.
I will change this to "stop" for the PostgreSQL container, but still this
seems like something PostgreSQL should handle better.

FWIW, I have just suspended and started the container manually, and
PostgreSQL kept running (upgraded to 9.3.4 in the meantime).

Maybe it's a bug with OpenVZ and how it restores some resources after
rebooting the host?

Please also note that the PostgreSQL error happened half an hour after the
reboot/resuming of the container.

Thanks,
Daniel.

-- 
http://daniel.hahler.de/
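[The verbosity question is not answered directly in the thread. As a rough illustration only, these are the kinds of postgresql.conf settings that would add context around connection handling, assuming the stock Debian defaults; the values are examples, not advice given by the participants:]

    log_min_messages = info          # capture more detail than the default 'warning'
    log_connections = on             # log each successful connection attempt
    log_disconnections = on          # log session end, including duration
    log_line_prefix = '%m [%p] '     # millisecond timestamp and backend PID on every line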
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
Tom Lane
Date:
Daniel Hahler <postgresql@thequod.de> writes:
> On 25.03.2014 15:36, Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> postgresql@thequod.de writes:
>>>> PostgreSQL just failed to startup after a reboot (which was forced via
>>>> remote Ctrl-Alt-Delete on the PostgreSQL's containers host):

>>> ... in short, I want to see the rest of the log not just this part.

> That was the whole log.

Oh.  So actually, "failed to startup" was a complete misdescription, and
what you should've said was "it appeared to be running normally, but I
saw this burst of weird messages in the postmaster log, so I decided to
restart it"?

I tend to concur with Alvaro's theory that OpenVZ must've failed to
restore the contents of shared memory precisely.  If there were no other
symptoms than these, it seems like the misrestore must have affected just
a few small areas of the shared memory segment.  Which seems odd, but then
again I know nothing of the internals of OpenVZ, so maybe it's a plausible
failure mode.

			regards, tom lane
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
Tom Lane
Date:
Daniel Hahler <postgresql@thequod.de> writes:
> On 25.03.2014 17:02, Tom Lane wrote:
>> Oh.  So actually, "failed to startup" was a complete misdescription, and
>> what you should've said was "it appeared to be running normally, but I
>> saw this burst of weird messages in the postmaster log, so I decided to
>> restart it"?

> Sorry, "failed to startup" was a misdescription, because I did not
> remember that OpenVZ suspends containers on reboots.
> I should have said "crashed half an hour after being resumed".
> I have not restarted PostgreSQL though, but started it when noticing
> that it wasn't running (anymore).

Well, your log doesn't show any evidence to support the claim that
Postgres wasn't running.  We see some child processes having minor trouble
while exiting, and one that failed while trying to start up; the latter
being strong evidence that the postmaster was still working at least.

So in short, you haven't given us a complete description of what happened.
What led you to decide you needed to restart Postgres?

			regards, tom lane
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
Daniel Hahler
Date:
On 25.03.2014 16:26, Alvaro Herrera wrote:
>> FWIW, I have just suspended and started the container manually, and
>> PostgreSQL kept running (upgraded to 9.3.4 in the meantime).
>>
>> Maybe it's a bug with OpenVZ and how it restores some resources after
>> rebooting the host?
>
> Maybe it forgot to restore the shared memory state.

Shouldn't that have caused PostgreSQL to error out much earlier - and not
after half an hour?

Anyway, maybe it won't affect 9.3.4 anymore, and it's good to know that it
might just be related to rebooting with the "suspend" option.

Regards and thanks,
Daniel.

-- 
http://daniel.hahler.de/
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
Daniel Hahler
Date:
On 25.03.2014 17:26, Tom Lane wrote:
> Daniel Hahler <postgresql@thequod.de> writes:
>> On 25.03.2014 17:02, Tom Lane wrote:
>>> Oh.  So actually, "failed to startup" was a complete misdescription, and
>>> what you should've said was "it appeared to be running normally, but I
>>> saw this burst of weird messages in the postmaster log, so I decided to
>>> restart it"?
>
>> Sorry, "failed to startup" was a misdescription, because I did not
>> remember that OpenVZ suspends containers on reboots.
>> I should have said "crashed half an hour after being resumed".
>
>> I have not restarted PostgreSQL though, but started it when noticing
>> that it wasn't running (anymore).
>
> Well, your log doesn't show any evidence to support the claim that
> Postgres wasn't running.  We see some child processes having minor trouble
> while exiting, and one that failed while trying to start up; the latter
> being strong evidence that the postmaster was still working at least.

I've received the following error on 11:32:19 from a client app trying to
connect:

OperationalError: could not connect to server: Connection refused
	Is the server running on host "10.122.x.x" and accepting
	TCP/IP connections on port x?

> So in short, you haven't given us a complete description of what happened.
> What led you to decide you needed to restart Postgres?

There was no Postgres process running in 'ps aux', and I only had to start,
not restart it.

Relevant entries from the shell history (llog is for syslog/messages):

  47  12:32  ps aux
  48  12:33  llog
  52  12:33  less ~l/daemon.log
  55  12:34  less ~l/postgresql/postgresql-9.3-main.log
  56  12:34  ps aux
  64  12:38  service postgresql start

Regards,
Daniel.

-- 
http://daniel.hahler.de/
Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
From
Daniel Hahler
Date:
On 25.03.2014 17:02, Tom Lane wrote:
> Daniel Hahler <postgresql@thequod.de> writes:
>> On 25.03.2014 15:36, Alvaro Herrera wrote:
>>> Tom Lane wrote:
>>>> postgresql@thequod.de writes:
>>>>> PostgreSQL just failed to startup after a reboot (which was forced via
>>>>> remote Ctrl-Alt-Delete on the PostgreSQL's containers host):
>
>>>> ... in short, I want to see the rest of the log not just this part.
>
>> That was the whole log.
>
> Oh.  So actually, "failed to startup" was a complete misdescription, and
> what you should've said was "it appeared to be running normally, but I
> saw this burst of weird messages in the postmaster log, so I decided to
> restart it"?

Sorry, "failed to startup" was a misdescription, because I did not remember
that OpenVZ suspends containers on reboots.
I should have said "crashed half an hour after being resumed".

I have not restarted PostgreSQL though, but started it when noticing that
it wasn't running (anymore).

From the logs (again):

2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection reset by peer
2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108, but it contains 0
2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109, but it contains 0
2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110, but it contains 0

I have noticed it, and started it:

2014-03-25 12:38:43 CET LOG: database system was interrupted; last known up at 2014-03-24 18:22:38 CET

If you expect more information from the logs (in case this happens again),
I would appreciate some hint about what would be important to enable in
the configuration.

Regards,
Daniel.

-- 
http://daniel.hahler.de/