Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array - Mailing list pgsql-bugs
From: Daniel Hahler
Subject: Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array
Msg-id: 53319E20.9030006@thequod.de
In response to: Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses: Re: BUG #9721: Fatal error on startup: no free slots in PMChildFlags array (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
On 25.03.2014 15:36, Alvaro Herrera wrote:
> Tom Lane wrote:
>> postgresql@thequod.de writes:
>>> PostgreSQL just failed to startup after a reboot (which was forced via
>>> remote Ctrl-Alt-Delete on the PostgreSQL container's host):
>>
>>> 2014-03-24 13:32:47 CET LOG: could not receive data from client: Connection reset by peer
>>> 2014-03-25 12:32:17 CET FATAL: no free slots in PMChildFlags array
>>> 2014-03-25 12:32:17 CET LOG: process 9975 releasing ProcSignal slot 108, but it contains 0
>>> 2014-03-25 12:32:17 CET LOG: process 9974 releasing ProcSignal slot 109, but it contains 0
>>> 2014-03-25 12:32:17 CET LOG: process 9976 releasing ProcSignal slot 110, but it contains 0
>>
>> That's odd (and as you say, unexpected) but this log extract doesn't give
>> much clue as to how we got into this state.  What was going on before
>> this?  In particular, it's hard to call this "failure to start up" when
>> you evidently had a hundred or so postmaster child processes already.
>> Could there have been some unexpected surge in the number of connection
>> attempts just after the database came up?  Also, this extract doesn't look
>> like anything that would've caused the postmaster to decide to shut down
>> again, so what happened after that?  Or in short, I want to see the rest
>> of the log, not just this part.

That was the whole log. The rotated one before has only:

2014-03-22 03:51:37 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 03:52:25 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 03:59:31 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 04:00:18 CET LOG: could not receive data from client: Connection reset by peer
2014-03-22 06:03:06 CET LOG: could not receive data from client: Connection reset by peer

Should I increase the logging verbosity, in case this happens again? If so, to what?
(I have not configured logging yet, so it has the defaults from your Debian package.)

> Here's my guess --- this is a virtualized system that somehow dumped
> some state to disk to hibernate while the host was being rebooted; and
> then, when the host was up again, it tried to resurrect the virtual
> machine and found things to be all inconsistent.

Yes, the container was frozen during reboot. From the host:

Mar 25 11:54:48 HN kernel: [   76.237452] CT: 144: started
Mar 25 11:55:03 HN kernel: [   91.201145] CT: 144: restored

OpenVZ uses "suspend" by default to stop containers on host reboots. I will change this to "stop" for the PostgreSQL container, but this still seems like something PostgreSQL should handle better.

FWIW, I have just suspended and started the container manually, and PostgreSQL kept running (upgraded to 9.3.4 in the meantime).

Maybe it's a bug with OpenVZ and how it restores some resources after rebooting the host?

Please also note that the PostgreSQL error happened half an hour after the reboot/resuming of the container.

Thanks, Daniel.

-- 
http://daniel.hahler.de/
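[Editor's note: the list did not answer the verbosity question in this message. As a hedged sketch, the standard PostgreSQL 9.x logging parameters that make this kind of incident easier to diagnose are shown below; the parameter names are real `postgresql.conf` settings, but the specific values are illustrative assumptions, not a recommendation from the thread.]

```ini
# postgresql.conf -- illustrative logging settings (values are assumptions)
log_connections = on              # log each successful connection attempt
log_disconnections = on           # log session end, with duration
log_line_prefix = '%t [%p] '      # prefix each line with timestamp and PID
log_min_messages = info           # could be raised to debug1/debug2 temporarily
```

A reload (`pg_ctl reload` or `SELECT pg_reload_conf();`) is enough to apply these; none of them require a restart.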