Thread: 8.4-vintage problem in postmaster.c

8.4-vintage problem in postmaster.c

From
Alvaro Herrera
Date:
Hi,

Stefan Kaltenbrunner reported a problem in postmaster via IM to me.  I
thought I had nailed down the bug, but after more careful reading of the
code, turns out I was wrong.

The reported problem is that postmaster shuts itself down with this
error message:

2010-11-12 10:19:05 CET FATAL:  no free slots in PMChildFlags array

I thought that canAcceptConnections() was confused about what
the result of CountChildren() meant, but apparently not.

This is a change from the 8.3 code that didn't have the ChildSlots
stuff -- previously, if canAcceptConnections failed to report
CAC_TOOMANY, it would just fail later when trying to add the backend to
the shared-inval queue, as stated in the comment therein.  In the new
code, however, failure to keep an accurate count means that we fail
later in AssigPostmasterChildSlot with a FATAL error, leading to overall
shutdown.

In postmaster.c, this all happens before forking, so I see no way for
the system to be confused due to multiple processes starting in
parallel.


If you suspect that this may have to do with some race condition on
starting many backends quickly, you would probably be right.  The
evidence from the log (which thankfully is set to DEBUG3, though most
other settings about it seem to be rather broken) says that there were
many backend starting just before the FATAL message:

2010-11-12 10:18:55 CET DEBUG:  forked new backend, pid=2632 socket=348
2010-11-12 10:18:55 CET DEBUG:  forked new backend, pid=840 socket=348
2010-11-12 10:18:55 CET DEBUG:  forked new backend, pid=2972 socket=348
2010-11-12 10:18:55 CET DEBUG:  forked new backend, pid=2724 socket=348
2010-11-12 10:18:57 CET DEBUG:  forked new backend, pid=840 socket=348
2010-11-12 10:18:57 CET DEBUG:  forked new backend, pid=2724 socket=348
2010-11-12 10:18:57 CET DEBUG:  forked new backend, pid=2632 socket=348
2010-11-12 10:19:00 CET DEBUG:  forked new backend, pid=2724 socket=348
2010-11-12 10:19:01 CET DEBUG:  forked new backend, pid=2972 socket=348
2010-11-12 10:19:01 CET DEBUG:  forked new backend, pid=2724 socket=348
2010-11-12 10:19:02 CET DEBUG:  forked new backend, pid=2984 socket=348
2010-11-12 10:19:02 CET DEBUG:  forked new backend, pid=840 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=2984 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=840 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=2984 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=2972 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=840 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=2724 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=2972 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=2904 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=840 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=1280 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=2984 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=2904 socket=348
2010-11-12 10:19:04 CET DEBUG:  forked new backend, pid=840 socket=348
2010-11-12 10:19:05 CET DEBUG:  forked new backend, pid=2724 socket=348

This is Windows 2000 Server --- I guess the PIDs being reused rather
quickly is not something to worry particularly about.  (Also note that
log_line_prefix does not include the PID so it's not easy to learn much
more from the log, according to Stefan).

-- 
Álvaro Herrera <alvherre@alvh.no-ip.org>


Re: 8.4-vintage problem in postmaster.c

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Stefan Kaltenbrunner reported a problem in postmaster via IM to me.  I
> thought I had nailed down the bug, but after more careful reading of the
> code, turns out I was wrong.

> The reported problem is that postmaster shuts itself down with this
> error message:

> 2010-11-12 10:19:05 CET FATAL:  no free slots in PMChildFlags array

Just looking at it, I think that the logic in canAcceptConnections got
broken by somebody in 8.4, and then broken some more in 9.0: in some
cases it will return an "okay to proceed" status without having checked
for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
PM_HOT_STANDBY state?  What version was actually running?
        regards, tom lane


Re: 8.4-vintage problem in postmaster.c

From
Stefan Kaltenbrunner
Date:
On 11/13/2010 06:58 PM, Tom Lane wrote:
> Alvaro Herrera<alvherre@alvh.no-ip.org>  writes:
>> Stefan Kaltenbrunner reported a problem in postmaster via IM to me.  I
>> thought I had nailed down the bug, but after more careful reading of the
>> code, turns out I was wrong.
>
>> The reported problem is that postmaster shuts itself down with this
>> error message:
>
>> 2010-11-12 10:19:05 CET FATAL:  no free slots in PMChildFlags array
>
> Just looking at it, I think that the logic in canAcceptConnections got
> broken by somebody in 8.4, and then broken some more in 9.0: in some
> cases it will return an "okay to proceed" status without having checked
> for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
> PM_HOT_STANDBY state?  What version was actually running?

I don't have too many details on the actual setup (working on that) but 
the box in question is running 8.4.2 and had no issues before the 
upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage 
looks plausible).


Stefan


Re: 8.4-vintage problem in postmaster.c

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> On 11/13/2010 06:58 PM, Tom Lane wrote:
>> Just looking at it, I think that the logic in canAcceptConnections got
>> broken by somebody in 8.4, and then broken some more in 9.0: in some
>> cases it will return an "okay to proceed" status without having checked
>> for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
>> PM_HOT_STANDBY state?  What version was actually running?

> I don't have too many details on the actual setup (working on that) but 
> the box in question is running 8.4.2 and had no issues before the 
> upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage 
> looks plausible).

Well, this failure would certainly involve a flood of connection
attempts, so it's possible it's a pre-existing bug that they just did
not happen to trip over before.  But the sequence of events that I'm
thinking about is a smart shutdown attempt (SIGTERM to postmaster)
while an online backup is in progress, followed by a flood of
near-simultaneous connection attempts while the backup is still active.
        regards, tom lane


Re: 8.4-vintage problem in postmaster.c

From
Stefan Kaltenbrunner
Date:
On 11/13/2010 11:07 PM, Tom Lane wrote:
> Stefan Kaltenbrunner<stefan@kaltenbrunner.cc>  writes:
>> On 11/13/2010 06:58 PM, Tom Lane wrote:
>>> Just looking at it, I think that the logic in canAcceptConnections got
>>> broken by somebody in 8.4, and then broken some more in 9.0: in some
>>> cases it will return an "okay to proceed" status without having checked
>>> for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
>>> PM_HOT_STANDBY state?  What version was actually running?
>
>> I don't have too many details on the actual setup (working on that) but
>> the box in question is running 8.4.2 and had no issues before the
>> upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage
>> looks plausible).
>
> Well, this failure would certainly involve a flood of connection
> attempts, so it's possible it's a pre-existing bug that they just did
> not happen to trip over before.  But the sequence of events that I'm

afaik this seems to be fairly reproducible on the current box so 
something in 8.4 seems to trigger that issue more often now.


> thinking about is a smart shutdown attempt (SIGTERM to postmaster)
> while an online backup is in progress, followed by a flood of
> near-simultaneous connection attempts while the backup is still active.

interesting - but I don't think that the setup in question actually uses 
online backups at all..



Stefan


Re: 8.4-vintage problem in postmaster.c

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of sáb nov 13 19:07:50 -0300 2010:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> > On 11/13/2010 06:58 PM, Tom Lane wrote:
> >> Just looking at it, I think that the logic in canAcceptConnections got
> >> broken by somebody in 8.4, and then broken some more in 9.0: in some
> >> cases it will return an "okay to proceed" status without having checked
> >> for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
> >> PM_HOT_STANDBY state?  What version was actually running?
> 
> > I don't have too many details on the actual setup (working on that) but 
> > the box in question is running 8.4.2 and had no issues before the 
> > upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage 
> > looks plausible).
> 
> Well, this failure would certainly involve a flood of connection
> attempts, so it's possible it's a pre-existing bug that they just did
> not happen to trip over before.  But the sequence of events that I'm
> thinking about is a smart shutdown attempt (SIGTERM to postmaster)
> while an online backup is in progress, followed by a flood of
> near-simultaneous connection attempts while the backup is still active.

As far as I could gather from Stefan's description, I think this is
pretty unlikely.  It seems to me that the "too many children" error
message is very common in the 8.3 setup already, and the only reason
they have a problem on 8.4 is that it crashes instead.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: 8.4-vintage problem in postmaster.c

From
Stefan Kaltenbrunner
Date:
On 11/15/2010 03:24 PM, Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of sáb nov 13 19:07:50 -0300 2010:
>> Stefan Kaltenbrunner<stefan@kaltenbrunner.cc>  writes:
>>> On 11/13/2010 06:58 PM, Tom Lane wrote:
>>>> Just looking at it, I think that the logic in canAcceptConnections got
>>>> broken by somebody in 8.4, and then broken some more in 9.0: in some
>>>> cases it will return an "okay to proceed" status without having checked
>>>> for TOOMANY children.  Was this system possibly in PM_WAIT_BACKUP or
>>>> PM_HOT_STANDBY state?  What version was actually running?
>>
>>> I don't have too many details on the actual setup (working on that) but
>>> the box in question is running 8.4.2 and had no issues before the
>>> upgrade to 8.4 (ie 8.3 was reported to work fine - so a 8.4+ breakage
>>> looks plausible).
>>
>> Well, this failure would certainly involve a flood of connection
>> attempts, so it's possible it's a pre-existing bug that they just did
>> not happen to trip over before.  But the sequence of events that I'm
>> thinking about is a smart shutdown attempt (SIGTERM to postmaster)
>> while an online backup is in progress, followed by a flood of
>> near-simultaneous connection attempts while the backup is still active.
>
> As far as I could gather from Stefan's description, I think this is
> pretty unlikely.  It seems to me that the "too many children" error
> message is very common in the 8.3 setup already, and the only reason
> they have a problem on 8.4 is that it crashes instead.

not sure if that is true - but 8.4 crashes whereas 8.3  just (seems to) 
works - the issue is still there with 8_4_STABLE...

DEBUG3 level output (last few hours - 7MB in size) is available under 
http://www.kaltenbrunner.cc/files/postgresql-2010-11-24_143513.log
From looking at the code I'm not immediatly seeing what is going wrong 
here but maybe somebody else has an idea.


Stefan