Thread: Standalone backends run StartupXLOG in an incorrect environment

Standalone backends run StartupXLOG in an incorrect environment

From

Tom Lane

Date:

19 September 2009, 14:21:22

I realized the truth of $SUBJECT while reading this report:
http://archives.postgresql.org/pgsql-general/2009-09/msg00712.php

In a standalone backend, postgres.c tries to run StartupXLOG after
having done only BaseInit(), which means that we don't have a PGPROC
(hence can't take LWLocks much less heavyweight locks) and we have not
totally finished initializing the bufmgr either.  This is apparently
enough for the "normal" case where there's no log replay to do; but
as the above report shows, it's completely inadequate for some of the
more complex code paths in replay.  I suspect this has been broken
from the beginning.

Fixing this will require rearranging things around InitPostgres
(in particular, I think InitBufferPoolBackend will have to be
called directly from postgres.c).  Since that code got rearranged
quite a bit last month, I'd be hesitant to try to back-patch whatever
fix we come up with for HEAD.  Seeing that we'd never noticed the
problem before, I think it's okay to fix it just in HEAD and not
risk back-patching ... comments?

Also, does this have any impact on the Hot Standby stuff?
        regards, tom lane

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Alvaro Herrera

Date:

21 September 2009, 15:08:17

Tom Lane wrote:

> Fixing this will require rearranging things around InitPostgres
> (in particular, I think InitBufferPoolBackend will have to be
> called directly from postgres.c).  Since that code got rearranged
> quite a bit last month, I'd be hesitant to try to back-patch whatever
> fix we come up with for HEAD.  Seeing that we'd never noticed the
> problem before, I think it's okay to fix it just in HEAD and not
> risk back-patching ... comments?

So if you need to enter standalone mode, you'll have to start
postmaster, wait for replay to finish, stop it and restart standalone.
Would this be a problem when you need standalone mode in an emergency,
for example when the database won't start due to Xid wraparound?
Frankly I don't have a problem with no backpatching but ...

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Tom Lane

Date:

21 September 2009, 15:24:28

Alvaro Herrera <alvherre@commandprompt.com> writes:
> So if you need to enter standalone mode, you'll have to start
> postmaster, wait for replay to finish, stop it and restart standalone.

Yeah, that's the case at the moment.

> Would this be a problem when you need standalone mode in an emergency,
> for example when the database won't start due to Xid wraparound?

If it won't run through StartupXLOG under the postmaster, it's not going
to do so standalone either.
        regards, tom lane

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Robert Treat

Date:

21 September 2009, 23:27:11

On Monday 21 September 2009 14:24:07 Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > So if you need to enter standalone mode, you'll have to start
> > postmaster, wait for replay to finish, stop it and restart standalone.
>
> Yeah, that's the case at the moment.
>
> > Would this be a problem when you need standalone mode in an emergency,
> > for example when the database won't start due to Xid wraparound?
>
> If it won't run through StartupXLOG under the postmaster, it's not going
> to do so standalone either.
>

Hmm... istr cases where I couldn't startup regular postgres but could in 
stand-alone mode that had system indexes disabled...I could be misremembering 
that so that the postmaster would start, I just couldn't connect unless in 
stand-alone.  In any case this does seem less than ideal, but if there aren't 
any better options...

-- 
Robert Treat
Conjecture: http://www.xzilla.net
Consulting: http://www.omniti.com

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Simon Riggs

Date:

17 March 2010, 18:04:46

On Sat, 2009-09-19 at 13:21 -0400, Tom Lane wrote:
> I realized the truth of $SUBJECT while reading this report:
> http://archives.postgresql.org/pgsql-general/2009-09/msg00712.php

...

> Also, does this have any impact on the Hot Standby stuff?

It could potentially, but there is not much HS-related code in the early
boot sequence. It's mostly StartupXLog() or later.

HS won't work in standalone backends, by definition, so I don't see too
much to worry me.

[This is an open item for 9.0, hence the response to an apparently old
hackers thread]

-- Simon Riggs           www.2ndQuadrant.com

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Bruce Momjian

Date:

22 March 2010, 22:29:52

Simon Riggs wrote:
> On Sat, 2009-09-19 at 13:21 -0400, Tom Lane wrote:
> > I realized the truth of $SUBJECT while reading this report:
> > http://archives.postgresql.org/pgsql-general/2009-09/msg00712.php
> 
> ...
> 
> > Also, does this have any impact on the Hot Standby stuff?
> 
> It could potentially, but there is not much HS-related code in the early
> boot sequence. It's mostly StartupXLog() or later.
> 
> HS won't work in standalone backends, by definition, so I don't see too
> much to worry me.
> 
> [This is an open item for 9.0, hence the response to an apparently old
> hackers thread]

Thanks for the reply;  9.0 open item removed.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Tom Lane

Date:

22 March 2010, 22:31:00

Bruce Momjian <bruce@momjian.us> writes:
>> [This is an open item for 9.0, hence the response to an apparently old
>> hackers thread]

> Thanks for the reply;  9.0 open item removed.

I think you misread his reply.  Please put that back.
        regards, tom lane

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Bruce Momjian

Date:

22 March 2010, 22:32:28

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> >> [This is an open item for 9.0, hence the response to an apparently old
> >> hackers thread]
> 
> > Thanks for the reply;  9.0 open item removed.
> 
> I think you misread his reply.  Please put that back.

OK,  I re-read it and still don't understand, but I don't have to.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Robert Haas

Date:

19 April 2010, 12:00:35

On Mon, Mar 22, 2010 at 9:32 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>> >> [This is an open item for 9.0, hence the response to an apparently old
>> >> hackers thread]
>>
>> > Thanks for the reply;  9.0 open item removed.
>>
>> I think you misread his reply.  Please put that back.
>
> OK,  I re-read it and still don't understand, but I don't have to.

I re-read it too and I don't understand either.  This is LISTED as an
open item for 9.0, but it is apparently not a new regression, so I
think we should move it to the Todo list instead.  This problem was
discovered six months ago, is not a new regression, and there is
apparently no movement toward a fix, so it doesn't make sense to me
that we should hold up either 9.0 beta or 9.0 final on account of it.
Or am I confused?

....Robert

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Tom Lane

Date:

19 April 2010, 12:20:02

Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Mar 22, 2010 at 9:32 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> OK, �I re-read it and still don't understand, but I don't have to.

> I re-read it too and I don't understand either.

The point is that a standalone backend will fail to execute recovery
correctly:
http://archives.postgresql.org/pgsql-hackers/2009-09/msg01297.php
This is bad enough now but seems likely to be an even bigger foot-gun
in an HS/SR world.

> This is LISTED as an
> open item for 9.0, but it is apparently not a new regression, so I
> think we should move it to the Todo list instead.  This problem was
> discovered six months ago, is not a new regression, and there is
> apparently no movement toward a fix, so it doesn't make sense to me
> that we should hold up either 9.0 beta or 9.0 final on account of it.

If you think we're at the point where this item is the main thing
standing between us and beta, I'll go do something about it.  I've
been waiting for the HS code to settle before trying to design a
solution...
        regards, tom lane

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Simon Riggs

Date:

19 April 2010, 13:09:35

On Mon, 2010-04-19 at 11:19 -0400, Tom Lane wrote:

> If you think we're at the point where this item is the main thing
> standing between us and beta, I'll go do something about it.  I've
> been waiting for the HS code to settle before trying to design a
> solution...

I'm not hugely interested in supporting HS in standalone backends. If it
works, that's nice. If it doesn't then I suggest we don't hold up beta
for it.

-- Simon Riggs           www.2ndQuadrant.com

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Robert Haas

Date:

19 April 2010, 13:09:53

On Mon, Apr 19, 2010 at 11:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Mar 22, 2010 at 9:32 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>> OK,  I re-read it and still don't understand, but I don't have to.
>
>> I re-read it too and I don't understand either.
>
> The point is that a standalone backend will fail to execute recovery
> correctly:
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01297.php
> This is bad enough now but seems likely to be an even bigger foot-gun
> in an HS/SR world.

OK.

>> This is LISTED as an
>> open item for 9.0, but it is apparently not a new regression, so I
>> think we should move it to the Todo list instead.  This problem was
>> discovered six months ago, is not a new regression, and there is
>> apparently no movement toward a fix, so it doesn't make sense to me
>> that we should hold up either 9.0 beta or 9.0 final on account of it.
>
> If you think we're at the point where this item is the main thing
> standing between us and beta, I'll go do something about it.  I've
> been waiting for the HS code to settle before trying to design a
> solution...

I'm not sure if this is the main thing, but I think it's probably in the top 5.

At present there are 8 items (not counting documentation issues) listed at:

http://wiki.postgresql.org/wiki/PostgreSQL_9.0_Open_Items

...not all of which seem likely to get fixed, and probably 1-3
additional patches that are floating around out there without having
formally gotten added to the list.  I think it's realistic to think
that we could be within 10 commits of beta.

...Robert

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Tom Lane

Date:

19 April 2010, 14:01:20

Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2010-04-19 at 11:19 -0400, Tom Lane wrote:
>> If you think we're at the point where this item is the main thing
>> standing between us and beta, I'll go do something about it.  I've
>> been waiting for the HS code to settle before trying to design a
>> solution...

> I'm not hugely interested in supporting HS in standalone backends.

I'm not either.  The type of scenario that I'm worried about is someone
trying to use a standalone backend to clean up after a recovery failure.
Right now I'm afraid that could make things worse (ie create additional
corruption).
        regards, tom lane

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Tom Lane

Date:

19 April 2010, 15:34:36

I wrote:
> The point is that a standalone backend will fail to execute recovery
> correctly:
> http://archives.postgresql.org/pgsql-hackers/2009-09/msg01297.php

After digging around a bit, it seems like the cleanest solution would be
to move the responsibility for calling StartupXLOG in a standalone
backend into InitPostgres.  At the point where the latter currently has
/* * Initialize local process's access to XLOG, if appropriate.  In * bootstrap case we skip this since StartupXLOG()
wasrun instead. */if (!bootstrap)    (void) RecoveryInProgress();

we'd add a couple of lines to call StartupXLOG if !IsUnderPostmaster,
and then remove the call from postgres.c.  I haven't tested this yet
but it looks like the correct state has been set up at that point.
Anyone see any obvious holes in the idea?
        regards, tom lane

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Simon Riggs

Date:

19 April 2010, 16:15:02

On Mon, 2010-04-19 at 14:34 -0400, Tom Lane wrote:

> Anyone see any obvious holes in the idea?

Nothing springs to mind, so worth a try.

-- Simon Riggs           www.2ndQuadrant.com

Re: Standalone backends run StartupXLOG in an incorrect environment

From

Tom Lane

Date:

19 April 2010, 17:18:54

BTW, just for the archives' sake: it took a good long time to develop a
reproducible test case for this.  It seems that 99% of the WAL replay
code does *not* depend on the missing state.  I was eventually able to
reproduce the case originally reported, namely a crash during
btree_xlog_cleanup; but to get that you need (a) WAL to end between a
btree page split and insertion of the new parent record, and (b) have
the resulting insertion need to obtain a new btree page, ie there's
another split or newroot forced then, and (c) not have any available
pages in the index's FSM, so that we have to LockRelationForExtension,
which is what crashes for lack of a PGPROC.

So that probably explains why we went so long without recognizing the
bug.

This again points up the fact that WAL recovery isn't as well tested
as one could wish.  Koichi-san's efforts to create a test with 100%
coverage of all types of WAL records are good, but that'd not have
helped us to find this.  We should think about ways to provide better
test coverage of end-of-WAL cleanup.
        regards, tom lane