Re: Idea for improving buildfarm robustness - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Idea for improving buildfarm robustness
Msg-id 1555.1443556067@sss.pgh.pa.us
In response to Re: Idea for improving buildfarm robustness  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Idea for improving buildfarm robustness  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: Idea for improving buildfarm robustness  (Joe Conway <mail@joeconway.com>)
List pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:
> On 09/29/2015 11:48 AM, Tom Lane wrote:
>> But today I thought of another way: suppose that we teach the postmaster
>> to commit hara-kiri if the $PGDATA directory goes away.  Since the
>> buildfarm script definitely does remove all the temporary data directories
>> it creates, this ought to get the job done.

> This would also be useful for production.  I can't count the number of
> times I've accidentally blown away a replica's PGDATA without shutting
> the postmaster down first, and then had to do a bunch of kill -9.

> In general, having the postmaster survive deletion of PGDATA is
> suboptimal.  In rare cases of having it survive installation of a new
> PGDATA (via PITR restore, for example), I've even seen the zombie
> postmaster corrupt the data files.

Side comment on that: if you'd actually removed $PGDATA, I can't see how
that would happen.  The postmaster and children would have open CWD
handles to the now-disconnected-from-anything-else directory inode,
which would not enable them to reach files created under the new directory
inode.  (They don't ever use absolute paths, only relative, or at least
that's the way it's supposed to work.)

However ... if you'd simply deleted everything *under* $PGDATA but not
that directory itself, then this type of failure mode is 100% plausible.
And that's not an unreasonable thing to do, especially if you've set
things up so that $PGDATA's parent is not a writable directory.
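
The difference between the two cases can be reproduced outside the server. The following is only a sketch in Python (PostgreSQL itself is C), assuming a POSIX filesystem; the function name and the "pgdata" directory are hypothetical stand-ins, and the process simply plays the role of a postmaster whose CWD is the data directory.

```python
import os
import tempfile


def old_cwd_reaches_new_files(remove_dir_itself):
    """Simulate a process whose cwd is a data directory that gets wiped and
    repopulated. Returns True if a relative lookup from the stale cwd finds
    the newly created global/pg_control."""
    saved_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as parent:
        pgdata = os.path.join(parent, "pgdata")  # hypothetical stand-in for $PGDATA
        os.makedirs(os.path.join(pgdata, "global"))
        control = os.path.join(pgdata, "global", "pg_control")
        with open(control, "w") as f:
            f.write("old cluster")

        os.chdir(pgdata)  # like the postmaster, sit in the data directory
        try:
            # Someone removes the old cluster, addressing it by absolute path.
            os.remove(control)
            os.rmdir(os.path.join(pgdata, "global"))
            if remove_dir_itself:
                os.rmdir(pgdata)  # our cwd now refers to an unlinked inode

            # A new cluster is installed at the same path.
            os.makedirs(os.path.join(pgdata, "global"), exist_ok=True)
            with open(control, "w") as f:
                f.write("new cluster")

            # Relative lookup, the way the server processes resolve paths.
            return os.path.exists(os.path.join("global", "pg_control"))
        finally:
            os.chdir(saved_cwd)
```

On a typical Linux system, old_cwd_reaches_new_files(True) should come back False (the old directory inode is disconnected from the new tree), while old_cwd_reaches_new_files(False) should come back True, which is the case where a leftover process could scribble on the new files.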

Testing accessibility of "global/pg_control" would be enough to catch this
case, but only if we do it before you create a new one.  So that seems
like an argument for making the test relatively often.  The once-a-minute
option is sounding better and better.
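
For illustration, a once-a-minute accessibility test might look roughly like the sketch below. This is Python rather than the C the postmaster is written in, and the class name, the injectable clock, and the choice of which errors count as "gone" are all assumptions made for the example, not anything proposed in this thread.

```python
import time


class ControlFileWatch:
    """Rate-limited check that a control file is still reachable."""

    def __init__(self, path="global/pg_control", interval=60.0,
                 clock=time.monotonic):
        self.path = path          # relative, resolved against the process cwd
        self.interval = interval  # seconds between checks (once a minute here)
        self.clock = clock
        self.next_check = clock() + interval

    def should_exit(self):
        """Return True only when a check is due and the file is gone."""
        now = self.clock()
        if now < self.next_check:
            return False
        self.next_check = now + self.interval
        try:
            with open(self.path, "rb"):
                return False
        except (FileNotFoundError, NotADirectoryError):
            return True
        except OSError:
            # Assumption: other errors (for example, running out of file
            # descriptors) are treated as transient and do not trigger exit.
            return False
```

A main loop would call should_exit() on each iteration and begin shutdown when it returns True. The shorter the interval, the smaller the window in which a new pg_control can appear before the old process notices the old one vanished.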

We could possibly add additional checks, like trying to verify that
pg_control has the same inode number it used to.  But I'm afraid that
would add portability issues and false-positive hazards that would
outweigh the value.
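
To make the inode idea concrete, here is a rough Python sketch of what such a check could look like. The helper names are hypothetical, and it assumes a platform where st_dev and st_ino are meaningful, which is itself part of the portability concern.

```python
import os


def file_identity(path="global/pg_control"):
    """Return (st_dev, st_ino) for path, or None if it cannot be stat'ed."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    return (st.st_dev, st.st_ino)


def control_file_replaced(remembered, path="global/pg_control"):
    """True if path is missing or is no longer the file we remembered."""
    return file_identity(path) != remembered
```

A process would record file_identity() at startup and call control_file_replaced() later with the remembered value. The hazards cut both ways: a filesystem may hand a freshly created file the inode number that was just freed, so a delete-and-recreate can go unnoticed, and anything that legitimately replaces the file by rename would trip the check.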
        regards, tom lane


