Thread: Idea for improving buildfarm robustness

Idea for improving buildfarm robustness

From: Tom Lane
A problem the buildfarm has had for a long time is that if for some reason
the scripts fail to stop a test postmaster, the postmaster process will
hang around and cause subsequent runs to fail because of socket conflicts.
This seems to have gotten a lot worse lately due to the influx of very
slow buildfarm machines, but the risk has always been there.

I've been thinking about teaching the buildfarm script to "kill -9"
any postmasters left around at the end of the run, but that's fairly
problematic: how do you find such processes (since "ps" output isn't
hugely portable, especially not to Windows), and how do you tell them
apart from postmasters not started by the script?  So the idea was on
hold.

But today I thought of another way: suppose that we teach the postmaster
to commit hara-kiri if the $PGDATA directory goes away.  Since the
buildfarm script definitely does remove all the temporary data directories
it creates, this ought to get the job done.

An easy way to do that would be to have it check every so often if
pg_control can still be read.  We should not have it fail on ENFILE or
EMFILE, since that would create a new failure hazard under heavy load,
but ENOENT or similar would be reasonable grounds for deciding that
something is horribly broken.  (At least on Windows, failing on EPERM
doesn't seem wise either, since we've seen antivirus products randomly
causing such errors.)
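
To make that concrete, here is a minimal standalone sketch of the sort of
check being floated (illustrative only: it assumes the current directory is
$PGDATA, as it is in the postmaster, and the draft patch later in this
thread does the same thing properly with AllocateFile and ereport):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative only: report "gone" just for errnos that clearly mean the
 * file (or its containing directory) no longer exists; treat anything
 * else, such as ENFILE, EMFILE, or EPERM, as transient and keep going.
 */
static bool
data_dir_still_present(void)
{
    FILE       *fp = fopen("global/pg_control", "rb");

    if (fp != NULL)
    {
        fclose(fp);
        return true;
    }

    switch (errno)
    {
        case ENOENT:
        case ENOTDIR:
            return false;
        default:
            return true;
    }
}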

I wouldn't want to do this every time through the postmaster's main loop,
but we could do this once an hour for no added cost by adding the check
where it does TouchSocketLockFiles; or once every few minutes if we
carried a separate variable like last_touch_time.  Once an hour would be
plenty to fix the buildfarm's problem, I should think.

Another question is what exactly "commit hara-kiri" should consist of.
We could just abort() or _exit(1) and leave it to child processes to
notice that the postmaster is gone, or we could make an effort to clean
up.  I'd be a bit inclined to treat it like a SIGQUIT situation, ie
kill all the children and exit.  The children are probably having
problems of their own if the data directory's gone, so forcing
termination might be best to keep them from getting stuck.

Also, perhaps we'd only enable this behavior in --enable-cassert builds,
to avoid any risk of a postmaster incorrectly choosing to suicide in a
production scenario.  Or maybe that's overly conservative.

Thoughts?
        regards, tom lane



Re: Idea for improving buildfarm robustness

From: Josh Berkus
On 09/29/2015 11:48 AM, Tom Lane wrote:
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away.  Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

This would also be useful for production.  I can't count the number of
times I've accidentally blown away a replica's PGDATA without shutting
the postmaster down first, and then had to do a bunch of kill -9.

In general, having the postmaster survive deletion of PGDATA is
suboptimal.  In rare cases of having it survive installation of a new
PGDATA (via PITR restore, for example), I've even seen the zombie
postmaster corrupt the data files.

So if you want this change to be useful beyond the buildfarm, it should
check every few minutes, and you'd SIGQUIT.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Idea for improving buildfarm robustness

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away.  Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

Yes, please.

> An easy way to do that would be to have it check every so often if
> pg_control can still be read.  We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken.  (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)

Sounds pretty reasonable to me.

> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time.  Once an hour would be
> plenty to fix the buildfarm's problem, I should think.

I have a bad (?) habit of doing exactly this during development and
would really like it to be a bit more often than once/hour, unless
there's a particular problem with that.

> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up.  I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit.  The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.

I like the idea of killing all the children and then exiting.

> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario.  Or maybe that's overly conservative.

That would work for my use-case.  Perhaps only on --enable-cassert
builds for back-branches but enable it in master and see how things go
for 9.6?  I agree that it feels overly conservative, but given our
recent history, we should be overly cautious with the back branches.

> Thoughts?

Thanks!

Stephen

Re: Idea for improving buildfarm robustness

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> I wouldn't want to do this every time through the postmaster's main loop,
>> but we could do this once an hour for no added cost by adding the check
>> where it does TouchSocketLockFiles; or once every few minutes if we
>> carried a separate variable like last_touch_time.  Once an hour would be
>> plenty to fix the buildfarm's problem, I should think.

> I have a bad (?) habit of doing exactly this during development and
> would really like it to be a bit more often than once/hour, unless
> there's a particular problem with that.  

Yeah, Josh mentioned the same.  It would only take another three or four
lines of code to decouple it from TouchSocketLockFiles, and then it's
just a question of how much are you worried about the performance cost of
additional file-open attempts.  I think either one-minute or five-minute
intervals would be pretty defensible.
        regards, tom lane



Re: Idea for improving buildfarm robustness

From: Andrew Dunstan

On 09/29/2015 02:48 PM, Tom Lane wrote:
> A problem the buildfarm has had for a long time is that if for some reason
> the scripts fail to stop a test postmaster, the postmaster process will
> hang around and cause subsequent runs to fail because of socket conflicts.
> This seems to have gotten a lot worse lately due to the influx of very
> slow buildfarm machines, but the risk has always been there.
>
> I've been thinking about teaching the buildfarm script to "kill -9"
> any postmasters left around at the end of the run, but that's fairly
> problematic: how do you find such processes (since "ps" output isn't
> hugely portable, especially not to Windows), and how do you tell them
> apart from postmasters not started by the script?  So the idea was on
> hold.
>
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away.  Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.
>
> An easy way to do that would be to have it check every so often if
> pg_control can still be read.  We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken.  (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)
>
> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time.  Once an hour would be
> plenty to fix the buildfarm's problem, I should think.
>
> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up.  I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit.  The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.
>
> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario.  Or maybe that's overly conservative.
>
> Thoughts?
>



It's a fine idea. This is much more likely to be robust than any 
buildfarm client fix.

Not every buildfarm member uses cassert, so I'm not sure that's the best 
way to go. axolotl doesn't, and it's one of those that regularly has 
speed problems. Maybe a not-very-well-publicized GUC, or an environment 
setting? Or maybe just enable this anyway. If the data directory is gone 
what's the point in keeping the postmaster around? Shutting it down 
doesn't seem likely to cause any damage.


cheers

andrew



Re: Idea for improving buildfarm robustness

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> I wouldn't want to do this every time through the postmaster's main loop,
> >> but we could do this once an hour for no added cost by adding the check
> >> where it does TouchSocketLockFiles; or once every few minutes if we
> >> carried a separate variable like last_touch_time.  Once an hour would be
> >> plenty to fix the buildfarm's problem, I should think.
>
> > I have a bad (?) habit of doing exactly this during development and
> > would really like it to be a bit more often than once/hour, unless
> > there's a particular problem with that.
>
> Yeah, Josh mentioned the same.  It would only take another three or four
> lines of code to decouple it from TouchSocketLockFiles, and then it's
> just a question of how much are you worried about the performance cost of
> additional file-open attempts.  I think either one-minute or five-minute
> intervals would be pretty defensible.

Perhaps I'm missing something, but it doesn't strike me as a terribly
expensive operation, and once a minute would work out quite well for my
needs, at least.

Running for long after pg_control has disappeared doesn't strike me as a
great idea anyway..

Thanks!

Stephen

Re: Idea for improving buildfarm robustness

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> On 09/29/2015 02:48 PM, Tom Lane wrote:
>> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
>> to avoid any risk of a postmaster incorrectly choosing to suicide in a
>> production scenario.  Or maybe that's overly conservative.

> Not every buildfarm member uses cassert, so I'm not sure that's the best 
> way to go. axolotl doesn't, and it's one of those that regularly has 
> speed problems. Maybe a not-very-well-publicized GUC, or an environment 
> setting? Or maybe just enable this anyway. If the data directory is gone 
> what's the point in keeping the postmaster around? Shutting it down 
> doesn't seem likely to cause any damage.

The only argument I can see against just turning it on all the time is
the possibility of false positives.  I mentioned ENFILE and EPERM as
foreseeable false-positive conditions, and I'm worried that there might be
others.  It might be good if we have a small list of specific errnos that
cause us to conclude we should die, rather than a small list that cause us
not to.  But as long as we're reasonably confident that we're seeing an
error that means somebody deleted pg_control, I think abandoning ship
is just fine.
        regards, tom lane



Re: Idea for improving buildfarm robustness

From: Josh Berkus
On 09/29/2015 12:18 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 09/29/2015 02:48 PM, Tom Lane wrote:
>>> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
>>> to avoid any risk of a postmaster incorrectly choosing to suicide in a
>>> production scenario.  Or maybe that's overly conservative.
> 
>> Not every buildfarm member uses cassert, so I'm not sure that's the best 
>> way to go. axolotl doesn't, and it's one of those that regularly has 
>> speed problems. Maybe a not-very-well-publicized GUC, or an environment 
>> setting? Or maybe just enable this anyway. If the data directory is gone 
>> what's the point in keeping the postmaster around? Shutting it down 
>> doesn't seem likely to cause any damage.
> 
> The only argument I can see against just turning it on all the time is
> the possibility of false positives.  I mentioned ENFILE and EPERM as
> foreseeable false-positive conditions, and I'm worried that there might be
> others.  It might be good if we have a small list of specific errnos that
> cause us to conclude we should die, rather than a small list that cause us
> not to.  But as long as we're reasonably confident that we're seeing an
> error that means somebody deleted pg_control, I think abandoning ship
> is just fine.

Give me source with the change, and I'll put it on a cheap, low-bandwidth
AWS instance and hammer the heck out of it.  That should raise any false
positives we can expect.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Idea for improving buildfarm robustness

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> On 09/29/2015 11:48 AM, Tom Lane wrote:
>> But today I thought of another way: suppose that we teach the postmaster
>> to commit hara-kiri if the $PGDATA directory goes away.  Since the
>> buildfarm script definitely does remove all the temporary data directories
>> it creates, this ought to get the job done.

> This would also be useful for production.  I can't count the number of
> times I've accidentally blown away a replica's PGDATA without shutting
> the postmaster down first, and then had to do a bunch of kill -9.

> In general, having the postmaster survive deletion of PGDATA is
> suboptimal.  In rare cases of having it survive installation of a new
> PGDATA (via PITR restore, for example), I've even seen the zombie
> postmaster corrupt the data files.

Side comment on that: if you'd actually removed $PGDATA, I can't see how
that would happen.  The postmaster and children would have open CWD
handles to the now-disconnected-from-anything-else directory inode,
which would not enable them to reach files created under the new directory
inode.  (They don't ever use absolute paths, only relative, or at least
that's the way it's supposed to work.)

However ... if you'd simply deleted everything *under* $PGDATA but not
that directory itself, then this type of failure mode is 100% plausible.
And that's not an unreasonable thing to do, especially if you've set
things up so that $PGDATA's parent is not a writable directory.

Testing accessibility of "global/pg_control" would be enough to catch this
case, but only if we do it before you create a new one.  So that seems
like an argument for making the test relatively often.  The once-a-minute
option is sounding better and better.

We could possibly add additional checks, like trying to verify that
pg_control has the same inode number it used to.  But I'm afraid that
would add portability issues and false-positive hazards that would
outweigh the value.
        regards, tom lane



Re: Idea for improving buildfarm robustness

From: Alvaro Herrera
Tom Lane wrote:

> Testing accessibility of "global/pg_control" would be enough to catch this
> case, but only if we do it before you create a new one.  So that seems
> like an argument for making the test relatively often.  The once-a-minute
> option is sounding better and better.

If we weren't afraid of portability issues or checks that only work on
certain platforms, we could use inotify on Linux and get it to signal
the postmaster when pg_control is deleted.  There are various
implementations of similar things on different platforms (kqueues on
BSD, and presumably equivalents elsewhere) -- though admittedly that
code may quickly become worse than the select/poll loops (which are
ugly enough).  Maybe it'd be okay if we just use a descriptor that sets
the process latch when signalled.
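
For what it's worth, a minimal Linux-only sketch of that approach
(illustrative only, not part of any patch in this thread; a real
integration would hook the descriptor into the postmaster's existing
wait loop rather than blocking on read()):

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int
main(void)
{
    char        buf[4096];
    int         fd = inotify_init();
    int         wd;

    if (fd < 0)
    {
        perror("inotify_init");
        return 1;
    }

    /* Fire when the watched file itself is deleted or renamed away. */
    wd = inotify_add_watch(fd, "global/pg_control",
                           IN_DELETE_SELF | IN_MOVE_SELF);
    if (wd < 0)
    {
        perror("inotify_add_watch");
        return 1;
    }

    /* Blocks until something happens to pg_control. */
    if (read(fd, buf, sizeof(buf)) > 0)
        printf("pg_control was removed or replaced; time to panic\n");

    close(fd);
    return 0;
}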

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Idea for improving buildfarm robustness

From: Joe Conway
On 09/29/2015 12:47 PM, Tom Lane wrote:
> We could possibly add additional checks, like trying to verify that
> pg_control has the same inode number it used to.  But I'm afraid that
> would add portability issues and false-positive hazards that would
> outweigh the value.

Not sure you remember the incident, but I think years ago that would
have saved me some heartache ;-)

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Re: Idea for improving buildfarm robustness

From: Alvaro Herrera
Joe Conway wrote:
> On 09/29/2015 12:47 PM, Tom Lane wrote:
> > We could possibly add additional checks, like trying to verify that
> > pg_control has the same inode number it used to.  But I'm afraid that
> > would add portability issues and false-positive hazards that would
> > outweigh the value.
> 
> Not sure you remember the incident, but I think years ago that would
> have saved me some heartache ;-)

I remember it, but I'm not sure it would have helped you.  As I recall,
your trouble was that after a reboot the init script decided to initdb
the mount point -- postmaster wouldn't have been running at all ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Idea for improving buildfarm robustness

From: Joe Conway
On 09/29/2015 01:48 PM, Alvaro Herrera wrote:
> Joe Conway wrote:
>> On 09/29/2015 12:47 PM, Tom Lane wrote:
>>> We could possibly add additional checks, like trying to verify that
>>> pg_control has the same inode number it used to.  But I'm afraid that
>>> would add portability issues and false-positive hazards that would
>>> outweigh the value.
>>
>> Not sure you remember the incident, but I think years ago that would
>> have saved me some heartache ;-)
>
> I remember it, but I'm not sure it would have helped you.  As I recall,
> your trouble was that after a reboot the init script decided to initdb
> the mount point -- postmaster wouldn't have been running at all ...

Right, which the init script no longer does as far as I'm aware, so
hopefully will never happen again to anyone.

But it was still a case where the postmaster started on one copy of
PGDATA (the newly init'd copy), and then the contents of the real PGDATA
were swapped in (when the filesystem was finally mounted), causing
corruption to the production data.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Re: Idea for improving buildfarm robustness

From: Alvaro Herrera
Joe Conway wrote:
> On 09/29/2015 01:48 PM, Alvaro Herrera wrote:

> > I remember it, but I'm not sure it would have helped you.  As I recall,
> > your trouble was that after a reboot the init script decided to initdb
> > the mount point -- postmaster wouldn't have been running at all ...
> 
> Right, which the init script no longer does as far as I'm aware, so
> hopefully will never happen again to anyone.

Yeah.

> But it was still a case where the postmaster started on one copy of
> PGDATA (the newly init'd copy), and then the contents of the real PGDATA
> were swapped in (when the filesystem was finally mounted), causing
> corruption to the production data.

Ah, I didn't remember that part of it, but it makes sense.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Idea for improving buildfarm robustness

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> Give me source with the change, and I'll put it on a cheap, low-bandwidth
> AWS instance and hammer the heck out of it.  That should raise any false
> positives we can expect.

Here's a draft patch against HEAD (looks like it will work on 9.5 or
9.4 without modifications, too).

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index baa43b2..52c9acd 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** static void CloseServerPorts(int status,
*** 373,378 ****
--- 373,379 ----
  static void unlink_external_pid_file(int status, Datum arg);
  static void getInstallationPaths(const char *argv0);
  static void checkDataDir(void);
+ static bool recheckDataDir(void);
  static Port *ConnCreate(int serverFd);
  static void ConnFree(Port *port);
  static void reset_shared(int port);
*************** checkDataDir(void)
*** 1490,1495 ****
--- 1491,1539 ----
  }

  /*
+  * Revalidate the data directory; return TRUE if OK, FALSE if not
+  *
+  * We don't try to check everything that checkDataDir() does.  Ideally, we'd
+  * return FALSE *only* if the data directory has been deleted.  As a proxy
+  * for that that matches a condition that checkDataDir() checked, verify that
+  * pg_control still exists.  Because the postmaster will quit if we return
+  * FALSE, do not do so if there is any doubt or possibly-transient failure.
+  * Better to wait till we're sure.
+  *
+  * Unlike checkDataDir(), we assume we've chdir'd into the data directory.
+  */
+ static bool
+ recheckDataDir(void)
+ {
+     const char *path = "global/pg_control";
+     FILE       *fp;
+
+     fp = AllocateFile(path, PG_BINARY_R);
+     if (fp != NULL)
+     {
+         FreeFile(fp);
+         return true;
+     }
+
+     /*
+      * There are many foreseeable false-positive error conditions, for example
+      * EINTR or ENFILE should not cause us to fail.  For safety, fail only on
+      * enumerated clearly-something-is-wrong conditions.
+      */
+     switch (errno)
+     {
+         case ENOENT:
+         case ENOTDIR:
+             elog(LOG, "could not open file \"%s\": %m", path);
+             return false;
+         default:
+             elog(LOG, "could not open file \"%s\": %m; continuing anyway",
+                  path);
+             return true;
+     }
+ }
+
+ /*
   * Determine how long should we let ServerLoop sleep.
   *
   * In normal conditions we wait at most one minute, to ensure that the other
*************** ServerLoop(void)
*** 1602,1610 ****
      fd_set        readmask;
      int            nSockets;
      time_t        now,
                  last_touch_time;

!     last_touch_time = time(NULL);

      nSockets = initMasks(&readmask);

--- 1646,1655 ----
      fd_set        readmask;
      int            nSockets;
      time_t        now,
+                 last_dir_recheck_time,
                  last_touch_time;

!     last_dir_recheck_time = last_touch_time = time(NULL);

      nSockets = initMasks(&readmask);

*************** ServerLoop(void)
*** 1754,1772 ****
          if (StartWorkerNeeded || HaveCrashedWorker)
              maybe_start_bgworker();

-         /*
-          * Touch Unix socket and lock files every 58 minutes, to ensure that
-          * they are not removed by overzealous /tmp-cleaning tasks.  We assume
-          * no one runs cleaners with cutoff times of less than an hour ...
-          */
-         now = time(NULL);
-         if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
-         {
-             TouchSocketFiles();
-             TouchSocketLockFiles();
-             last_touch_time = now;
-         }
-
  #ifdef HAVE_PTHREAD_IS_THREADED_NP

          /*
--- 1799,1804 ----
*************** ServerLoop(void)
*** 1793,1798 ****
--- 1825,1868 ----
              /* reset flag so we don't SIGKILL again */
              AbortStartTime = 0;
          }
+
+         /*
+          * Lastly, check to see if it's time to do some things that we don't
+          * want to do every single time through the loop, because they're a
+          * bit expensive.  Note that there's up to a minute of slop in when
+          * these tasks will be performed, since DetermineSleepTime() will let
+          * us sleep at most that long.
+          */
+         now = time(NULL);
+
+         /*
+          * Once a minute, verify that $PGDATA hasn't been removed.  If it has,
+          * we want to commit hara-kiri.  This avoids having postmasters and
+          * child processes hanging around after their database is gone, and
+          * maybe causing problems if a new database cluster is created in the
+          * same place.
+          */
+         if (now - last_dir_recheck_time >= 1 * SECS_PER_MINUTE)
+         {
+             if (!recheckDataDir())
+             {
+                 elog(LOG, "shutting down because data directory is gone");
+                 kill(MyProcPid, SIGQUIT);
+             }
+             last_dir_recheck_time = now;
+         }
+
+         /*
+          * Touch Unix socket and lock files every 58 minutes, to ensure that
+          * they are not removed by overzealous /tmp-cleaning tasks.  We assume
+          * no one runs cleaners with cutoff times of less than an hour ...
+          */
+         if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
+         {
+             TouchSocketFiles();
+             TouchSocketLockFiles();
+             last_touch_time = now;
+         }
      }
  }


Re: Idea for improving buildfarm robustness

From: Tom Lane
I wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Give me source with the change, and I'll put it on a cheap, low-bandwidth
>> AWS instance and hammer the heck out of it.  That should raise any false
>> positives we can expect.

> Here's a draft patch against HEAD (looks like it will work on 9.5 or
> 9.4 without modifications, too).

BTW: in addition to whatever AWS testing Josh has in mind, it'd be good if
someone tried it on Windows.  AFAIK, the self-kill() should work in the
postmaster on Windows, but that should be checked.  Also, does the set of
errnos it checks cover typical deletion cases on Windows?  Try both
removal of $PGDATA in toto and removal of just pg_control or just global/.
        regards, tom lane



Re: Idea for improving buildfarm robustness

From: Josh Berkus
On 09/29/2015 12:47 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> In general, having the postmaster survive deletion of PGDATA is
>> suboptimal.  In rare cases of having it survive installation of a new
>> PGDATA (via PITR restore, for example), I've even seen the zombie
>> postmaster corrupt the data files.
> 
> However ... if you'd simply deleted everything *under* $PGDATA but not
> that directory itself, then this type of failure mode is 100% plausible.
> And that's not an unreasonable thing to do, especially if you've set
> things up so that $PGDATA's parent is not a writable directory.

I don't remember the exact setup, but this is likely the case.  Probably
1/3 of the systems I monitor have a root-owned mount point for PGDATA's
parent directory.

> Testing accessibility of "global/pg_control" would be enough to catch this
> case, but only if we do it before you create a new one.  So that seems
> like an argument for making the test relatively often.  The once-a-minute
> option is sounding better and better.
> 
> We could possibly add additional checks, like trying to verify that
> pg_control has the same inode number it used to.  But I'm afraid that
> would add portability issues and false-positive hazards that would
> outweigh the value.

It's not worth doing extra stuff for this.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Idea for improving buildfarm robustness

From: Michael Paquier
On Wed, Sep 30, 2015 at 7:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Josh Berkus <josh@agliodbs.com> writes:
>>> Give me source with the change, and I'll put it on a cheap, low-bandwidth
>>> AWS instance and hammer the heck out of it.  That should raise any false
>>> positives we can expect.
>
>> Here's a draft patch against HEAD (looks like it will work on 9.5 or
>> 9.4 without modifications, too).
>
> BTW: in addition to whatever AWS testing Josh has in mind, it'd be good if
> someone tried it on Windows.  AFAIK, the self-kill() should work in the
> postmaster on Windows, but that should be checked.  Also, does the set of
> errnos it checks cover typical deletion cases on Windows?  Try both
> removal of $PGDATA in toto and removal of just pg_control or just global/.

Just tested on Windows, and this is working fine for me. It seems to
me as well that looking only for ENOENT and ENOTDIR is fine (here is
what I looked at for reference, note the extra EXDEV or STRUNCATE for
example with MS 2015):
https://msdn.microsoft.com/en-us/library/5814770t.aspx
-- 
Michael



Re: Idea for improving buildfarm robustness

From: Jim Nasby
On 9/29/15 4:13 PM, Alvaro Herrera wrote:
> Joe Conway wrote:
>> On 09/29/2015 01:48 PM, Alvaro Herrera wrote:
>
>>> I remember it, but I'm not sure it would have helped you.  As I recall,
>>> your trouble was that after a reboot the init script decided to initdb
>>> the mount point -- postmaster wouldn't have been running at all ...
>>
>> Right, which the init script no longer does as far as I'm aware, so
>> hopefully will never happen again to anyone.
>
> Yeah.
>
>> But it was still a case where the postmaster started on one copy of
>> PGDATA (the newly init'd copy), and then the contents of the real PGDATA
>> were swapped in (when the filesystem was finally mounted), causing
>> corruption to the production data.
>
> Ah, I didn't remember that part of it, but it makes sense.

Ouch. So it sounds like there's value to seeing if pg_control isn't what 
we expect it to be.

Instead of looking at the inode (portability problem), what if 
pg_control contained a random number that was created at initdb time? On 
startup postmaster would read that value and then if it ever changed 
after that you'd know something just went wrong.

Perhaps even stronger would be to write a new random value on startup; 
that way you'd know if an old copy accidentally got put in place.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Idea for improving buildfarm robustness

From: Andrew Dunstan

On 09/30/2015 01:18 AM, Michael Paquier wrote:
> On Wed, Sep 30, 2015 at 7:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wrote:
>>> Josh Berkus <josh@agliodbs.com> writes:
>>>> Give me source with the change, and I'll put it on a cheap, low-bandwidth
>>>> AWS instance and hammer the heck out of it.  That should raise any false
>>>> positives we can expect.
>>> Here's a draft patch against HEAD (looks like it will work on 9.5 or
>>> 9.4 without modifications, too).
>> BTW: in addition to whatever AWS testing Josh has in mind, it'd be good if
>> someone tried it on Windows.  AFAIK, the self-kill() should work in the
>> postmaster on Windows, but that should be checked.  Also, does the set of
>> errnos it checks cover typical deletion cases on Windows?  Try both
>> removal of $PGDATA in toto and removal of just pg_control or just global/.
> Just tested on Windows, and this is working fine for me. It seems to
> me as well that looking only for ENOENT and ENOTDIR is fine (here is
> what I looked at for reference, note the extra EXDEV or STRUNCATE for
> example with MS 2015):
> https://msdn.microsoft.com/en-us/library/5814770t.aspx


Incidentally, AWS and Windows are not mutually exclusive. I used an AWS 
Windows instance the other day when I validated the instructions for 
building with Mingw.

cheers

andrew




Re: Idea for improving buildfarm robustness

From: Tom Lane
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> Ouch. So it sounds like there's value to seeing if pg_control isn't what 
> we expect it to be.

> Instead of looking at the inode (portability problem), what if 
> pg_control contained a random number that was created at initdb time? On 
> startup postmaster would read that value and then if it ever changed 
> after that you'd know something just went wrong.

> Perhaps even stronger would be to write a new random value on startup; 
> that way you'd know if an old copy accidentally got put in place.

Or maybe better than an otherwise-useless random value, write the
postmaster's start time.

But none of these would offer that much added safety IMV.  If you don't
restart the postmaster very often, it's not unlikely that what you were
trying to restore is a backup from earlier in the current postmaster's
run.  Another problem with checking the contents of pg_control, rather
than only its presence, is that the checkpointer will overwrite it every
so often, and thereby put back whatever we were expecting to find there.
If the postmaster's recheck interval is significantly less than the
checkpoint interval, then you'll *probably* notice before the evidence
vanishes, but it's hardly guaranteed.

It strikes me that a different approach that might be of value would
be to re-read postmaster.pid and make sure that (a) it's still there
and (b) it still contains the current postmaster's PID.  This would
be morally equivalent to what Jim suggests above, and it would dodge
the checkpointer-destroys-the-evidence problem, and it would have the
additional advantage that we'd notice when a brain-dead DBA decides
to manually remove postmaster.pid so he can start a new postmaster.
(It's probably far too late to avoid data corruption at that point,
but better late than never.)

This is still not bulletproof against all overwrite-with-a-backup
scenarios, but it seems like a noticeable improvement over what we
discussed yesterday.
        regards, tom lane



Re: Idea for improving buildfarm robustness

From: Josh Berkus
So, testing:

1. I tested running an AWS instance (Ubuntu 14.04) into 100% IOWAIT, and
the shutdown didn't kick in even when storage went into full "D" state.  It's
possible that other kinds of remote storage failures would cause a
shutdown, but don't we want them to?

2. I tested deleting /pgdata/* several times (with pgbench running), and
Postgres shut down within 20 seconds each time.

3. I tested messing with the permissions on pg_control and global, and
Postgres threw other errors but continued running.

4. I mv'd the files and that didn't trigger a shutdown.

5. I did a fast swap:

rm -rf /pgdata/*
cp -p -r /pgdata2/* /pgdata/

... as expected, this did NOT cause postgres to shut down.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Idea for improving buildfarm robustness

From: Tom Lane
I wrote:
> It strikes me that a different approach that might be of value would
> be to re-read postmaster.pid and make sure that (a) it's still there
> and (b) it still contains the current postmaster's PID.  This would
> be morally equivalent to what Jim suggests above, and it would dodge
> the checkpointer-destroys-the-evidence problem, and it would have the
> additional advantage that we'd notice when a brain-dead DBA decides
> to manually remove postmaster.pid so he can start a new postmaster.
> (It's probably far too late to avoid data corruption at that point,
> but better late than never.)

> This is still not bulletproof against all overwrite-with-a-backup
> scenarios, but it seems like a noticeable improvement over what we
> discussed yesterday.

Here's a rewritten patch that looks at postmaster.pid instead of
pg_control.  It should be effectively the same as the prior patch in terms
of response to directory-removal cases, and it should also catch many
overwrite cases.

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index baa43b2..498ebfa 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** ServerLoop(void)
*** 1602,1610 ****
      fd_set        readmask;
      int            nSockets;
      time_t        now,
                  last_touch_time;

!     last_touch_time = time(NULL);

      nSockets = initMasks(&readmask);

--- 1602,1611 ----
      fd_set        readmask;
      int            nSockets;
      time_t        now,
+                 last_lockfile_recheck_time,
                  last_touch_time;

!     last_lockfile_recheck_time = last_touch_time = time(NULL);

      nSockets = initMasks(&readmask);

*************** ServerLoop(void)
*** 1754,1772 ****
          if (StartWorkerNeeded || HaveCrashedWorker)
              maybe_start_bgworker();

-         /*
-          * Touch Unix socket and lock files every 58 minutes, to ensure that
-          * they are not removed by overzealous /tmp-cleaning tasks.  We assume
-          * no one runs cleaners with cutoff times of less than an hour ...
-          */
-         now = time(NULL);
-         if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
-         {
-             TouchSocketFiles();
-             TouchSocketLockFiles();
-             last_touch_time = now;
-         }
-
  #ifdef HAVE_PTHREAD_IS_THREADED_NP

          /*
--- 1755,1760 ----
*************** ServerLoop(void)
*** 1793,1798 ****
--- 1781,1828 ----
              /* reset flag so we don't SIGKILL again */
              AbortStartTime = 0;
          }
+
+         /*
+          * Lastly, check to see if it's time to do some things that we don't
+          * want to do every single time through the loop, because they're a
+          * bit expensive.  Note that there's up to a minute of slop in when
+          * these tasks will be performed, since DetermineSleepTime() will let
+          * us sleep at most that long.
+          */
+         now = time(NULL);
+
+         /*
+          * Once a minute, verify that postmaster.pid hasn't been removed or
+          * overwritten.  If it has, we commit hara-kiri.  This avoids having
+          * postmasters and child processes hanging around after their database
+          * is gone, and maybe causing problems if a new database cluster is
+          * created in the same place.  It also provides some protection
+          * against a DBA foolishly removing postmaster.pid and manually
+          * starting a new postmaster.  Data corruption is likely to ensue from
+          * that anyway, but we can minimize the damage by aborting ASAP.
+          */
+         if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
+         {
+             if (!RecheckDataDirLockFile())
+             {
+                 ereport(LOG,
+                         (errmsg("performing immediate shutdown because data directory lock file is invalid")));
+                 kill(MyProcPid, SIGQUIT);
+             }
+             last_lockfile_recheck_time = now;
+         }
+
+         /*
+          * Touch Unix socket and lock files every 58 minutes, to ensure that
+          * they are not removed by overzealous /tmp-cleaning tasks.  We assume
+          * no one runs cleaners with cutoff times of less than an hour ...
+          */
+         if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
+         {
+             TouchSocketFiles();
+             TouchSocketLockFiles();
+             last_touch_time = now;
+         }
      }
  }

diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index f0099d3..b6270e1 100644
*** a/src/backend/utils/init/miscinit.c
--- b/src/backend/utils/init/miscinit.c
*************** AddToDataDirLockFile(int target_line, co
*** 1202,1207 ****
--- 1202,1277 ----
  }


+ /*
+  * Recheck that the data directory lock file still exists with expected
+  * content.  Return TRUE if the lock file appears OK, FALSE if it isn't.
+  *
+  * We call this periodically in the postmaster.  The idea is that if the
+  * lock file has been removed or replaced by another postmaster, we should
+  * do a panic database shutdown.  Therefore, we should return TRUE if there
+  * is any doubt: we do not want to cause a panic shutdown unnecessarily.
+  * Transient failures like EINTR or ENFILE should not cause us to fail.
+  * (If there really is something wrong, we'll detect it on a future recheck.)
+  */
+ bool
+ RecheckDataDirLockFile(void)
+ {
+     int            fd;
+     int            len;
+     long        file_pid;
+     char        buffer[BLCKSZ];
+
+     fd = open(DIRECTORY_LOCK_FILE, O_RDWR | PG_BINARY, 0);
+     if (fd < 0)
+     {
+         /*
+          * There are many foreseeable false-positive error conditions.  For
+          * safety, fail only on enumerated clearly-something-is-wrong
+          * conditions.
+          */
+         switch (errno)
+         {
+             case ENOENT:
+             case ENOTDIR:
+                 /* disaster */
+                 ereport(LOG,
+                         (errcode_for_file_access(),
+                          errmsg("could not open file \"%s\": %m",
+                                 DIRECTORY_LOCK_FILE)));
+                 return false;
+             default:
+                 /* non-fatal, at least for now */
+                 ereport(LOG,
+                         (errcode_for_file_access(),
+                   errmsg("could not open file \"%s\": %m; continuing anyway",
+                          DIRECTORY_LOCK_FILE)));
+                 return true;
+         }
+     }
+     len = read(fd, buffer, sizeof(buffer) - 1);
+     if (len < 0)
+     {
+         ereport(LOG,
+                 (errcode_for_file_access(),
+                  errmsg("could not read from file \"%s\": %m",
+                         DIRECTORY_LOCK_FILE)));
+         close(fd);
+         return true;            /* treat read failure as nonfatal */
+     }
+     buffer[len] = '\0';
+     close(fd);
+     file_pid = atol(buffer);
+     if (file_pid == getpid())
+         return true;            /* all is well */
+
+     /* Trouble: someone's overwritten the lock file */
+     ereport(LOG,
+             (errmsg("lock file \"%s\" contains wrong PID: %ld instead of %ld",
+                     DIRECTORY_LOCK_FILE, file_pid, (long) getpid())));
+     return false;
+ }
+
+
  /*-------------------------------------------------------------------------
   *                Version checking support
   *-------------------------------------------------------------------------
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 80ac732..87133bd 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*************** extern void CreateSocketLockFile(const c
*** 450,455 ****
--- 450,456 ----
                       const char *socketDir);
  extern void TouchSocketLockFiles(void);
  extern void AddToDataDirLockFile(int target_line, const char *str);
+ extern bool RecheckDataDirLockFile(void);
  extern void ValidatePgVersion(const char *path);
  extern void process_shared_preload_libraries(void);
  extern void process_session_preload_libraries(void);

Re: Idea for improving buildfarm robustness

From: Tom Lane
I wrote:
> Here's a rewritten patch that looks at postmaster.pid instead of
> pg_control.  It should be effectively the same as the prior patch in terms
> of response to directory-removal cases, and it should also catch many
> overwrite cases.

BTW, my thought at the moment is to wait till after next week's releases
to push this in.  I think it's probably solid, but it doesn't seem like
it's worth taking the risk of pushing shortly before a wrap date.

If anyone wants to argue for including it in the releases, speak up ...
        regards, tom lane



Re: Idea for improving buildfarm robustness

From: Michael Paquier
On Sat, Oct 3, 2015 at 1:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Here's a rewritten patch that looks at postmaster.pid instead of
>> pg_control.  It should be effectively the same as the prior patch in terms
>> of response to directory-removal cases, and it should also catch many
>> overwrite cases.
>
> BTW, my thought at the moment is to wait till after next week's releases
> to push this in.  I think it's probably solid, but it doesn't seem like
> it's worth taking the risk of pushing shortly before a wrap date.

That seems a wiser approach to me. Down to which version are you
planning a backpatch? As this is aimed for the buildfarm stability with
TAP stuff, 9.4?
-- 
Michael

Re: Idea for improving buildfarm robustness

From: Tom Lane
Michael Paquier <michael.paquier@gmail.com> writes:
> On Sat, Oct 3, 2015 at 1:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, my thought at the moment is to wait till after next week's releases
>> to push this in.  I think it's probably solid, but it doesn't seem like
>> it's worth taking the risk of pushing shortly before a wrap date.

> That seems a wiser approach to me. Down to which version are you planning a
> backpatch? As this is aimed for the buildfarm stability with TAP stuff, 9.4?

What we'd discussed was applying this to all branches that contain the
5-second-timeout logic, which is everything back to 9.1.  The branches
that have TAP tests have a wider cross-section for failure in the
buildfarm because more postmaster starts are involved, but all of them
are capable of getting burnt this way --- see shearwater's results for
instance.
        regards, tom lane



Re: Idea for improving buildfarm robustness

From: Josh Berkus
On 10/02/2015 09:39 PM, Tom Lane wrote:
> I wrote:
>> Here's a rewritten patch that looks at postmaster.pid instead of
>> pg_control.  It should be effectively the same as the prior patch in terms
>> of response to directory-removal cases, and it should also catch many
>> overwrite cases.
> 
> BTW, my thought at the moment is to wait till after next week's releases
> to push this in.  I think it's probably solid, but it doesn't seem like
> it's worth taking the risk of pushing shortly before a wrap date.
> 
> If anyone wants to argue for including it in the releases, speak up ...

Wait, we're backpatching this?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Idea for improving buildfarm robustness

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> On 10/02/2015 09:39 PM, Tom Lane wrote:
>> I wrote:
>>> Here's a rewritten patch that looks at postmaster.pid instead of
>>> pg_control.  It should be effectively the same as the prior patch in terms
>>> of response to directory-removal cases, and it should also catch many
>>> overwrite cases.

>> BTW, my thought at the moment is to wait till after next week's releases
>> to push this in.  I think it's probably solid, but it doesn't seem like
>> it's worth taking the risk of pushing shortly before a wrap date.
>> 
>> If anyone wants to argue for including it in the releases, speak up ...

> Wait, we're backpatching this?

Of course.  It's not going to do much for buildfarm stability if it's
only in HEAD.
        regards, tom lane