Thread: Idea for improving buildfarm robustness
A problem the buildfarm has had for a long time is that if for some reason the scripts fail to stop a test postmaster, the postmaster process will hang around and cause subsequent runs to fail because of socket conflicts. This seems to have gotten a lot worse lately due to the influx of very slow buildfarm machines, but the risk has always been there.

I've been thinking about teaching the buildfarm script to "kill -9" any postmasters left around at the end of the run, but that's fairly problematic: how do you find such processes (since "ps" output isn't hugely portable, especially not to Windows), and how do you tell them apart from postmasters not started by the script? So the idea was on hold.

But today I thought of another way: suppose that we teach the postmaster to commit hara-kiri if the $PGDATA directory goes away. Since the buildfarm script definitely does remove all the temporary data directories it creates, this ought to get the job done.

An easy way to do that would be to have it check every so often if pg_control can still be read. We should not have it fail on ENFILE or EMFILE, since that would create a new failure hazard under heavy load, but ENOENT or similar would be reasonable grounds for deciding that something is horribly broken. (At least on Windows, failing on EPERM doesn't seem wise either, since we've seen antivirus products randomly causing such errors.)

I wouldn't want to do this every time through the postmaster's main loop, but we could do this once an hour for no added cost by adding the check where it does TouchSocketLockFiles; or once every few minutes if we carried a separate variable like last_touch_time. Once an hour would be plenty to fix the buildfarm's problem, I should think.

Another question is what exactly "commit hara-kiri" should consist of. We could just abort() or _exit(1) and leave it to child processes to notice that the postmaster is gone, or we could make an effort to clean up. I'd be a bit inclined to treat it like a SIGQUIT situation, ie kill all the children and exit. The children are probably having problems of their own if the data directory's gone, so forcing termination might be best to keep them from getting stuck.

Also, perhaps we'd only enable this behavior in --enable-cassert builds, to avoid any risk of a postmaster incorrectly choosing to suicide in a production scenario. Or maybe that's overly conservative.

Thoughts?

			regards, tom lane
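To make the shape of that probe concrete, here is a minimal self-contained sketch in the spirit of the proposal above; it is illustrative only, not a patch from this thread. It assumes the current directory is $PGDATA (as it is in the postmaster), and the helper name and exact errno whitelist are invented for the example.

/*
 * Illustrative sketch only: probe global/pg_control relative to the
 * current directory, treating only "clearly gone" errors as fatal and
 * everything else (EMFILE, ENFILE, EPERM, EINTR, ...) as transient.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

static bool
data_dir_looks_gone(void)
{
	FILE	   *fp = fopen("global/pg_control", "rb");

	if (fp != NULL)
	{
		fclose(fp);
		return false;			/* all is well */
	}
	/* fail only on errors that really mean "somebody deleted it" */
	return (errno == ENOENT || errno == ENOTDIR);
}

A caller in the server loop would invoke something like this at most once per chosen interval and escalate to a SIGQUIT-style shutdown only when it returns true.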
On 09/29/2015 11:48 AM, Tom Lane wrote:
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away. Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

This would also be useful for production. I can't count the number of times I've accidentally blown away a replica's PGDATA without shutting the postmaster down first, and then had to do a bunch of kill -9.

In general, having the postmaster survive deletion of PGDATA is suboptimal. In rare cases of having it survive installation of a new PGDATA (via PITR restore, for example), I've even seen the zombie postmaster corrupt the data files.

So if you want this change to be useful beyond the buildfarm, it should check every few minutes, and you'd SIGQUIT.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away. Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.

Yes, please.

> An easy way to do that would be to have it check every so often if
> pg_control can still be read. We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken. (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)

Sounds pretty reasonable to me.

> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time. Once an hour would be
> plenty to fix the buildfarm's problem, I should think.

I have a bad (?) habit of doing exactly this during development and would really like it to be a bit more often than once/hour, unless there's a particular problem with that.

> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up. I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit. The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.

I like the idea of killing all the children and then exiting.

> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario. Or maybe that's overly conservative.

That would work for my use-case. Perhaps only on --enable-cassert builds for back-branches but enable it in master and see how things go for 9.6? I agree that it feels overly conservative, but given our recent history, we should be overly cautious with the back branches.

> Thoughts?

Thanks!

Stephen
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> I wouldn't want to do this every time through the postmaster's main loop,
>> but we could do this once an hour for no added cost by adding the check
>> where it does TouchSocketLockFiles; or once every few minutes if we
>> carried a separate variable like last_touch_time. Once an hour would be
>> plenty to fix the buildfarm's problem, I should think.

> I have a bad (?) habit of doing exactly this during development and
> would really like it to be a bit more often than once/hour, unless
> there's a particular problem with that.

Yeah, Josh mentioned the same. It would only take another three or four lines of code to decouple it from TouchSocketLockFiles, and then it's just a question of how much are you worried about the performance cost of additional file-open attempts. I think either one-minute or five-minute intervals would be pretty defensible.

			regards, tom lane
On 09/29/2015 02:48 PM, Tom Lane wrote:
> A problem the buildfarm has had for a long time is that if for some reason
> the scripts fail to stop a test postmaster, the postmaster process will
> hang around and cause subsequent runs to fail because of socket conflicts.
> This seems to have gotten a lot worse lately due to the influx of very
> slow buildfarm machines, but the risk has always been there.
>
> I've been thinking about teaching the buildfarm script to "kill -9"
> any postmasters left around at the end of the run, but that's fairly
> problematic: how do you find such processes (since "ps" output isn't
> hugely portable, especially not to Windows), and how do you tell them
> apart from postmasters not started by the script? So the idea was on
> hold.
>
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away. Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.
>
> An easy way to do that would be to have it check every so often if
> pg_control can still be read. We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken. (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)
>
> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time. Once an hour would be
> plenty to fix the buildfarm's problem, I should think.
>
> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up. I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit. The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.
>
> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario. Or maybe that's overly conservative.
>
> Thoughts?
>

It's a fine idea. This is much more likely to be robust than any buildfarm client fix.

Not every buildfarm member uses cassert, so I'm not sure that's the best way to go. axolotl doesn't, and it's one of those that regularly has speed problems. Maybe a not-very-well-publicized GUC, or an environment setting? Or maybe just enable this anyway. If the data directory is gone what's the point in keeping the postmaster around? Shutting it down doesn't seem likely to cause any damage.

cheers

andrew
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> I wouldn't want to do this every time through the postmaster's main loop,
> >> but we could do this once an hour for no added cost by adding the check
> >> where it does TouchSocketLockFiles; or once every few minutes if we
> >> carried a separate variable like last_touch_time. Once an hour would be
> >> plenty to fix the buildfarm's problem, I should think.
>
> > I have a bad (?) habit of doing exactly this during development and
> > would really like it to be a bit more often than once/hour, unless
> > there's a particular problem with that.
>
> Yeah, Josh mentioned the same. It would only take another three or four
> lines of code to decouple it from TouchSocketLockFiles, and then it's
> just a question of how much are you worried about the performance cost of
> additional file-open attempts. I think either one-minute or five-minute
> intervals would be pretty defensible.

Perhaps I'm missing something, but it doesn't strike me as a terribly expensive operation, and once a minute would work out quite well for my needs, at least. Running for long after pg_control has disappeared doesn't strike me as a great idea anyway..

Thanks!

Stephen
Andrew Dunstan <andrew@dunslane.net> writes:
> On 09/29/2015 02:48 PM, Tom Lane wrote:
>> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
>> to avoid any risk of a postmaster incorrectly choosing to suicide in a
>> production scenario. Or maybe that's overly conservative.

> Not every buildfarm member uses cassert, so I'm not sure that's the best
> way to go. axolotl doesn't, and it's one of those that regularly has
> speed problems. Maybe a not-very-well-publicized GUC, or an environment
> setting? Or maybe just enable this anyway. If the data directory is gone
> what's the point in keeping the postmaster around? Shutting it down
> doesn't seem likely to cause any damage.

The only argument I can see against just turning it on all the time is the possibility of false positives. I mentioned ENFILE and EPERM as foreseeable false-positive conditions, and I'm worried that there might be others. It might be good if we have a small list of specific errnos that cause us to conclude we should die, rather than a small list that cause us not to. But as long as we're reasonably confident that we're seeing an error that means somebody deleted pg_control, I think abandoning ship is just fine.

			regards, tom lane
On 09/29/2015 12:18 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 09/29/2015 02:48 PM, Tom Lane wrote:
>>> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
>>> to avoid any risk of a postmaster incorrectly choosing to suicide in a
>>> production scenario. Or maybe that's overly conservative.
>
>> Not every buildfarm member uses cassert, so I'm not sure that's the best
>> way to go. axolotl doesn't, and it's one of those that regularly has
>> speed problems. Maybe a not-very-well-publicized GUC, or an environment
>> setting? Or maybe just enable this anyway. If the data directory is gone
>> what's the point in keeping the postmaster around? Shutting it down
>> doesn't seem likely to cause any damage.
>
> The only argument I can see against just turning it on all the time is
> the possibility of false positives. I mentioned ENFILE and EPERM as
> foreseeable false-positive conditions, and I'm worried that there might be
> others. It might be good if we have a small list of specific errnos that
> cause us to conclude we should die, rather than a small list that cause us
> not to. But as long as we're reasonably confident that we're seeing an
> error that means somebody deleted pg_control, I think abandoning ship
> is just fine.

Give me source with the change, and I'll put it on a cheap, low-bandwidth AWS instance and hammer the heck out of it. That should raise any false positives we can expect.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> On 09/29/2015 11:48 AM, Tom Lane wrote:
>> But today I thought of another way: suppose that we teach the postmaster
>> to commit hara-kiri if the $PGDATA directory goes away. Since the
>> buildfarm script definitely does remove all the temporary data directories
>> it creates, this ought to get the job done.

> This would also be useful for production. I can't count the number of
> times I've accidentally blown away a replica's PGDATA without shutting
> the postmaster down first, and then had to do a bunch of kill -9.

> In general, having the postmaster survive deletion of PGDATA is
> suboptimal. In rare cases of having it survive installation of a new
> PGDATA (via PITR restore, for example), I've even seen the zombie
> postmaster corrupt the data files.

Side comment on that: if you'd actually removed $PGDATA, I can't see how that would happen. The postmaster and children would have open CWD handles to the now-disconnected-from-anything-else directory inode, which would not enable them to reach files created under the new directory inode. (They don't ever use absolute paths, only relative, or at least that's the way it's supposed to work.)

However ... if you'd simply deleted everything *under* $PGDATA but not that directory itself, then this type of failure mode is 100% plausible. And that's not an unreasonable thing to do, especially if you've set things up so that $PGDATA's parent is not a writable directory.

Testing accessibility of "global/pg_control" would be enough to catch this case, but only if we do it before you create a new one. So that seems like an argument for making the test relatively often. The once-a-minute option is sounding better and better.

We could possibly add additional checks, like trying to verify that pg_control has the same inode number it used to. But I'm afraid that would add portability issues and false-positive hazards that would outweigh the value.

			regards, tom lane
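A small standalone program illustrates that point about the orphaned working directory (POSIX/Linux behavior; the /tmp/pgdata_demo path and file name are invented for the example, and error checking is omitted): once the original directory has been unlinked, a same-named replacement is a different inode, so relative opens from the old cwd cannot reach files created under the new one.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
	const char *dir = "/tmp/pgdata_demo";

	mkdir(dir, 0700);
	chdir(dir);				/* like the postmaster chdir'ing into $PGDATA */

	rmdir(dir);				/* the "data directory" is removed ... */
	mkdir(dir, 0700);			/* ... and recreated at the same path */

	/* create a file in the *new* directory, via an absolute path */
	close(open("/tmp/pgdata_demo/pg_control", O_CREAT | O_WRONLY, 0600));

	/* the old cwd inode has no such entry, so this fails with ENOENT */
	if (open("pg_control", O_RDONLY) < 0)
		perror("relative open from orphaned cwd");

	unlink("/tmp/pgdata_demo/pg_control");
	rmdir(dir);
	return 0;
}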
Tom Lane wrote:
> Testing accessibility of "global/pg_control" would be enough to catch this
> case, but only if we do it before you create a new one. So that seems
> like an argument for making the test relatively often. The once-a-minute
> option is sounding better and better.

If we weren't afraid of portability issues or checks that only work on certain platforms, we could use inotify on linux and get it to signal postmaster when pg_control is deleted. There are various implementations of similar things on different platforms (kqueues on BSD, surely there's gotta be something in Linux) -- though admittedly that code may quickly become worse than the select/poll loops (which are ugly enough). Maybe it'd be okay if we just use a descriptor that sets the process latch when signalled.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
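A minimal Linux-only sketch of that inotify approach might look like the following (illustrative only, not proposed code; it assumes the current directory is $PGDATA):

/*
 * Watch global/pg_control and report when it is deleted or moved away.
 * Linux-specific; a real implementation would fold the descriptor into
 * the postmaster's existing wait loop instead of blocking on read().
 */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int
main(void)
{
	char		buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
	ssize_t		len;
	int			fd = inotify_init();

	if (fd < 0 || inotify_add_watch(fd, "global/pg_control",
									IN_DELETE_SELF | IN_MOVE_SELF) < 0)
	{
		perror("inotify setup failed");
		return 1;
	}

	len = read(fd, buf, sizeof(buf));	/* blocks until an event arrives */
	if (len > 0)
	{
		const struct inotify_event *ev = (const struct inotify_event *) buf;

		if (ev->mask & (IN_DELETE_SELF | IN_MOVE_SELF))
			fprintf(stderr, "pg_control vanished; would SIGQUIT here\n");
	}
	close(fd);
	return 0;
}

In the postmaster, the descriptor would presumably be added to the existing select()/poll() set, or used to set the process latch as suggested above, rather than read from in a blocking fashion.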
On 09/29/2015 12:47 PM, Tom Lane wrote:
> We could possibly add additional checks, like trying to verify that
> pg_control has the same inode number it used to. But I'm afraid that
> would add portability issues and false-positive hazards that would
> outweigh the value.

Not sure you remember the incident, but I think years ago that would have saved me some heartache ;-)

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
Joe Conway wrote:
> On 09/29/2015 12:47 PM, Tom Lane wrote:
> > We could possibly add additional checks, like trying to verify that
> > pg_control has the same inode number it used to. But I'm afraid that
> > would add portability issues and false-positive hazards that would
> > outweigh the value.
>
> Not sure you remember the incident, but I think years ago that would
> have saved me some heartache ;-)

I remember it, but I'm not sure it would have helped you. As I recall, your trouble was that after a reboot the init script decided to initdb the mount point -- postmaster wouldn't have been running at all ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 09/29/2015 01:48 PM, Alvaro Herrera wrote:
> Joe Conway wrote:
>> On 09/29/2015 12:47 PM, Tom Lane wrote:
>>> We could possibly add additional checks, like trying to verify that
>>> pg_control has the same inode number it used to. But I'm afraid that
>>> would add portability issues and false-positive hazards that would
>>> outweigh the value.
>>
>> Not sure you remember the incident, but I think years ago that would
>> have saved me some heartache ;-)
>
> I remember it, but I'm not sure it would have helped you. As I recall,
> your trouble was that after a reboot the init script decided to initdb
> the mount point -- postmaster wouldn't have been running at all ...

Right, which the init script no longer does as far as I'm aware, so hopefully it will never happen again to anyone. But it was still a case where the postmaster started on one copy of PGDATA (the newly init'd copy), and then the contents of the real PGDATA were swapped in (when the filesystem was finally mounted), causing corruption to the production data.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
Joe Conway wrote:
> On 09/29/2015 01:48 PM, Alvaro Herrera wrote:
> > I remember it, but I'm not sure it would have helped you. As I recall,
> > your trouble was that after a reboot the init script decided to initdb
> > the mount point -- postmaster wouldn't have been running at all ...
>
> Right, which the init script no longer does as far as I'm aware, so
> hopefully will never happen again to anyone.

Yeah.

> But it was still a case where the postmaster started on one copy of
> PGDATA (the newly init'd copy), and then the contents of the real PGDATA
> was swapped in (when the filesystem was finally mounted), causing
> corruption to the production data.

Ah, I didn't remember that part of it, but it makes sense.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Josh Berkus <josh@agliodbs.com> writes:
> Give me source with the change, and I'll put it on a cheap, low-bandwith
> AWS instance and hammer the heck out of it. That should raise any false
> positives we can expect.

Here's a draft patch against HEAD (looks like it will work on 9.5 or 9.4 without modifications, too).

			regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index baa43b2..52c9acd 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** static void CloseServerPorts(int status,
*** 373,378 ****
--- 373,379 ----
  static void unlink_external_pid_file(int status, Datum arg);
  static void getInstallationPaths(const char *argv0);
  static void checkDataDir(void);
+ static bool recheckDataDir(void);
  static Port *ConnCreate(int serverFd);
  static void ConnFree(Port *port);
  static void reset_shared(int port);
*************** checkDataDir(void)
*** 1490,1495 ****
--- 1491,1539 ----
  }
  
  /*
+  * Revalidate the data directory; return TRUE if OK, FALSE if not
+  *
+  * We don't try to check everything that checkDataDir() does.  Ideally, we'd
+  * return FALSE *only* if the data directory has been deleted.  As a proxy
+  * for that that matches a condition that checkDataDir() checked, verify that
+  * pg_control still exists.  Because the postmaster will quit if we return
+  * FALSE, do not do so if there is any doubt or possibly-transient failure.
+  * Better to wait till we're sure.
+  *
+  * Unlike checkDataDir(), we assume we've chdir'd into the data directory.
+  */
+ static bool
+ recheckDataDir(void)
+ {
+     const char *path = "global/pg_control";
+     FILE       *fp;
+ 
+     fp = AllocateFile(path, PG_BINARY_R);
+     if (fp != NULL)
+     {
+         FreeFile(fp);
+         return true;
+     }
+ 
+     /*
+      * There are many foreseeable false-positive error conditions, for example
+      * EINTR or ENFILE should not cause us to fail.  For safety, fail only on
+      * enumerated clearly-something-is-wrong conditions.
+      */
+     switch (errno)
+     {
+         case ENOENT:
+         case ENOTDIR:
+             elog(LOG, "could not open file \"%s\": %m", path);
+             return false;
+         default:
+             elog(LOG, "could not open file \"%s\": %m; continuing anyway",
+                  path);
+             return true;
+     }
+ }
+ 
+ /*
   * Determine how long should we let ServerLoop sleep.
   *
   * In normal conditions we wait at most one minute, to ensure that the other
*************** ServerLoop(void)
*** 1602,1610 ****
      fd_set      readmask;
      int         nSockets;
      time_t      now,
                  last_touch_time;
  
!     last_touch_time = time(NULL);
  
      nSockets = initMasks(&readmask);
--- 1646,1655 ----
      fd_set      readmask;
      int         nSockets;
      time_t      now,
+                 last_dir_recheck_time,
                  last_touch_time;
  
!     last_dir_recheck_time = last_touch_time = time(NULL);
  
      nSockets = initMasks(&readmask);
*************** ServerLoop(void)
*** 1754,1772 ****
          if (StartWorkerNeeded || HaveCrashedWorker)
              maybe_start_bgworker();
  
-         /*
-          * Touch Unix socket and lock files every 58 minutes, to ensure that
-          * they are not removed by overzealous /tmp-cleaning tasks.  We assume
-          * no one runs cleaners with cutoff times of less than an hour ...
-          */
-         now = time(NULL);
-         if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
-         {
-             TouchSocketFiles();
-             TouchSocketLockFiles();
-             last_touch_time = now;
-         }
- 
  #ifdef HAVE_PTHREAD_IS_THREADED_NP
  
          /*
--- 1799,1804 ----
*************** ServerLoop(void)
*** 1793,1798 ****
--- 1825,1868 ----
              /* reset flag so we don't SIGKILL again */
              AbortStartTime = 0;
          }
+ 
+         /*
+          * Lastly, check to see if it's time to do some things that we don't
+          * want to do every single time through the loop, because they're a
+          * bit expensive.  Note that there's up to a minute of slop in when
+          * these tasks will be performed, since DetermineSleepTime() will let
+          * us sleep at most that long.
+          */
+         now = time(NULL);
+ 
+         /*
+          * Once a minute, verify that $PGDATA hasn't been removed.  If it has,
+          * we want to commit hara-kiri.  This avoids having postmasters and
+          * child processes hanging around after their database is gone, and
+          * maybe causing problems if a new database cluster is created in the
+          * same place.
+          */
+         if (now - last_dir_recheck_time >= 1 * SECS_PER_MINUTE)
+         {
+             if (!recheckDataDir())
+             {
+                 elog(LOG, "shutting down because data directory is gone");
+                 kill(MyProcPid, SIGQUIT);
+             }
+             last_dir_recheck_time = now;
+         }
+ 
+         /*
+          * Touch Unix socket and lock files every 58 minutes, to ensure that
+          * they are not removed by overzealous /tmp-cleaning tasks.  We assume
+          * no one runs cleaners with cutoff times of less than an hour ...
+          */
+         if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
+         {
+             TouchSocketFiles();
+             TouchSocketLockFiles();
+             last_touch_time = now;
+         }
      }
  }
I wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Give me source with the change, and I'll put it on a cheap, low-bandwith
>> AWS instance and hammer the heck out of it. That should raise any false
>> positives we can expect.

> Here's a draft patch against HEAD (looks like it will work on 9.5 or
> 9.4 without modifications, too).

BTW: in addition to whatever AWS testing Josh has in mind, it'd be good if someone tried it on Windows. AFAIK, the self-kill() should work in the postmaster on Windows, but that should be checked. Also, does the set of errnos it checks cover typical deletion cases on Windows? Try both removal of $PGDATA in toto and removal of just pg_control or just global/.

			regards, tom lane
On 09/29/2015 12:47 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> In general, having the postmaster survive deletion of PGDATA is
>> suboptimal. In rare cases of having it survive installation of a new
>> PGDATA (via PITR restore, for example), I've even seen the zombie
>> postmaster corrupt the data files.
>
> However ... if you'd simply deleted everything *under* $PGDATA but not
> that directory itself, then this type of failure mode is 100% plausible.
> And that's not an unreasonable thing to do, especially if you've set
> things up so that $PGDATA's parent is not a writable directory.

I don't remember the exact setup, but this is likely the case. Probably 1/3 of the systems I monitor have a root-owned mount point for PGDATA's parent directory.

> Testing accessibility of "global/pg_control" would be enough to catch this
> case, but only if we do it before you create a new one. So that seems
> like an argument for making the test relatively often. The once-a-minute
> option is sounding better and better.
>
> We could possibly add additional checks, like trying to verify that
> pg_control has the same inode number it used to. But I'm afraid that
> would add portability issues and false-positive hazards that would
> outweigh the value.

It's not worth doing extra stuff for this.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Wed, Sep 30, 2015 at 7:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Josh Berkus <josh@agliodbs.com> writes:
>>> Give me source with the change, and I'll put it on a cheap, low-bandwith
>>> AWS instance and hammer the heck out of it. That should raise any false
>>> positives we can expect.
>
>> Here's a draft patch against HEAD (looks like it will work on 9.5 or
>> 9.4 without modifications, too).
>
> BTW: in addition to whatever AWS testing Josh has in mind, it'd be good if
> someone tried it on Windows. AFAIK, the self-kill() should work in the
> postmaster on Windows, but that should be checked. Also, does the set of
> errnos it checks cover typical deletion cases on Windows? Try both
> removal of $PGDATA in toto and removal of just pg_control or just global/.

Just tested on Windows, and this is working fine for me. It seems to me as well that looking only for ENOENT and ENOTDIR is fine (here is what I looked at for reference, note the extra EXDEV or STRUNCATE for example with MS 2015):
https://msdn.microsoft.com/en-us/library/5814770t.aspx
-- 
Michael
On 9/29/15 4:13 PM, Alvaro Herrera wrote:
> Joe Conway wrote:
>> On 09/29/2015 01:48 PM, Alvaro Herrera wrote:
>
>>> I remember it, but I'm not sure it would have helped you. As I recall,
>>> your trouble was that after a reboot the init script decided to initdb
>>> the mount point -- postmaster wouldn't have been running at all ...
>>
>> Right, which the init script non longer does as far as I'm aware, so
>> hopefully will never happen again to anyone.
>
> Yeah.
>
>> But it was still a case where the postmaster started on one copy of
>> PGDATA (the newly init'd copy), and then the contents of the real PGDATA
>> was swapped in (when the filesystem was finally mounted), causing
>> corruption to the production data.
>
> Ah, I didn't remember that part of it, but it makes sense.

Ouch. So it sounds like there's value to seeing if pg_control isn't what we expect it to be.

Instead of looking at the inode (portability problem), what if pg_control contained a random number that was created at initdb time? On startup postmaster would read that value and then if it ever changed after that you'd know something just went wrong.

Perhaps even stronger would be to write a new random value on startup; that way you'd know if an old copy accidentally got put in place.

-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
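A hypothetical illustration of that scheme, assuming the token is the first eight bytes of some control file; the path, layout, and helper names are invented here and do not reflect the real pg_control format:

/*
 * Remember an identifying token read from a control file at startup,
 * and periodically verify that the file still carries the same token.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t startup_token;

static bool
read_token(uint64_t *token)
{
	FILE	   *fp = fopen("global/pg_control", "rb");

	if (fp == NULL)
		return false;
	if (fread(token, sizeof(*token), 1, fp) != 1)
	{
		fclose(fp);
		return false;
	}
	fclose(fp);
	return true;
}

/* called once at postmaster start */
static bool
remember_token(void)
{
	return read_token(&startup_token);
}

/* called from the periodic recheck */
static bool
token_unchanged(void)
{
	uint64_t	now_token;

	return read_token(&now_token) && now_token == startup_token;
}

The postmaster would call remember_token() once at startup and token_unchanged() from the periodic recheck; Tom's reply below discusses the limits of checking pg_control's contents this way.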
On 09/30/2015 01:18 AM, Michael Paquier wrote:
> On Wed, Sep 30, 2015 at 7:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wrote:
>>> Josh Berkus <josh@agliodbs.com> writes:
>>>> Give me source with the change, and I'll put it on a cheap, low-bandwith
>>>> AWS instance and hammer the heck out of it. That should raise any false
>>>> positives we can expect.
>>> Here's a draft patch against HEAD (looks like it will work on 9.5 or
>>> 9.4 without modifications, too).
>> BTW: in addition to whatever AWS testing Josh has in mind, it'd be good if
>> someone tried it on Windows. AFAIK, the self-kill() should work in the
>> postmaster on Windows, but that should be checked. Also, does the set of
>> errnos it checks cover typical deletion cases on Windows? Try both
>> removal of $PGDATA in toto and removal of just pg_control or just global/.
> Just tested on Windows, and this is working fine for me. It seems to
> me as well that looking only for ENOENT and ENOTDIR is fine (here is
> what I looked at for reference, note the extra EXDEV or STRUNCATE for
> example with MS 2015):
> https://msdn.microsoft.com/en-us/library/5814770t.aspx

Incidentally, AWS and Windows are not mutually exclusive. I used an AWS Windows instance the other day when I validated the instructions for building with Mingw.

cheers

andrew
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> Ouch. So it sounds like there's value to seeing if pg_control isn't what
> we expect it to be.

> Instead of looking at the inode (portability problem), what if
> pg_control contained a random number that was created at initdb time? On
> startup postmaster would read that value and then if it ever changed
> after that you'd know something just went wrong.

> Perhaps even stronger would be to write a new random value on startup;
> that way you'd know if an old copy accidentally got put in place.

Or maybe better than an otherwise-useless random value, write the postmaster's start time. But none of these would offer that much added safety IMV. If you don't restart the postmaster very often, it's not unlikely that what you were trying to restore is a backup from earlier in the current postmaster's run.

Another problem with checking the contents of pg_control, rather than only its presence, is that the checkpointer will overwrite it every so often, and thereby put back whatever we were expecting to find there. If the postmaster's recheck interval is significantly less than the checkpoint interval, then you'll *probably* notice before the evidence vanishes, but it's hardly guaranteed.

It strikes me that a different approach that might be of value would be to re-read postmaster.pid and make sure that (a) it's still there and (b) it still contains the current postmaster's PID. This would be morally equivalent to what Jim suggests above, and it would dodge the checkpointer-destroys-the-evidence problem, and it would have the additional advantage that we'd notice when a brain-dead DBA decides to manually remove postmaster.pid so he can start a new postmaster. (It's probably far too late to avoid data corruption at that point, but better late than never.)

This is still not bulletproof against all overwrite-with-a-backup scenarios, but it seems like a noticeable improvement over what we discussed yesterday.

			regards, tom lane
So, testing:

1. I tested running an AWS instance (Ubuntu 14.04) into 100% IOWAIT, and the shutdown didn't kick in even when storage went into full "D" state. It's possible that other kinds of remote storage failures would cause a shutdown, but don't we want them to?

2. I tested deleting /pgdata/* several times (with pgbench running), and Postgres shut down within 20 seconds each time.

3. I tested messing with the permissions on pg_control and global, and Postgres threw other errors but continued running.

4. I mv'd the files and that didn't trigger a shutdown.

5. I did a fast swap:

rm -rf /pgdata/*
cp -p -r /pgdata2/* /pgdata/

... as expected, this did NOT cause postgres to shut down.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
I wrote:
> It strikes me that a different approach that might be of value would
> be to re-read postmaster.pid and make sure that (a) it's still there
> and (b) it still contains the current postmaster's PID. This would
> be morally equivalent to what Jim suggests above, and it would dodge
> the checkpointer-destroys-the-evidence problem, and it would have the
> additional advantage that we'd notice when a brain-dead DBA decides
> to manually remove postmaster.pid so he can start a new postmaster.
> (It's probably far too late to avoid data corruption at that point,
> but better late than never.)

> This is still not bulletproof against all overwrite-with-a-backup
> scenarios, but it seems like a noticeable improvement over what we
> discussed yesterday.

Here's a rewritten patch that looks at postmaster.pid instead of pg_control. It should be effectively the same as the prior patch in terms of response to directory-removal cases, and it should also catch many overwrite cases.

			regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index baa43b2..498ebfa 100644
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
*************** ServerLoop(void)
*** 1602,1610 ****
      fd_set      readmask;
      int         nSockets;
      time_t      now,
                  last_touch_time;
  
!     last_touch_time = time(NULL);
  
      nSockets = initMasks(&readmask);
--- 1602,1611 ----
      fd_set      readmask;
      int         nSockets;
      time_t      now,
+                 last_lockfile_recheck_time,
                  last_touch_time;
  
!     last_lockfile_recheck_time = last_touch_time = time(NULL);
  
      nSockets = initMasks(&readmask);
*************** ServerLoop(void)
*** 1754,1772 ****
          if (StartWorkerNeeded || HaveCrashedWorker)
              maybe_start_bgworker();
  
-         /*
-          * Touch Unix socket and lock files every 58 minutes, to ensure that
-          * they are not removed by overzealous /tmp-cleaning tasks.  We assume
-          * no one runs cleaners with cutoff times of less than an hour ...
-          */
-         now = time(NULL);
-         if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
-         {
-             TouchSocketFiles();
-             TouchSocketLockFiles();
-             last_touch_time = now;
-         }
- 
  #ifdef HAVE_PTHREAD_IS_THREADED_NP
  
          /*
--- 1755,1760 ----
*************** ServerLoop(void)
*** 1793,1798 ****
--- 1781,1828 ----
              /* reset flag so we don't SIGKILL again */
              AbortStartTime = 0;
          }
+ 
+         /*
+          * Lastly, check to see if it's time to do some things that we don't
+          * want to do every single time through the loop, because they're a
+          * bit expensive.  Note that there's up to a minute of slop in when
+          * these tasks will be performed, since DetermineSleepTime() will let
+          * us sleep at most that long.
+          */
+         now = time(NULL);
+ 
+         /*
+          * Once a minute, verify that postmaster.pid hasn't been removed or
+          * overwritten.  If it has, we commit hara-kiri.  This avoids having
+          * postmasters and child processes hanging around after their database
+          * is gone, and maybe causing problems if a new database cluster is
+          * created in the same place.  It also provides some protection
+          * against a DBA foolishly removing postmaster.pid and manually
+          * starting a new postmaster.  Data corruption is likely to ensue from
+          * that anyway, but we can minimize the damage by aborting ASAP.
+          */
+         if (now - last_lockfile_recheck_time >= 1 * SECS_PER_MINUTE)
+         {
+             if (!RecheckDataDirLockFile())
+             {
+                 ereport(LOG,
+                         (errmsg("performing immediate shutdown because data directory lock file is invalid")));
+                 kill(MyProcPid, SIGQUIT);
+             }
+             last_lockfile_recheck_time = now;
+         }
+ 
+         /*
+          * Touch Unix socket and lock files every 58 minutes, to ensure that
+          * they are not removed by overzealous /tmp-cleaning tasks.  We assume
+          * no one runs cleaners with cutoff times of less than an hour ...
+          */
+         if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
+         {
+             TouchSocketFiles();
+             TouchSocketLockFiles();
+             last_touch_time = now;
+         }
      }
  }
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index f0099d3..b6270e1 100644
*** a/src/backend/utils/init/miscinit.c
--- b/src/backend/utils/init/miscinit.c
*************** AddToDataDirLockFile(int target_line, co
*** 1202,1207 ****
--- 1202,1277 ----
  }
  
+ /*
+  * Recheck that the data directory lock file still exists with expected
+  * content.  Return TRUE if the lock file appears OK, FALSE if it isn't.
+  *
+  * We call this periodically in the postmaster.  The idea is that if the
+  * lock file has been removed or replaced by another postmaster, we should
+  * do a panic database shutdown.  Therefore, we should return TRUE if there
+  * is any doubt: we do not want to cause a panic shutdown unnecessarily.
+  * Transient failures like EINTR or ENFILE should not cause us to fail.
+  * (If there really is something wrong, we'll detect it on a future recheck.)
+  */
+ bool
+ RecheckDataDirLockFile(void)
+ {
+     int         fd;
+     int         len;
+     long        file_pid;
+     char        buffer[BLCKSZ];
+ 
+     fd = open(DIRECTORY_LOCK_FILE, O_RDWR | PG_BINARY, 0);
+     if (fd < 0)
+     {
+         /*
+          * There are many foreseeable false-positive error conditions.  For
+          * safety, fail only on enumerated clearly-something-is-wrong
+          * conditions.
+          */
+         switch (errno)
+         {
+             case ENOENT:
+             case ENOTDIR:
+                 /* disaster */
+                 ereport(LOG,
+                         (errcode_for_file_access(),
+                          errmsg("could not open file \"%s\": %m",
+                                 DIRECTORY_LOCK_FILE)));
+                 return false;
+             default:
+                 /* non-fatal, at least for now */
+                 ereport(LOG,
+                         (errcode_for_file_access(),
+                          errmsg("could not open file \"%s\": %m; continuing anyway",
+                                 DIRECTORY_LOCK_FILE)));
+                 return true;
+         }
+     }
+     len = read(fd, buffer, sizeof(buffer) - 1);
+     if (len < 0)
+     {
+         ereport(LOG,
+                 (errcode_for_file_access(),
+                  errmsg("could not read from file \"%s\": %m",
+                         DIRECTORY_LOCK_FILE)));
+         close(fd);
+         return true;            /* treat read failure as nonfatal */
+     }
+     buffer[len] = '\0';
+     close(fd);
+     file_pid = atol(buffer);
+     if (file_pid == getpid())
+         return true;            /* all is well */
+ 
+     /* Trouble: someone's overwritten the lock file */
+     ereport(LOG,
+             (errmsg("lock file \"%s\" contains wrong PID: %ld instead of %ld",
+                     DIRECTORY_LOCK_FILE, file_pid, (long) getpid())));
+     return false;
+ }
+ 
+ 
  /*-------------------------------------------------------------------------
   *				Version checking support
   *-------------------------------------------------------------------------
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 80ac732..87133bd 100644
*** a/src/include/miscadmin.h
--- b/src/include/miscadmin.h
*************** extern void CreateSocketLockFile(const c
*** 450,455 ****
--- 450,456 ----
  					 const char *socketDir);
  extern void TouchSocketLockFiles(void);
  extern void AddToDataDirLockFile(int target_line, const char *str);
+ extern bool RecheckDataDirLockFile(void);
  extern void ValidatePgVersion(const char *path);
  extern void process_shared_preload_libraries(void);
  extern void process_session_preload_libraries(void);
I wrote:
> Here's a rewritten patch that looks at postmaster.pid instead of
> pg_control. It should be effectively the same as the prior patch in terms
> of response to directory-removal cases, and it should also catch many
> overwrite cases.

BTW, my thought at the moment is to wait till after next week's releases to push this in. I think it's probably solid, but it doesn't seem like it's worth taking the risk of pushing shortly before a wrap date.

If anyone wants to argue for including it in the releases, speak up ...

			regards, tom lane
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Sat, Oct 3, 2015 at 1:39 PM, Tom Lane <spandir="ltr"><<a href="mailto:tgl@sss.pgh.pa.us" target="_blank">tgl@sss.pgh.pa.us</a>></span> wrote:<br /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">I wrote:<br/> > Here's a rewritten patch that looks at postmaster.pid instead of<br /> > pg_control. It should be effectivelythe same as the prior patch in terms<br /> > of response to directory-removal cases, and it should also catchmany<br /> > overwrite cases.<br /><br /></span>BTW, my thought at the moment is to wait till after next week's releases<br/> to push this in. I think it's probably solid, but it doesn't seem like<br /> it's worth taking the risk ofpushing shortly before a wrap date.<br /></blockquote></div><br /></div><div class="gmail_extra">That seems a wiser approachto me. Down to which version are you planning a backpatch? As this is aimed for the buildfarm stability with TAPstuff, 9.4?<br />-- <br /><div class="gmail_signature">Michael<br /></div></div></div>
Michael Paquier <michael.paquier@gmail.com> writes:
> On Sat, Oct 3, 2015 at 1:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, my thought at the moment is to wait till after next week's releases
>> to push this in. I think it's probably solid, but it doesn't seem like
>> it's worth taking the risk of pushing shortly before a wrap date.

> That seems a wiser approach to me. Down to which version are you planning a
> backpatch? As this is aimed for the buildfarm stability with TAP stuff, 9.4?

What we'd discussed was applying this to all branches that contain the 5-second-timeout logic, which is everything back to 9.1. The branches that have TAP tests have a wider cross-section for failure in the buildfarm because more postmaster starts are involved, but all of them are capable of getting burnt this way --- see shearwater's results for instance.

			regards, tom lane
On 10/02/2015 09:39 PM, Tom Lane wrote:
> I wrote:
>> Here's a rewritten patch that looks at postmaster.pid instead of
>> pg_control. It should be effectively the same as the prior patch in terms
>> of response to directory-removal cases, and it should also catch many
>> overwrite cases.
>
> BTW, my thought at the moment is to wait till after next week's releases
> to push this in. I think it's probably solid, but it doesn't seem like
> it's worth taking the risk of pushing shortly before a wrap date.
>
> If anyone wants to argue for including it in the releases, speak up ...

Wait, we're backpatching this?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> On 10/02/2015 09:39 PM, Tom Lane wrote:
>> I wrote:
>>> Here's a rewritten patch that looks at postmaster.pid instead of
>>> pg_control. It should be effectively the same as the prior patch in terms
>>> of response to directory-removal cases, and it should also catch many
>>> overwrite cases.

>> BTW, my thought at the moment is to wait till after next week's releases
>> to push this in. I think it's probably solid, but it doesn't seem like
>> it's worth taking the risk of pushing shortly before a wrap date.
>>
>> If anyone wants to argue for including it in the releases, speak up ...

> Wait, we're backpatching this?

Of course. It's not going to do much for buildfarm stability if it's only in HEAD.

			regards, tom lane