Thread: Revised patch for fixing archiver shutdown behavior
The attached patch fixes archiver shutdown in what seems to me to be a sane way. With the patch, we send SIGQUIT to the archiver only for panic-stop situations (backend crash or immediate-mode shutdown). This is important because the postmaster is coded to send SIGQUIT to the entire process group, meaning we'd also ungracefully terminate any currently-running archive command, which does not seem like a good idea for normal exit. Instead, the shutdown protocol is that *after* the bgwriter has quit, we send SIGUSR1 (ie, the normal archiver wakeup signal) to the archiver. This ensures that it will do a normal archiving cycle after the last possible creation of WAL entries. The archiver is also modified to fall out of its wait loop more quickly when the postmaster has died (this is the same as Simon's previous one-liner patch), and to not exit until it has completed a full archive cycle since the postmaster died. The latter eliminates a race condition that existed before --- depending on timing, the CVS-HEAD archiver might or might not do a final archiving cycle. I also tweaked the postmaster so that the stats collector isn't told to quit until after the bgwriter finishes; this ensures that any stats reported from the bgwriter will be collected. One point needing discussion is that the postmaster is currently coded not to send SIGUSR1 to the archiver if a fast-mode shutdown is under way. I duplicated that in the added SIGUSR1 signal here, but I wonder whether it is sane or not. Comments? regards, tom lane
Attachment
Tom Lane wrote: > The attached patch fixes archiver shutdown in what seems to me to be > a sane way. With the patch, we send SIGQUIT to the archiver only for > panic-stop situations (backend crash or immediate-mode shutdown). > This is important because the postmaster is coded to send SIGQUIT > to the entire process group, meaning we'd also ungracefully terminate > any currently-running archive command, which does not seem like a good > idea for normal exit. Instead, the shutdown protocol is that *after* > the bgwriter has quit, we send SIGUSR1 (ie, the normal archiver wakeup > signal) to the archiver. This ensures that it will do a normal > archiving cycle after the last possible creation of WAL entries. Hmm, so the postmaster is gone during the last archiving cycle? What about syslogger? Is the archiver able to log stuff in the last cycle? The comment in line 2180 seems a bit bogus ...? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Hmm, so the postmaster is gone during the last archiving cycle? What > about syslogger? Is the archiver able to log stuff in the last cycle? The logger is no problem --- it quits when it sees EOF on its input pipe, which means that all upstream processes are gone. > The comment in line 2180 seems a bit bogus ...? Yeah, that could use a bit more work I guess, since "normal children" sounds like it would refer to more than just backends. regards, tom lane
I wrote: > One point needing discussion is that the postmaster is currently > coded not to send SIGUSR1 to the archiver if a fast-mode shutdown > is under way. I duplicated that in the added SIGUSR1 signal here, > but I wonder whether it is sane or not. Comments? After chewing on that for awhile, I decided it was bogus. If we are going to have a policy that the archiver gets a chance to archive everything, that shouldn't depend on fast vs. smart shutdown; those alternatives determine whether we kick clients out ungracefully, not whether we take extra risks with committed data. I think we should allow the archiver to finish out its tasks fully in all non-crash cases except one: if we got SIGTERM from init. In that case there's a very great risk of being SIGKILL'd before we can finish archiving. The postmaster cannot easily tell whether its SIGTERM came from init or not, but we can drive this off the archiver itself getting SIGTERM'd. I propose that if the archiver receives SIGTERM, it should cease to issue any new archive commands, but just wait till it sees the postmaster exit. (It can't exit right away, since there's a race condition: the postmaster might not have been SIGTERM'd yet, and might therefore spawn a new archiver, which would have no idea it's unsafe to do anything more.) There's an obvious failure mode in that, which is that a randomly issued SIGTERM to the archiver would shut down archiving indefinitely. We can guard against that with a timeout: the archiver should exit a minute or two after being SIGTERM'd, even if the postmaster is still there. That should certainly be enough delay to avoid the race condition, and if in fact everything is still hunky-dory the postmaster will immediately spawn a new archiver. Hence, attached revised patch ... regards, tom lane
Attachment
Tom Lane wrote: > Hence, attached revised patch ... Looks good. Something I'm still wondering is about the archiver/logger combination. What happens if a postmaster is stopped by the user and the archiver is still running, and the user starts a new postmaster? This would launch a new archiver and logger; and there are now two loggers possibly writing to the same files, and truncated log lines could occur. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Something I'm still wondering is about the archiver/logger combination. > What happens if a postmaster is stopped by the user and the archiver is > still running, and the user starts a new postmaster? This would launch > a new archiver and logger; and there are now two loggers possibly > writing to the same files, and truncated log lines could occur. I'm not nearly as worried about that as I am about the prospect of two concurrent archivers :-( There was discussion of having a lock file for the archiver, but it's still an open issue. I'm not sure how to solve the problem of stale lockfiles --- unlike the postmaster, the archiver can't assume that it's the only live process with the postgres userid. For example, after a system crash-and-restart, it's entirely likely that the PID formerly held by the archiver is now held by the bgwriter, making the lockfile (if any) look live. Maybe we should go back to the plan of having the postmaster wait for the archiver to exit. regards, tom lane
Tom Lane wrote: > There was discussion of having a lock file for the archiver, but > it's still an open issue. I'm not sure how to solve the problem > of stale lockfiles --- unlike the postmaster, the archiver can't > assume that it's the only live process with the postgres userid. > For example, after a system crash-and-restart, it's entirely > likely that the PID formerly held by the archiver is now held > by the bgwriter, making the lockfile (if any) look live. Ah, right :-( > Maybe we should go back to the plan of having the postmaster > wait for the archiver to exit. Yeah, that seems the safest to me -- the problem is that it complicates the shutdown sequence a fair bit, because postmaster must act differently depending on whether archiving is enabled or not: wait for bgwriter exit if disabled, or for archiver exit otherwise. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> Maybe we should go back to the plan of having the postmaster >> wait for the archiver to exit. > Yeah, that seems the safest to me -- the problem is that it complicates > the shutdown sequence a fair bit, because postmaster must act > differently depending on whether archiving is enabled or not: wait for > bgwriter exit if disabled, or for archiver exit otherwise. Given the recent changes to make the postmaster act as a state machine, I don't think this is really a big deal --- it's just one more state. The bigger part is that the archiver can't wait for postmaster exit. We'll need a proper shutdown signal for the archiver, but since it's not using SIGUSR2 that can be commandeered easily. So it'd be like SIGUSR1 -> do an archive cycle SIGUSR2 -> do an archive cycle and exit no postmaster -> just exit The rationale for the last is that it's a crash situation, and furthermore there's a risk of someone starting a new postmaster and a conflicting archiver. So we should put back the behavior my last patch removed of aborting archiving immediately on postmaster death. I'll respin my patch this way... regards, tom lane
On Thu, 2008-01-10 at 10:13 -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Something I'm still wondering is about the archiver/logger combination. > > What happens if a postmaster is stopped by the user and the archiver is > > still running, and the user starts a new postmaster? This would launch > > a new archiver and logger; and there are now two loggers possibly > > writing to the same files, and truncated log lines could occur. > > I'm not nearly as worried about that as I am about the prospect of > two concurrent archivers :-( How strange, I was just looking at that particular possibility when your mail arrived. The earlier patch looks good, but I see you've reverted the PostmasterIsAlive() check in pgarch_ArchiverCopyLoop(). That was put in to prevent multiple archivers. Can we keep that? -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Thu, 2008-01-10 at 12:28 -0300, Alvaro Herrera wrote: > Yeah, that seems the safest to me -- the problem is that it complicates > the shutdown sequence a fair bit, because postmaster must act > differently depending on whether archiving is enabled or not: wait for > bgwriter exit if disabled, or for archiver exit otherwise. If there's nothing to write, archiver will exit quickly, so in most cases they will look the same. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Thu, 2008-01-10 at 10:49 -0500, Tom Lane wrote: > So we should put back the behavior > my last patch removed of aborting archiving immediately on > postmaster death. Excellent -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
I wrote: > I'll respin my patch this way... Third time's the charm? regards, tom lane
Attachment
On Thu, 2008-01-10 at 14:08 -0500, Tom Lane wrote: > I wrote: > > I'll respin my patch this way... > > Third time's the charm? If we do a shutdown immediate on the postmaster *after* the bgwriter has written a shutdown checkpoint, do we have any record that there was a panic stop? Do we enter recovery in that case? I think the answers are yes and no, but just checking. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Simon Riggs <simon@2ndquadrant.com> writes: > If we do a shutdown immediate on the postmaster *after* the bgwriter has > written a shutdown checkpoint, do we have any record that there was a > panic stop? Do we enter recovery in that case? I think the answers are > yes and no, but just checking. Yeah, the postmaster would complain at exit, but the pg_control state is already SHUTDOWN at that point. There'd be a log entry showing abnormal archiver exit if the panic had had any actual effect on it. regards, tom lane