Thread: Something is fairly whacko about shutdown in CVS HEAD

Something is fairly whacko about shutdown in CVS HEAD

From
Tom Lane
Date:
I'm seeing two sets of shutdown messages, and apparently a second
shutdown checkpoint being forced, during a normal database stop:

2007-06-30 14:21:00 EDT 9644 LOG:  received smart shutdown request
2007-06-30 14:21:00 EDT 9647 LOG:  shutting down
2007-06-30 14:21:00 EDT 9647 LOG:  checkpoint starting: shutdown immediate
2007-06-30 14:21:00 EDT 9647 LOG:  checkpoint complete: wrote 23 buffers (0.6%); 0 transaction log file(s) added, 0
removed,0 recycled; write=0.003 s, sync=0.178 s, total=0.237 s
 
2007-06-30 14:21:00 EDT 9647 LOG:  database system is shut down
2007-06-30 14:21:00 EDT 9644 LOG:  background writer process (PID 9647) exited with exit code 0
2007-06-30 14:21:00 EDT 9644 LOG:  terminating any other active server processes
2007-06-30 14:21:00 EDT 9644 LOG:  all server processes terminated; reinitializing
2007-06-30 14:21:00 EDT 9654 LOG:  database system was shut down at 2007-06-30 14:21:00 EDT
2007-06-30 14:21:01 EDT 9655 LOG:  shutting down
2007-06-30 14:21:01 EDT 9655 LOG:  checkpoint starting: shutdown immediate
2007-06-30 14:21:01 EDT 9655 LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0
removed,0 recycled; write=0.000 s, sync=0.000 s, total=0.041 s
 
2007-06-30 14:21:01 EDT 9655 LOG:  database system is shut down

I haven't looked for the cause yet but I suspect it's got something to
do with this patch:
http://archives.postgresql.org/pgsql-committers/2007-06/msg00285.php
Either that or somebody else broke it recently.
        regards, tom lane


Re: Something is fairly whacko about shutdown in CVS HEAD

From
Alvaro Herrera
Date:
Tom Lane wrote:
> I'm seeing two sets of shutdown messages, and apparently a second
> shutdown checkpoint being forced, during a normal database stop:
> 
> 2007-06-30 14:21:00 EDT 9644 LOG:  received smart shutdown request
> 2007-06-30 14:21:00 EDT 9647 LOG:  shutting down
> 2007-06-30 14:21:00 EDT 9647 LOG:  checkpoint starting: shutdown immediate
> 2007-06-30 14:21:00 EDT 9647 LOG:  checkpoint complete: wrote 23 buffers (0.6%); 0 transaction log file(s) added, 0
removed,0 recycled; write=0.003 s, sync=0.178 s, total=0.237 s
 
> 2007-06-30 14:21:00 EDT 9647 LOG:  database system is shut down
> 2007-06-30 14:21:00 EDT 9644 LOG:  background writer process (PID 9647) exited with exit code 0
> 2007-06-30 14:21:00 EDT 9644 LOG:  terminating any other active server processes
> 2007-06-30 14:21:00 EDT 9644 LOG:  all server processes terminated; reinitializing
> 2007-06-30 14:21:00 EDT 9654 LOG:  database system was shut down at 2007-06-30 14:21:00 EDT
> 2007-06-30 14:21:01 EDT 9655 LOG:  shutting down
> 2007-06-30 14:21:01 EDT 9655 LOG:  checkpoint starting: shutdown immediate
> 2007-06-30 14:21:01 EDT 9655 LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0
removed,0 recycled; write=0.000 s, sync=0.000 s, total=0.041 s
 
> 2007-06-30 14:21:01 EDT 9655 LOG:  database system is shut down
> 
> I haven't looked for the cause yet but I suspect it's got something to
> do with this patch:
> http://archives.postgresql.org/pgsql-committers/2007-06/msg00285.php
> Either that or somebody else broke it recently.

Huh, I can't reproduce it here.

2007-06-30 20:51:27 CLT 23221 LOG:  received smart shutdown request
2007-06-30 20:51:27 CLT 23223 LOG:  shutting down
2007-06-30 20:51:27 CLT 23225 LOG:  autovacuum launcher shutting down
2007-06-30 20:51:27 CLT 23223 LOG:  checkpoint starting: shutdown immediate
2007-06-30 20:51:27 CLT 23223 LOG:  checkpoint complete: wrote 474 buffers (15.4%); 0 transaction log file(s) added, 0
removed,1 recycled; write=0.016 s, sync=0.084 s, total=0.124 s
 
2007-06-30 20:51:27 CLT 23223 LOG:  database system is shut down

In your test output, it looks like the bgwriter is exiting early with code 0
which makes postmaster upset, so it restarts it.  Or maybe bgwriter
exits as expected but the postmaster is denying reality.

-- 
Alvaro Herrera                 http://www.amazon.com/gp/registry/DXLWNGRJD34J
"Investigación es lo que hago cuando no sé lo que estoy haciendo"
(Wernher von Braun)


Re: Something is fairly whacko about shutdown in CVS HEAD

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I'm seeing two sets of shutdown messages, and apparently a second
>> shutdown checkpoint being forced, during a normal database stop:

> Huh, I can't reproduce it here.

It looks to me like this is a race condition induced by the
autovacuum-launcher patches.  Observe the following chunk of
postmaster.c, which responds to exit of the bgwriter child:
       /*        * Was it the bgwriter?        */       if (BgWriterPID != 0 && pid == BgWriterPID)       {
BgWriterPID= 0;           if (EXIT_STATUS_0(exitstatus) &&               Shutdown > NoShutdown && !FatalError &&
      !DLGetHead(BackendList) && AutoVacPID == 0)           {               /*                * Normal postmaster exit
ishere: we've seen normal exit of                * the bgwriter after it's been told to shut down. We expect
   * that it wrote a shutdown checkpoint.  (If for some reason                * it didn't, recovery will occur on next
postmasterstart.)                *                * Note: we do not wait around for exit of the archiver or
  * stats processes.  They've been sent SIGQUIT by this point,                * and in any case contain logic to commit
hara-kiriif they                * notice the postmaster is gone.                */               ExitPostmaster(0);
     }
 
           /*            * Any unexpected exit of the bgwriter (including FATAL exit)            * is treated as a
crash.           */           HandleChildCrash(pid, exitstatus,                            _("background writer
process"));

If AutoVacPID is still nonzero when bgwriter exit is detected,
then we think we've seen a crash.  I'm not clear why it happens
reliably for me and not for you, but this is certainly a bug.

To resolve this I think we need a clearer definition of the autovac
launcher's role in life.  I see that it is attached to shared memory;
is it supposed to be able to execute transactions or otherwise do
anything the bgwriter might have to clean up after?  If so we need
to fix things so that we don't tell the bgwriter to exit until after
the launcher is gone.  If not, we could possibly allow these things
to happen asynchronously, though I wonder whether it wouldn't be best
to force the ordering anyway.
        regards, tom lane


Re: Something is fairly whacko about shutdown in CVS HEAD

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> I'm seeing two sets of shutdown messages, and apparently a second
> >> shutdown checkpoint being forced, during a normal database stop:
> 
> > Huh, I can't reproduce it here.
> 
> It looks to me like this is a race condition induced by the
> autovacuum-launcher patches.  Observe the following chunk of
> postmaster.c, which responds to exit of the bgwriter child:

> If AutoVacPID is still nonzero when bgwriter exit is detected,
> then we think we've seen a crash.  I'm not clear why it happens
> reliably for me and not for you, but this is certainly a bug.

Oops, you are right.  So the problem predates the latest autovac patch;
it goes all the way back to when the launcher was introduced.  I can't
imagine the reason, but I haven't ever seen the double shutdown
checkpoint (not that I've been looking too closely, but I almost always
run postmasters with stderr to a console in front of my eyes when I'm
testing autovac).

> To resolve this I think we need a clearer definition of the autovac
> launcher's role in life.  I see that it is attached to shared memory;
> is it supposed to be able to execute transactions or otherwise do
> anything the bgwriter might have to clean up after?  If so we need
> to fix things so that we don't tell the bgwriter to exit until after
> the launcher is gone.  If not, we could possibly allow these things
> to happen asynchronously, though I wonder whether it wouldn't be best
> to force the ordering anyway.

No; the launcher is attached to shared memory, but it doesn't connect to
databases and it doesn't execute transactions either.  In fact, the
launcher cannot do anything when the postmaster is not running (or when
it is not in a state where it doesn't want to start processes, anyway),
because it (the launcher) is only capable of sending signals to
postmaster (apart from that, it ocasionally grabs lwlocks, reads the
pg_database flat file, pgstats, and the Xid counter in shared memory).
It doesn't write anything.  I think it is perfectly acceptable to have
the launcher shut down in parallel with bgwriter.

-- 
Alvaro Herrera                  http://www.amazon.com/gp/registry/5ZYLFMCVHXC
"Amanece.                                               (Ignacio Reyes)El Cerro San Cristóbal me mira, cínicamente, con
ojosde virgen"
 


Re: Something is fairly whacko about shutdown in CVS HEAD

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> To resolve this I think we need a clearer definition of the autovac
>> launcher's role in life.  I see that it is attached to shared memory;
>> is it supposed to be able to execute transactions or otherwise do
>> anything the bgwriter might have to clean up after?

> No; the launcher is attached to shared memory, but it doesn't connect to
> databases and it doesn't execute transactions either.  In fact, the
> launcher cannot do anything when the postmaster is not running (or when
> it is not in a state where it doesn't want to start processes, anyway),
> because it (the launcher) is only capable of sending signals to
> postmaster (apart from that, it ocasionally grabs lwlocks, reads the
> pg_database flat file, pgstats, and the Xid counter in shared memory).
> It doesn't write anything.  I think it is perfectly acceptable to have
> the launcher shut down in parallel with bgwriter.

[ after sleeping on it... ]  That's probably okay as far as bgwriter
goes, but I think it's a bad idea for anything connected to shared
memory to outlive the postmaster.  To do so opens risks that some
platforms might treat the postmaster's shmctl(IPC_RMID) differently
than others.  So I want the postmaster to not exit until it's seen
the launcher quit.  It's quite easy to make this happen if we treat
the launcher more like a backend, and don't start bgwriter shutdown
till it's gone.  (I have a patch I've been testing that does it that
way.)  Allowing the two to shut down in parallel seems more complex,
though if you see a reasonable way to make it work, propose a patch.
        regards, tom lane