From: Robert Haas
Subject: Re: Sync Rep and shutdown Re: Sync Rep v19
Msg-id: AANLkTi=+0mNL289VGDdT4dKCYWd572toA3-6KyrptXVx@mail.gmail.com
In response to: Re: Sync Rep and shutdown Re: Sync Rep v19 (Fujii Masao <masao.fujii@gmail.com>)
List: pgsql-hackers
On Thu, Mar 17, 2011 at 2:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> This occurs to me: shouldn't we ensure that, in the shutdown case, walwriter
> exits only after all the backends have gone? I'm not sure whether it's worth
> worrying about this case, but what if synchronous_standby_names is unset
> and the config file is reloaded after a smart shutdown has been requested? In
> that case, the reload cannot wake up the waiting backends, since walwriter
> has already exited. This behavior looks a bit inconsistent.

I agree we need to fix smart shutdown.  I think that's going to have
to be a separate patch, however; it's got more problems than this
patch can fix without expanding into a monster.

> In the patch, in order to read the latest value, you take a light-weight lock.
> But I wonder how taking a lock ensures that the value is up-to-date.

A lock acquisition acts as a memory barrier: once we've acquired the lock,
we're guaranteed to see the most recent value written by any process that
updated it while holding the same lock.
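
To spell that out a bit, the pattern is roughly the following (just a
sketch, assuming the flag lives in WalSndCtl->sync_standbys_defined as in
the current patch):

    bool    sync_standbys_defined;

    /*
     * LWLockAcquire() acts as a memory barrier, so once we hold the lock
     * we are guaranteed to see the most recent value the WAL writer
     * stored while holding the same lock.
     */
    LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
    sync_standbys_defined = WalSndCtl->sync_standbys_defined;
    if (!sync_standbys_defined)
    {
        /* Nothing to wait for, so don't join the wait queue. */
        LWLockRelease(SyncRepLock);
        return;
    }
    /* ...otherwise, add ourselves to the queue before releasing the lock... */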

> + * WAL writer calls this as needed to update the shared sync_standbys_needed
>
> Typo: s/sync_standbys_needed/sync_standbys_defined

Fixed.

> + * we exit normally, or SYNC_REP_MUST_DISCONNECT in abnormal cases.
>
> Typo: the reference to SYNC_REP_MUST_DISCONNECT is not required.

Fixed.

> +                * So in this case we issue a NOTICE (which some clients may
>
> Typo: s/NOTICE/WARNING

Fixed.

> +               if (ProcDiePending)
> +               {
> +                       ereport(WARNING,
> +                                       (errcode(ERRCODE_ADMIN_SHUTDOWN),
> +                                        errmsg("canceling the wait for replication and terminating connection due to administrator command"),
> +                                        errdetail("The transaction has already been committed locally but might have not been replicated to the standby.")));
> +                       whereToSendOutput = DestNone;
> +                       LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
> +                       MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
> +                       SHMQueueDelete(&(MyProc->syncRepLinks));
>
> SHMQueueIsDetached() should be checked before calling SHMQueueDelete().
> Walsender may already have removed the backend from the queue before we reach here.

Fixed.  But come to think of it, doesn't this mean
SyncRepCleanupAtProcExit() needs to repeat the test after acquiring
the lock?
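
Concretely, the pattern I've got now is roughly this (a sketch rather than
the exact patch text; the same dance applies to the query-cancel case
below):

    LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
    /* Walsender may already have removed us from the queue. */
    if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
        SHMQueueDelete(&(MyProc->syncRepLinks));
    MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
    LWLockRelease(SyncRepLock);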

> +               if (QueryCancelPending)
> +               {
> +                       QueryCancelPending = false;
> +                       ereport(WARNING,
> +                                       (errmsg("canceling wait for synchronous replication due to user request"),
> +                                        errdetail("The transaction has committed locally, but may not have replicated to the standby.")));
> +                       LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
> +                       MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
> +                       SHMQueueDelete(&(MyProc->syncRepLinks));
> +                       LWLockRelease(SyncRepLock);
>
> Same as above.

Fixed.

> +               if (!PostmasterIsAlive(true))
> +               {
> +                       whereToSendOutput = DestNone;
> +                       proc_exit(1);
>
> proc_exit() should not be called at that point because it leads to a PANIC.

Fixed, although I'm not sure it matters.
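
What the fix amounts to is just bailing out of the wait rather than
exiting on the spot, along these lines (again just a sketch of the
direction, not the exact patch text):

    if (!PostmasterIsAlive(true))
    {
        /*
         * Don't proc_exit() here; arrange to die once it's safe, stop
         * sending output, and fall out of the wait loop.  Cleaning up our
         * queue entry is the same dance as in the die/cancel cases above.
         */
        ProcDiePending = true;
        whereToSendOutput = DestNone;
        break;
    }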

> I think that it's better to check ProcDiePending, QueryCancelPending
> and PostmasterIsAlive *before* waiting on the latch, not after, because
> those events can occur before we reach that point, and it's not worth
> waiting 60 seconds just to detect them.

Not necessary.  Inspired by one of your earlier patches, I made die()
and StatementCancelHandler() set the latch.  We could still wait up to
60 s to detect postmaster death, but that's a very rare situation and
it's not worth slowing down the common case for it, especially since
there is a race condition no matter what.
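
For reference, the handler change amounts to roughly this (a sketch; I'm
assuming the per-backend latch the patch adds to PGPROC is called
waitLatch):

    /* At the end of die() and StatementCancelHandler(): */
    if (MyProc != NULL)
        SetLatch(&MyProc->waitLatch);   /* SetLatch() is safe in a signal handler */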

Thanks for the review!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
