Re: Sync Rep for 2011CF1 - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Sync Rep for 2011CF1
Date
Msg-id AANLkTikG8WMhOocX9AYsRHYPc-PgxPaG6miFDD9QH3i1@mail.gmail.com
In response to Re: Sync Rep for 2011CF1  (Aidan Van Dyk <aidan@highrise.ca>)
Responses Re: Sync Rep for 2011CF1  (Aidan Van Dyk <aidan@highrise.ca>)
Re: Sync Rep for 2011CF1  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Fri, Jan 21, 2011 at 1:09 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
> On Fri, Jan 21, 2011 at 1:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Fri, Jan 21, 2011 at 12:23 PM, Aidan Van Dyk <aidan@highrise.ca> wrote:
>>>> When no sync slave is connected, yes, I want to stop things hard.
>>
>>> What you're proposing is to fail things earlier than absolutely
>>> necessary (when they try to XLOG, rather than at commit) but still
>>> later than what I think Simon is proposing (not even letting them log
>>> in).
>>
>> I can't see a reason to disallow login, because read-only transactions
>> can still run in such a situation --- and, indeed, might be fairly
>> essential if you need to inspect the database state on the way to fixing
>> the replication problem.  (Of course, we've already had the discussion
>> about it being a terrible idea to configure replication from inside the
>> database, but that doesn't mean there might not be views or status you
>> would wish to look at.)
>
> And just disallowing new logins is probably not even enough, because
> it allows current logged in clients "forward progress", leading
> towards an eventual hang (with now committed data on the master).
>
> Again, I'm trying to stop "forward progress" as soon as possible when
> a sync slave isn't replicating.  And I'd like clients to fail with
> errors sooner (hopefully they get to the commit point) rather than
> accumulate the WAL synced to the master and just wait at the commit.
>
> So I think that's a more complete picture of my quick "not do anything
> with no synchronous slave replicating" that I think was what led to
> the no-login approach.

Well, stopping all WAL activity with an error sounds *more* reasonable
than refusing all logins, but I'm not personally sold on it.  For
example, a brief network disruption on the connection between master
and standby would cause the master to grind to a halt... and then
almost immediately resume operations.  More generally, if you have
short-running transactions, there's not much difference between
wait-at-commit and wait-at-WAL, and if you have long-running
transactions, then wait-at-WAL might be gumming up the works more than
necessary.

One idea might be to wait both before and after commit.  If
allow_standalone_primary is off, and a commit is attempted, we check
whether there's a slave connected, and if not, wait for one to
connect.  Then, we write and sync the commit WAL record.  Next, we
wait for the WAL to be ack'd.  Of course, the standby might disappear
between the first check and the second, but it would greatly reduce
the possibility of the master being ahead of the standby after a
crash, which might be useful for some people.
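
To make the ordering concrete, here is a rough sketch of the
wait-before-and-after-commit idea as a small simulation.  It is not
PostgreSQL code: FakeStandby, commit(), max_polls and the event strings
are all made up for illustration, and I'm assuming that with
allow_standalone_primary on, both waits are simply skipped.

```python
from typing import List


class FakeStandby:
    """Hypothetical stand-in for a sync standby.

    `states` scripts what each successive connectivity poll returns;
    once only one state is left, it is repeated forever.
    """

    def __init__(self, states: List[bool]) -> None:
        if not states:
            raise ValueError("need at least one scripted state")
        self._states = list(states)

    def connected(self) -> bool:
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]


def commit(standby: FakeStandby, allow_standalone_primary: bool,
           max_polls: int = 10) -> List[str]:
    """Simulate one commit and return the ordered list of events.

    max_polls is an assumption that keeps the simulation finite; the
    proposal itself describes an unbounded wait.
    """
    events: List[str] = []

    def wait_until_connected(label: str) -> bool:
        # Poll up to max_polls times, logging each unsuccessful poll.
        for _ in range(max_polls):
            if standby.connected():
                return True
            events.append(label)
        return False

    # Step 1: before writing anything, make sure a standby is there.
    if not allow_standalone_primary:
        if not wait_until_connected("pre-commit: waiting for standby"):
            events.append("gave up before writing commit record")
            return events

    # Step 2: write and sync the commit WAL record locally.
    events.append("commit record written and synced")

    # Step 3: wait for the standby to acknowledge that WAL.
    if not allow_standalone_primary:
        if not wait_until_connected("post-commit: waiting for ack"):
            # The standby vanished between the two checks.
            events.append("commit is local only: primary ahead of standby")
            return events
        events.append("standby ack received")

    return events
```

For example, commit(FakeStandby([True, False]), False, max_polls=2)
passes the first check, writes the commit record, and then never gets an
ack, which is the remaining window mentioned above.  With
FakeStandby([False]) the commit record is never written at all, so the
master cannot get ahead of the standby in that case.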

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

