Re: [HACKERS] WIP: Failover Slots - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: [HACKERS] WIP: Failover Slots
Date
Msg-id CAMsr+YEupkmQR2zogMqySeskbNv24AhymWMQHbguJzQxCvZrow@mail.gmail.com
In response to Re: [HACKERS] WIP: Failover Slots  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 12 August 2017 at 08:03, Andres Freund <andres@anarazel.de> wrote:
On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core.  In a review of this
> thread I don't see anyone saying otherwise.  The debate has really
> been about the right way of implementing that.

Given that I presumably was one of the people pushing back more
strongly: I agree with that.  Besides disagreeing with the proposed
implementation our disagreements solely seem to have been about
prioritization.

I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to
submitted/merged to core.

How could it be incompatible? The idea here is to make physical failover transparent to logical decoding clients. That's not meant to sound confrontational; I genuinely can't see how it would conflict, and I'd welcome your thoughts on where it might.

I understand that it might be *different*, and that you'd like to see more closely aligned approaches that work more similarly. For that we first need to know more clearly how logical failover will look. But it's hard not to also see this as delaying and blocking until your preferred approach via pure logical rep and logical failover gets in, at which point physical failover can be dismissed with "we don't need that anymore". I'm sure that's not your intent; I just struggle not to read it that way when there's always another reason not to solve this problem because of a loosely related development effort on a different one.

I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axis I can think of are:

- How do we want to deal with cascaded setups, do slots have to be
  available everywhere, or not?

Personally, I don't care either way.
 
- What kind of PITR integration do we want? Note that simple WAL based
  slots do *NOT* provide proper PITR support, there's not enough
  interlock easily available (you'd have to save slots at the end, then
  increment minRecoveryLSN to a point later than the slot saving)

Interesting. I haven't fully understood this, but think I see what you're getting at.

As outlined in the prior mail, I'd like to have working PITR with logical slots, but I think it's pretty niche: it can't work usefully without plenty of co-operation from the rest of the logical replication software in use, since you can't just restore and resume normal operations. So I don't think it's worth making it a priority.

It's possible to make PITR safe with slots by blocking further advance of catalog_xmin on the running master for the life of the PITR base backup using a slot for retention. There's plenty of room for operator error until/unless we add something like catalog_xmin advance xlog'ing, but it can be done now with external tools if you're careful. Details in the prior mail.
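For illustration, here's a minimal external-tool sketch of that approach in Python with psycopg2. The DSN, slot name, and choice of the test_decoding plugin are placeholders, and the real procedure still needs the care described in the prior mail:

    # Sketch only: hold back catalog_xmin on the running master for the life of
    # a PITR base backup by keeping an otherwise-unused logical slot around.
    # psycopg2, the DSN, slot name and test_decoding plugin are illustrative.
    import psycopg2

    MASTER_DSN = "host=master dbname=appdb"      # hypothetical
    RETENTION_SLOT = "pitr_retention_20170814"   # hypothetical

    def pin_catalog_xmin():
        """Create the retention slot immediately before taking the base backup.

        While this slot exists and is never advanced, the master cannot advance
        catalog_xmin past the backup's starting point, so the catalog rows needed
        to resume decoding from any slot restored with the backup stay available."""
        conn = psycopg2.connect(MASTER_DSN)
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(
                "SELECT pg_create_logical_replication_slot(%s, 'test_decoding')",
                (RETENTION_SLOT,))
        conn.close()

    def release_catalog_xmin():
        """Drop the retention slot once the base backup is retired, i.e. once
        you will never PITR-restore from it again."""
        conn = psycopg2.connect(MASTER_DSN)
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute("SELECT pg_drop_replication_slot(%s)", (RETENTION_SLOT,))
        conn.close()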

I don't think PITR for logical slots is important given there's a workaround and it's not simple to actually do anything with it if you have it.
 
- How much divergence are we going to accept between logical decoding on
  standbys, and failover slots. I'm probably a lot closer to zero than
  Craig is.

They're different things to me, but I think you're asking "to what extent should failover slots functionality be implemented strictly on top of decoding on standby?"

"Failover slots" provides a mechanism by which a logical decoding client can expect a slot it creates on a master (or physical streaming replica doing decoding on standby) to continue to exist. The client can ignore physical HA and promotions of the master, which can continue to be managed using normal postgres tools. It's the same as, say, an XA transaction manager expecting that if your master dies and you fail over to a standby, the TM should't have to have been doing special housekeeping on the promotion candidate before promotion in order for 2PC to continue to work. It Just Works.

Logical decoding on standby is useful with or without failover slots, as you can use it to extract data from a replica; and now that decoding timeline following is in, a decoding connection on a replica will survive promotion to master.

But in addition to its main purpose of allowing logical decoding from a standby server to offload work, it can be used to implement client-managed support for failover to physical replicas. For this, the client must have an inventory of the master's promotion candidates and their connstrings so it can maintain slots on them too. The client must be able to connect to all promotion candidates and advance their slots via decoding along with the master slots it's actually replaying from. If a client isn't "told" about a promotion candidate, decoding will break when we fail over. If a client cannot connect to a promotion candidate, catalog_xmin on the master will be held back until the replica is discarded (and its physical slot dropped) or the client regains access. Every different logical decoding client application must implement all this logic and management separately.
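To make the burden concrete, here's a rough sketch of that per-client bookkeeping, assuming a hypothetical working decoding-on-standby. psycopg2, the connstrings, slot name, and the use of pg_logical_slot_get_changes to decode-and-discard are illustrative, not a recipe that works today:

    # Sketch of the bookkeeping every logical decoding client would have to
    # implement itself: keep a mirror slot on every known promotion candidate
    # and advance it in step with the slot actually being replayed on the master.
    # Assumes a hypothetical working decoding-on-standby; names are illustrative.
    import psycopg2

    SLOT = "myapp"                                # hypothetical
    CANDIDATE_DSNS = [                            # every promotion candidate the
        "host=standby1 dbname=appdb",             # client has been told about; any
        "host=standby2 dbname=appdb",             # it hasn't will break decoding
    ]                                             # when we fail over to it

    def ensure_candidate_slots():
        """Create the mirror slot on each promotion candidate if it's missing."""
        for dsn in CANDIDATE_DSNS:
            conn = psycopg2.connect(dsn)
            conn.autocommit = True
            with conn.cursor() as cur:
                cur.execute("SELECT 1 FROM pg_replication_slots WHERE slot_name = %s",
                            (SLOT,))
                if cur.fetchone() is None:
                    cur.execute(
                        "SELECT pg_create_logical_replication_slot(%s, 'test_decoding')",
                        (SLOT,))
            conn.close()

    def advance_candidate_slots(flushed_lsn):
        """After confirming flush of flushed_lsn on the master, decode-and-discard
        up to the same LSN on each candidate so its mirror slot keeps pace.
        An unreachable candidate holds back resource cleanup until it's dropped
        or reachable again."""
        for dsn in CANDIDATE_DSNS:
            conn = psycopg2.connect(dsn)
            conn.autocommit = True
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT count(*) FROM pg_logical_slot_get_changes(%s, %s::pg_lsn, NULL)",
                    (SLOT, flushed_lsn))
            conn.close()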

It may be possible to implement failover-slots-like functionality based on decoding on standby in an app-transparent way, by having the replica monitor slot states on the master and self-advance its own slots via a loopback decoding connection. Or the master could maintain an inventory of replicas and make decoding connections to them where it advances their slots after the master's slots are advanced by an app. But either way, why would we want to do this? Why actually decode WAL and use the logical decoding machinery when we *know* the state of the system, because only the master is writeable?

The way I see it, to provide failover slots functionality we'd land up with something quite similar to what Robert and I just discussed, but the slot advance would be implemented using decoding (on standby) instead of directly setting slot state. What benefit does that offer?

I don't want to block failover slots on decoding on standby just because decoding on standby would be nice to have.
 
- How much divergence are we going to accept between infrastructure for
  logical failover, and logical failover via failover slots (or however
  we're naming this)? Again, I'm probably a lot closer to zero than
  Craig is.


We don't have logical failover, let alone mature, tested logical failover that covers most of Pg's available functionality. Nor much of a design for it AFAIK. There is no logical failover to diverge from, and I don't want to block physical failover support on that.

But, putting that aside to look at the details of how logical failover might work, what sort of commonality do you expect to see? Physical failover is by WAL replication using archive recovery/streaming, managed via recovery.conf, with unilateral promotion by trigger file/command. The admin is expected to ensure that any clients and cascading replicas get redirected to the promoted node and the old one is fenced - and we don't care if that's done by IP redirection or connstring updates or what. Per the proposal Robert and I discussed, logical slots will be managed by having the walsender/walreceiver exchange slot state information that cascades up/down the replication tree via mirror slot creations.
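As a point of reference, the per-slot state such an exchange would need to carry is roughly what pg_replication_slots already exposes. A hedged sketch of capturing it externally follows; note there's no SQL-level way to apply that state on a standby, which is exactly the gap this proposal fills:

    # Sketch: the per-logical-slot state the proposed walsender/walreceiver
    # exchange would carry down the tree is roughly what an external tool can
    # already read from pg_replication_slots on the master.  Assumes psycopg2;
    # applying this state on a standby has no SQL-level interface today.
    import psycopg2

    def snapshot_logical_slot_state(master_dsn):
        """Return, per logical slot, the fields a downstream mirror slot needs."""
        conn = psycopg2.connect(master_dsn)
        try:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT slot_name, plugin, database,
                           restart_lsn, confirmed_flush_lsn, catalog_xmin
                      FROM pg_replication_slots
                     WHERE slot_type = 'logical'
                """)
                cols = [d[0] for d in cur.description]
                return [dict(zip(cols, row)) for row in cur.fetchall()]
        finally:
            conn.close()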

How's logical replica promotion going to work? Here's one possible way, of many: the promotion-candidate logical replica consumes an unfiltered xact stream that contains changes from all nodes, not just its immediate upstream. Downstreams of the master can maintain direct connections to the promotion candidate and manage their own slots there directly, sending flush confirmations for those slots as the candidate's decoding sessions reach commits whose LSNs the clients have already confirmed as flushed to the master. On promotion, the master's downstreams would be reconfigured to connect to the node-id of the newly promoted master and would begin decoding from it in catchup mode, where they receive the commits from the old master via the new master until they reach the new master's end-of-WAL at the time of promotion. With some tweaks, like a logical WAL message recording the moment of promotion, it's not that different to the client-managed physical failover model.

It can also be converted to a more transparent failover-slots like model by having the promotion candidate physical replica clone slots from its upstream, but advance them by loopback decoding - not necessarily actual network loopback. It'd use a filter that discards data and only sees the commit XIDs + LSNs. It'd send confirmations on the slots when the local slot processed a commit for which the upstream's copy of the slot had a confirmation for that lsn. On promotion, replicas would connect with new replorigins (0) and let decoding start at the slot positions on the replica. The master->replica slot state reporting can be done via the walsender too, just as proposed for the physical case, though no replica->master reporting would be needed for logical failover.

So despite my initial expectations they can be moderately similar in broad structure. But I don't think there's going to be much actual code overlap beyond minor things, like both wanting a way to query slot state on the upstream. Both *could* use decoding on standby to advance slot positions, but for the physical case that's just a slower (and unfinished) way to do what we already have, whereas it's necessary for logical failover.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
