Re: Timeline following for logical slots - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: Timeline following for logical slots
Date
Msg-id CAMsr+YF1Wi=8hAryUM1Sn=5tW64QvWav4quP1k14-M4EKTHNRQ@mail.gmail.com
In response to Re: Timeline following for logical slots  (Andres Freund <andres@anarazel.de>)
Responses Re: Timeline following for logical slots  (Andres Freund <andres@anarazel.de>)
Re: Timeline following for logical slots  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 4 April 2016 at 18:01, Andres Freund <andres@anarazel.de> wrote:
 
> The only way I can think of to do that really reliably right now, without
> full failover slots, is to use the newly committed pluggable WAL mechanism
> and add a hook to SaveSlotToPath() so slot info can be captured, injected
> in WAL, and replayed on the replica.

I personally think the primary answer is to use separate slots on
different machines. Failover slots can be an extension to that at some
point, but I think they're a secondary goal.

Assuming that here you mean separate slots on different machines replicating via physical rep:

We don't currently allow the creation of a logical slot on a standby. Nor replay from it, even to advance it without receiving the decoded changes. Both would be required for that to work, as well as extensions to the hot standby feedback mechanism to allow a standby to ask the master to pin its catalog_xmin if slots on the standby were further behind than those of the master.

I was chatting about that with Petr earlier. What we came up with was to require the standby to connect to the master using a replication slot that, while remaining a physical replication slot, has a catalog_xmin set and updated by the replica using extended standby progress messages. The catalog_xmin the replica pushes up to the master for that slot would simply be the min(catalog_xmin) of all slots on the replica, i.e. procArray->replication_slot_catalog_xmin. Same with the slot xmin, if defined for any slot on the replica.

That makes sure that the catalog_xmin required for the standby's slots is preserved even if the standby isn't currently replaying from the master.
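To make the "min(catalog_xmin) of all slots on the replica" step concrete, here is a minimal sketch of that aggregation. It is a toy model in Python, not PostgreSQL source: the function names are hypothetical, 0 stands in for InvalidTransactionId, and the comparison assumes normal 32-bit xids that can wrap around.

```python
# Illustrative model only; names are hypothetical, not PostgreSQL APIs.
INVALID_XID = 0  # stands in for InvalidTransactionId: "slot needs no catalog_xmin"


def xid_precedes(a, b):
    """Wraparound-aware 'a is older than b' for normal 32-bit xids."""
    diff = (a - b) & 0xFFFFFFFF
    # Equivalent to "the signed 32-bit difference is negative".
    return diff >= 0x80000000


def oldest_catalog_xmin(slot_catalog_xmins):
    """Oldest valid catalog_xmin across the replica's slots.

    Returns INVALID_XID when no slot requires catalog rows to be kept.
    """
    oldest = INVALID_XID
    for xmin in slot_catalog_xmins:
        if xmin == INVALID_XID:
            continue  # this slot does not hold back catalog cleanup
        if oldest == INVALID_XID or xid_precedes(xmin, oldest):
            oldest = xmin
    return oldest
```

The value returned is what the replica would report upstream in its progress message, so the master's physical slot pins catalog rows for the replica's most-behind logical slot.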

Handily this approach would give us cascading, support for intermediate servers, and the option of only having failover slots on some replicas not others. All things that were raised as concerns with failover slots.

However, clients would then have to know about the replica(s) of the master that were failover candidates and would have to send feedback to advance the client's slots on those nodes, not just the master. They'd have to be able to connect to the replicas too. Unless we added some mechanism for the master to lazily relay those feedback messages to replicas, anyway. Not a major roadblock, just a bit fiddlier for clients.

Consistency shouldn't be a problem so long as the slot created on the replica reaches SNAPBUILD_CONSISTENT (or there's enough pending WAL for it to do so) before failover is required.

I think it'd be a somewhat reasonable alternative to failover slots and it'd make it much more practical to decode from a replica. Which would be great. It'd be fiddlier for clients, but probably worth it to get rid of the limitations failover slots impose.

 
> It'd also be necessary to move
> CheckPointReplicationSlots() out of CheckPointGuts() to the start of a
> checkpoint/restartpoint when WAL writing is still permitted, like the
> failover slots patch does.

Ugh. That makes me rather wary.

Your comments say it's called in CheckPointGuts for convenience... and really there doesn't seem to be anything that makes a slot checkpoint especially tied to a "real" checkpoint.
 
> Basically, failover slots as a plugin using a hook, without the
> additions to base backup commands and the backup label.

I'm going to be *VERY* hard to convince that adding a hook inside
checkpointing code is acceptable.

Yeah... it's in ReplicationSlotSave, but it's still a slot checkpoint even if (per above) it ceases to also be in the middle of a full system checkpoint.
 
> I'd really hate 9.6 to go out with - still - no way to use logical decoding
> in a basic, bog-standard HA/failover environment. It overwhelmingly limits
> their utility and it's becoming a major drag on practical use of the
> feature. That's a difficulty given that the failover slots patch isn't
> especially trivial and you've shown that lazy sync of slot state is not
> sufficient.

I think the right way to do this is to focus on failover for logical
rep, with separate slots. The whole idea of integrating this with physical
rep imo makes this a *lot* more complex than necessary. Not all that
many people are going to want to do physical rep and logical rep.

If you're saying we should focus on failover between nodes that're themselves connected using logical replication rather than physical replication, I really have to strongly disagree.

TL;DR for book-length below: We're a long, long way from being able to deliver even vaguely decent logical rep based failover. Without that or support for logical decoding to survive physical failover we've got a great tool in logical decoding that can't be used effectively with most real-world systems.


I originally thought logical rep based failover was the way forward too and that mixing physical and logical rep didn't make sense.

The problem is that we're a very, very long way from there, whereas we can deliver failover of logical decoding clients to physical standbys with _relative_ ease and simplicity. Not actually easy or simple, but a lot closer.

To allow logical rep and failover to be a reasonable substitute for physical rep and failover IMO we *need*:

* Robust sequence decoding and replication. If you were following the later parts of that discussion you will've seen how fun that's going to be, but it's the simplest of all of the problems.

* Logical decoding and sending of in-progress xacts, so the logical client can already be most of the way through receiving a big xact when it commits. Without this we have a huge lag spike whenever a big xact happens, since we must first finish decoding it into a reorder buffer and can only then *begin* to send it to the client. During which time no later xacts may be decoded or replayed to the client. If you're running that rare thing, the perfect pure OLTP system, you won't care... but good luck finding one in the real world.

* Either parallel apply on the client side or at least buffering of in-progress xacts on the client side so they can be safely flushed to disk and confirmed, allowing receive to continue while replay is done on the client. Otherwise sync rep is completely impractical... and there's no shortage of systems out there that can't afford to lose any transactions. Or at least have some crucial transactions they can't lose.

* Robust, seamless DDL replication, so things don't just break randomly. This makes the other points above look nice and simple by comparison. Logical decoding of 2PC xacts with DDL would help here, as would the ability to transparently convert an xact into a prepare-xact on client commit and hold the client waiting while we replicate it, confirm the successful prepare on the replica, then commit prepared on the upstream.

* oh, and some way to handle replication of shared catalog changes like pg_authid, so the above DDL replication doesn't just randomly break if it happens to refer to a global object that doesn't exist on the downstream.
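The lag-spike point in the list above (a big xact must be fully decoded into the reorder buffer before any of it is sent) can be illustrated with a toy model. This is a deliberately simplified Python sketch with a hypothetical class name; it is not the real reorder buffer, only a demonstration that nothing reaches the client until commit.

```python
from collections import defaultdict


class ReorderBufferModel:
    """Toy model: buffer changes per xid, release them only at commit."""

    def __init__(self):
        self._pending = defaultdict(list)  # xid -> changes decoded so far
        self.sent = []                     # (xid, change) in client-visible order

    def change(self, xid, change):
        # Decoding an in-progress xact only queues the change.
        self._pending[xid].append(change)

    def commit(self, xid):
        # Only now does the whole transaction go to the client, in one burst.
        for c in self._pending.pop(xid, []):
            self.sent.append((xid, c))


rb = ReorderBufferModel()
rb.change(1, "big-1")    # xid 1 is the large transaction
rb.change(2, "small-1")  # xid 2 is a small concurrent one
rb.change(1, "big-2")
rb.commit(2)             # small xact is sent as soon as it commits
rb.commit(1)             # big xact is sent only after its commit
```

In this model the client sees nothing of xid 1 until `commit(1)`, however long the transaction ran, which is the delay that streaming of in-progress xacts would remove.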


Physical rep *works*. Robustly. Reliably. With decent performance. It's proven. It supports sync rep. I'm confident telling people to use it.

I don't think there's any realistic way we're going to get there for logical rep in 9.6+n for n<2 unless a whole lot more people get on board and work on it. Even then.

Right now we can deliver logical failover for DBs that:

(a) only use OWNED BY sequences like SERIAL, and even then only with some hacks;
(b) don't do DDL, ever, or only do some limited DDL via direct admin commands where they can call some kind of helper function to queue and apply the DDL;
(c) don't do big transactions or don't care about unbounded lag;
(d) don't need synchronous replication or don't care about possibly large delays before commit is confirmed;
(e) only manage role creation (among other things) via very strict processes that can be absolutely guaranteed to run on all nodes

... which in my view isn't a great many databases.

Physical rep has *none* of those problems. (Sure, it has others, but we're used to them). It does lack any way for a logical client to follow failover though, meaning that right now it's really hard to use logical rep in conjunction with physical rep. Anything you use logical decoding for has to be able to cope with completely breaking and being resynced from a new snapshot after failover, which removes a lot of the advantages the reliable, ordered replay from logical decoding gives us in the first place.

Just one example: right now BDR doesn't like losing nodes without warning, as you know. We want to add support for doing recovery replay from the most-ahead peer of a lost node to help with that, though the conflict handling implications of that could be interesting. But even then replacing the lost node still hurts when you're working over a WAN. In one real-world case I've dealt with it took over 8 hours to bring up a replacement for a lost node over the WAN after the original node's host suffered an abrupt hardware failure. If we could've just had a physical sync standby for each BDR node running locally on the same LAN as the main node (and dealt with the fun issues around the use of the physical timeline in BDR's node identity keys) we could've just promoted it to replace the lost node with minimal disruption.

You could run two nodes on each site, but then you either double your WAN bandwidth use or have to support complex non-mesh topologies with logical failover-candidate standbys hanging off each node in the mesh.

That's just BDR though. You can't really use logical replication for things like collecting an audit change stream, feeding business message buses and integration systems, replicating to non-PostgreSQL databases, etc, if you can't point it at a HA upstream and expect it to still work after failover. Since we can't IMO deliver logical replication based HA in a reasonable timeframe, that means we should really have a way for logical slots to follow physical failover.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
