Re: [HACKERS] WIP: Failover Slots - Mailing list pgsql-hackers
From: Craig Ringer
Subject: Re: [HACKERS] WIP: Failover Slots
Msg-id: CAMsr+YGfaT1N_0cofrDh8ePu604GRnziweJLYQjTk=O3zH=uog@mail.gmail.com
In response to: Re: [HACKERS] WIP: Failover Slots (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 9 August 2017 at 23:42, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> - When a standby connects to a master, it can optionally supply a list
>>> of slot names that it cares about.
>>
>> Wouldn't that immediately exclude use for PITR and snapshot recovery? I have
>> people right now who want the ability to promote a PITR-recovered snapshot
>> into place of a logical replication master and have downstream peers replay
>> from it. It's more complex than that, as there's a resync process required
>> to recover changes the failed node had sent to other peers but isn't
>> available in the WAL archive, but that's the gist.
>>
>> If you have a 5TB database do you want to run an extra replica or two
>> because PostgreSQL can't preserve slots without a running, live replica?
>> Your SAN snapshots + WAL archiving have been fine for everything else so
>> far.
> OK, so what you're basically saying here is that you want to encode
> the failover information in the write-ahead log rather than passing it
> at the protocol level, so that if you replay the write-ahead log on a
> time delay you get the same final state that you would have gotten if
> you had replayed it immediately. I hadn't thought about that
> potential advantage, and I can see that it might be an advantage for
> some reason, but I don't yet understand what the reason is. How would
> you imagine using any version of this feature in a PITR scenario? If
> you PITR the master back to an earlier point in time, I don't see how
> you're going to manage without resyncing the replicas, at which point
> you may as well just drop the old slot and create a new one anyway.
I've realised that it's possible to work around it in app-space anyway. You create a new slot on a node before you snapshot it, and you don't drop that slot until you discard the snapshot. The existence of the slot ensures that any WAL generated by the node (and replayed by PITR after restore) cannot clobber a catalog_xmin that's still needed. If we xlog catalog_xmin advances, or have some other safeguard in place (which we need for logical decoding on standby to be safe anyway), then we can fail gracefully if the user does something dumb.
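To make that concrete, here's a minimal sketch of the workaround using the standard slot functions; the slot name and output plugin below are arbitrary examples:

    -- Before taking the SAN snapshot / base backup: create a slot whose
    -- catalog_xmin pins the catalog rows any restored copy will need.
    SELECT pg_create_logical_replication_slot('snapshot_pin', 'test_decoding');

    -- ... take the snapshot and keep it for as long as needed ...

    -- Only once the snapshot has been discarded:
    SELECT pg_drop_replication_slot('snapshot_pin');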
So no need to care about this.
(What I wrote previously on this was):
You definitely can't just PITR restore and pick up where you left off.
You need a higher level protocol between replicas to recover. For example, in a multi-master configuration, this can be something like (simplified):
* Use the timeline history file to find the LSN at which we diverged from our "future self", the failed node
* Connect to the peer and do logical decoding, with a replication origin filter for "originating from me", for xacts from the divergence LSN up to the peer's current end-of-WAL.
* Reset the peer's replication origin for us to our new end-of-WAL, and resume replication (sketched below)
To make that possible, since we can't rewind slots once they're confirmed advanced, maintain a backup slot on the peer corresponding to the point in time at which the snapshot was taken.
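As a rough sketch of that last step, assuming the peer tracks the restored node with a replication origin (the origin name and LSN below are placeholders):

    -- On the peer, after the restored node has pulled back its lost changes
    -- via logical decoding: advance the origin tracking the restored node to
    -- that node's new end-of-WAL, then resume normal replication from it.
    SELECT pg_replication_origin_advance('restored_node_origin', '0/5D0A2B8'::pg_lsn);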
For most other situations there is little benefit vs just re-creating the slot before you permit user-initiated write xacts to begin on the restored node.
I can accept an argument that "we" as pgsql-hackers do not consider this something worth caring about, should that be the case. It's niche enough that you could argue it doesn't have to be supportable in stock postgres.
> Maybe you're thinking of a scenario where we PITR the master and also
> use PITR to rewind the replica to a slightly earlier point?
That can work, but it must be done in lock-step. You have to pause apply on both ends for long enough to snapshot both, otherwise the replication origins on one end get out of sync with the slots on the other.
Interesting, but I really hope nobody's going to need to do it.
> But I can't quite follow what you're thinking about. Can you explain
> further?
Gladly.
I've been up to my eyeballs in this for years now, and sometimes it becomes quite hard to see the outside perspective, so thanks for your patience.
>> Requiring live replication connections could also be an issue for service
>> interruptions, surely? Unless you persist needed knowledge in the physical
>> replication slot used by the standby to master connection, so the master can
>> tell the difference between "downstream went away for a while but will come
>> back" and "downstream is gone forever, toss out its resources."
> I don't think the master needs to retain any resources on behalf of
> the failover slot. If the slot has been updated by feedback from the
> associated standby, then the master can toss those resources
> immediately. When the standby comes back on line, it will find out
> via a protocol message that it can fast-forward the slot to whatever
> the new LSN is, and any WAL files before that point are irrelevant on
> both the master and the standby.
OK, so you're envisioning that every slot on a downstream has a mirror slot on the upstream, and that is how the master retains the needed resources.
>> Also, what about cascading? Lots of "pull" model designs I've looked at tend
>> to fall down in cascaded environments. For that matter so do failover slots,
>> but only for the narrower restriction of not being able to actually decode
>> from a failover-enabled slot on a standby, they still work fine in terms of
>> cascading down to leaf nodes.
> I don't see the problem. The cascaded standby tells the standby "I'm
> interested in the slot called 'craig'" and the standby says "sure,
> I'll tell you whenever 'craig' gets updated" but it turns out that
> 'craig' is actually a failover slot on that standby, so that standby
> has said to the master "I'm interested in the slot called 'craig'" and
> the master is therefore sending updates to that standby. Every time
> the slot is updated, the master tells the standby and the standby
> tells the cascaded standby and, well, that all seems fine.
Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1 correspondence, and using globally unique slot names to manage state.
If slot names collide, you presumably fail with "er, don't do that then", or scramble data horribly. Both of which we certainly have precedent for in Pg (see, e.g., what happens if two snapshots of the same node are in archive recovery, promote to the same timeline, and then start archiving to the same destination...). So not a showstopper.
I'm pretty OK with that.
> Also, as Andres pointed out upthread, if the state is passed through
> the protocol, you can have a slot on a standby that cascades to a
> cascaded standby; if the state is passed through the WAL, all slots
> have to cascade from the master.
Yes, that's my main hesitation with the current failover slots, as mentioned in the prior message.
> Generally, with protocol-mediated
> failover slots, you can have a different set of slots on every replica
> in the cluster and create, drop, and reconfigure them any time you
> like. With WAL-mediated slots, all failover slots must come from the
> master and cascade to every standby you've got, which is less
> flexible.
Definitely agreed.
Different standbys don't know about each other, so it's the user's job to ensure uniqueness, using the slot name as a key.
> I don't want to come on too strong here. I'm very willing to admit
> that you may know a lot more about this than me and I am really
> extremely happy to benefit from that accumulated knowledge.
The flip side is that I've also been staring at the problem, on and off, for WAY too long. So other perspectives can be really valuable.
> If you're
> saying that WAL-mediated slots are a lot better than protocol-mediated
> slots, you may well be right, but I don't yet understand the reasons,
> and I want to understand the reasons. I think this stuff is too
> important to just have one person saying "here's a patch that does it
> this way" and everybody else just says "uh, ok". Once we adopt some
> proposal here we're going to have to continue supporting it forever,
> so it seems like we'd better do our best to get it right.
I mostly agree there. We could have converted WAL-based failover slots to something else relatively easily in a major version bump, which is why I wanted to get them in place for 9.6 and then later for Pg 10: people were (and are) constantly asking me and others who work on logical replication tools why this doesn't work, and a 90% solution that doesn't paint us into a corner seemed just fine.
I'm quite happy to find a better one. But I can't spend a lot of time writing something only to have it knocked back because the scope has been increased again, it now has to do more, and it needs another rewrite.
So, how should this look if we're using the streaming rep protocol?
How about:
A "failover slot" is identified by a field in the slot struct and exposed in pg_replication_slots. It can be null (not a failover slot). It can indicate that the slot was created locally and is "owned" by this node, in which case all downstreams should mirror it. It can also indicate that it is a mirror of an upstream slot, in which case clients may not replay from it until it's promoted to an owned slot and ceases to be mirrored. Attempts to replay from a mirrored slot just ERROR, and will do so even once decoding on standby is supported.
This promotion happens automatically if a standby is promoted to a master, and can also be done manually via a SQL function call or a walsender command, to allow for an internal promotion within a cascading replica chain.
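For instance, an internal promotion might look something like this; the function name is purely hypothetical, just to illustrate the interface I have in mind:

    -- Hypothetical: turn a mirrored copy of slot 'craig' on this replica into
    -- an owned, decodable slot without promoting the whole node.
    SELECT pg_promote_replication_slot('craig');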
When a replica connects to an upstream it asks, via a new walsender message, "send me the state of all your failover slots". Any local mirror slots are updated. If the upstream no longer lists a slot, it's known to have been dropped, and the corresponding mirror slot is dropped on the downstream.
The upstream walsender then sends periodic slot state updates while connected, so replicas can advance their mirror slots, and in turn send hot_standby_feedback that gets applied to the physical replication slot used by the standby, freeing resources held for the slots on the master.
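The resource-retention side of that is already visible today via the standby's physical slot on the master (the slot name here is just an example):

    -- On the master: what is being held back on behalf of the standby.
    -- With hot_standby_feedback enabled, xmin tracks the standby's feedback.
    SELECT slot_name, slot_type, xmin, catalog_xmin, restart_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'standby1_physical';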
One possible solution to resource retention for slots owned by a replica is to also mirror slots "up", as you alluded to: when you create an "owned" slot on a replica, it tells the master at connect time / slot creation time "I have this slot X, please copy it up the tree". The slot gets copied "up" to the master via the cascading layers, with a different failover slot type indicating that it's an up-mirror. Decoding clients aren't allowed to replay from an up-mirror slot and it cannot be promoted like a down-mirror slot can; it's only there for resource retention. A node knows its owned slot is safe to actually use, and is fully created, when it sees the walsender report it in the list of failover slots from the master during a slot state update.
This imposes some restrictions:
* failover slot names must be globally unique or things go "kaboom"
* if a replica goes away, its up-mirror slots stay dangling until the admin manually cleans them up
Tolerable, IMO. But we could fix the latter by requiring that failover slots only be enabled when the replica uses a physical slot to talk to the upstream. The up-mirror failover slots then get coupled to the physical slot by an extra field in the slot struct holding the name of the owning physical slot. Dropping that physical slot cascade-drops all up-mirror slots automatically. Admins are prevented from dropping up-mirror slots manually, which protects against screwups.
We could even fix the naming, maybe, with some kind of qualified naming based on the physical slot, but it's not worth the complexity.
It sounds a bit more complex than your sketch, but I think the four failover-slot kinds are necessary to support this (see the sketch after this list). We'll have:
* not a failover slot, purely local
* a failover slot owned by this node (will be usable for decoding on standby once supported)
* an up-mirror slot, not promoteable, resource retention only, linked to a physical slot for a given replica
* a down-mirror slot, promoteable, not linked to a physical slot; this is how a true "failover slot" is represented on a replica.
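To make the four kinds concrete, here's roughly how I'd expect them to surface in pg_replication_slots on a standby; the failover_kind column and its values are hypothetical, part of this proposal rather than anything that exists today:

    -- Hypothetical output; only slot_name and slot_type exist today.
    SELECT slot_name, slot_type, failover_kind FROM pg_replication_slots;

    --  slot_name   | slot_type | failover_kind
    -- -------------+-----------+---------------
    --  local_only  | logical   | none          -- purely local, not replicated
    --  craig       | logical   | down-mirror   -- mirrored from the master, not yet usable
    --  standby_app | logical   | owned         -- created here, mirrored up for retention
    --
    -- On the master, 'craig' would instead show as 'owned' and 'standby_app'
    -- as 'up-mirror'.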
Thoughts? Feels pretty viable to me.
Thanks for the new perspective.