Re: Timeline following for logical slots - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: Timeline following for logical slots
Date
Msg-id CAMsr+YEvDZ2HgbOHj0x3Q_JUbkS88XE8mz+R0TRBDpebZCmDUA@mail.gmail.com
In response to Re: Timeline following for logical slots  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Timeline following for logical slots  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
On 7 April 2016 at 23:32, Robert Haas <robertmhaas@gmail.com> wrote:
 
>> Yeah. I understand the reasons for that decision. Per an earlier reply I
>> think we can avoid making them WAL-logged so they can be used on standbys
>> and still achieve usable failover support on physical replicas.

> I think at one point we may have discussed doing this via additional
> side-channel protocol messages.  Is that what you are thinking about
> now, or something else?

Essentially, yes.

The way I'd like to do it in 9.6+1 is:


- Require that the replica(s) use streaming replication with a replication slot to connect to the master

- Extend the feedback protocol to allow the replica to push its required catalog_xmin up to the master so it doesn't vacuum away catalog tuples still needed by a replica. (There's no need for it to push the restart_lsn, and it's fine for the master to throw away WAL still needed by a replica).

- Track the replica's catalog_xmin on the replica's slot. So it'll be a slot used for physical replication that also has a catalog_xmin set.

- Allow applications to create their own slots on read-replicas (for apps that want to decode from a standby). 

- For transparent failover, sync slot state from the master to the replica via writes by a helper to a table on the master that get applied by a helper on the standby.
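
To make that last point a little more concrete, here's a rough sketch of what the master-side helper could do. It's plain Python/psycopg2 standing in for a real bgworker, and the table name logical_slot_state is just something I've made up for illustration; the SQL is the interesting part:

# Sketch only: periodically copy logical slot state into an ordinary table on
# the master, so the rows are WAL-logged and reach the standby through normal
# replication, where a second helper would apply them.
import time
import psycopg2

def snapshot_slots(master_dsn):
    conn = psycopg2.connect(master_dsn)
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS logical_slot_state (
            slot_name           text PRIMARY KEY,
            plugin              text,
            catalog_xmin        xid,
            restart_lsn         pg_lsn,
            confirmed_flush_lsn pg_lsn,
            captured_at         timestamptz
        )""")
    while True:
        cur.execute("""
            INSERT INTO logical_slot_state
            SELECT slot_name, plugin, catalog_xmin, restart_lsn,
                   confirmed_flush_lsn, now()
              FROM pg_replication_slots
             WHERE slot_type = 'logical'
            ON CONFLICT (slot_name) DO UPDATE
               SET catalog_xmin        = EXCLUDED.catalog_xmin,
                   restart_lsn         = EXCLUDED.restart_lsn,
                   confirmed_flush_lsn = EXCLUDED.confirmed_flush_lsn,
                   captured_at         = EXCLUDED.captured_at""")
        time.sleep(10)

The standby-side half is the harder part, since there's currently no SQL-level way to (re)create a slot with a given state on a standby; that bit needs server support.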



Allowing apps to create slots on a replica can be used by aware apps to do failover, but only if they know about and can connect to all the failover-candidate replica(s), and they have to maintain and advance a slot on each. Potentially fragile. So this is mostly good for supporting decoding from a standby, rather than failover.
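
For what it's worth, the bookkeeping an "aware" app would be signing up for looks roughly like this sketch (Python/psycopg2; the DSN list and slot name are made up, and it relies on the proposed ability to create and advance logical slots on a standby, so none of it works on an unpatched 9.6):

# Illustration of the fragility: the app must know every failover candidate,
# create its slot on each of them, and keep every copy advanced as it consumes
# from the master. Miss one and it either pins catalog_xmin or is useless
# after promotion.
import psycopg2

REPLICA_DSNS = ["host=replica1 dbname=app", "host=replica2 dbname=app"]

def ensure_slot_everywhere(slot_name, plugin):
    for dsn in REPLICA_DSNS:
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        cur = conn.cursor()
        cur.execute("SELECT 1 FROM pg_replication_slots WHERE slot_name = %s",
                    (slot_name,))
        if cur.fetchone() is None:
            # Only works on a standby with the proposed feature.
            cur.execute("SELECT pg_create_logical_replication_slot(%s, %s)",
                        (slot_name, plugin))
        conn.close()

def advance_slot_everywhere(slot_name, confirmed_lsn):
    # After confirming up to confirmed_lsn on the master, replay-and-discard
    # up to the same LSN on every replica so each copy of the slot keeps up.
    for dsn in REPLICA_DSNS:
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        cur = conn.cursor()
        cur.execute("SELECT count(*) FROM pg_logical_slot_get_changes(%s, %s, NULL)",
                    (slot_name, confirmed_lsn))
        conn.close()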

I really want easy, practical failover that doesn't require every app to re-implement logic to keep track of replicas itself, etc. For that I'd have a bgworker that runs on the replica make a direct libpq connection to the master and snapshot the state of its slots plus its xlog insert position. The worker would wait until the replica's replay reaches/passes that xlog position before applying the slot state copied from the master to the replica. (Adding a "GET SLOT STATE" command or whatever to the walsender interface would make this less ugly.) This basically emulates what failover slots did, but lazily: with no hook to capture slot state saves, we have to poll the master. With no ability to insert the changes into WAL and run a custom redo function on the replica, we have to manually ensure they're applied at the right time. Unlike with failover slots it's possible for the slot on the replica to be behind where it was on the master at the same LSN - but that's OK, because we're protecting against catalog vacuum, per above.
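
As a sketch of that worker (again plain Python/psycopg2 standing in for C, with made-up connection strings; the final "apply" step is only a placeholder since no SQL-level API for it exists yet):

# Lazily emulate failover slots: poll the master's slot state, wait for the
# replica to replay past the point the snapshot was taken, then apply it.
import time
import psycopg2

MASTER_DSN = "host=master dbname=postgres"
REPLICA_DSN = "host=replica dbname=postgres"

def apply_slot_state(replica_conn, slot_row):
    # Placeholder: writing slot state on a standby needs server-side support;
    # this is the part a patched bgworker / redo-like code would do.
    raise NotImplementedError

def sync_once():
    master = psycopg2.connect(MASTER_DSN)
    mcur = master.cursor()
    # Snapshot the master's logical slots, then its xlog insert position.
    mcur.execute("""
        SELECT slot_name, plugin, catalog_xmin, restart_lsn, confirmed_flush_lsn
          FROM pg_replication_slots
         WHERE slot_type = 'logical'""")
    slots = mcur.fetchall()
    mcur.execute("SELECT pg_current_xlog_insert_location()")
    snapshot_lsn = mcur.fetchone()[0]
    master.close()

    replica = psycopg2.connect(REPLICA_DSN)
    rcur = replica.cursor()
    # Don't apply the copied state until the replica has replayed past the
    # LSN we snapshotted at, so a slot can never point ahead of the data.
    while True:
        rcur.execute("SELECT pg_last_xlog_replay_location() >= %s", (snapshot_lsn,))
        if rcur.fetchone()[0]:
            break
        time.sleep(1)
    for slot_row in slots:
        apply_slot_state(replica, slot_row)
    replica.close()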

(I'd really like to have a slot flag that lets us disallow replay from such copied slots on the replica, marking them as usable only if not in recovery.)

The only part of this that isn't possible on 9.6 is having the replica push the catalog_xmin up to the master over feedback. But we can emulate that with a bgworker on the replica that maintains an additional dummy logical slot on the master. It replays the dummy slot to the lowest confirmed_lsn of any slot on the replica. Somewhat inefficient, but will work.
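
A sketch of that emulation (Python/psycopg2 again; the dummy slot, here called catalog_xmin_hold, is a name I've invented and is assumed to have been created on the master with pg_create_logical_replication_slot(); its output is simply thrown away):

# Advance a dummy logical slot on the master to the lowest confirmed_flush_lsn
# of any logical slot on the replica, so the master keeps the catalog tuples
# the replica's slots still need.
import psycopg2

DUMMY_SLOT = "catalog_xmin_hold"

def hold_catalog_xmin(master_dsn, replica_dsn):
    replica = psycopg2.connect(replica_dsn)
    rcur = replica.cursor()
    rcur.execute("""
        SELECT confirmed_flush_lsn
          FROM pg_replication_slots
         WHERE slot_type = 'logical' AND confirmed_flush_lsn IS NOT NULL
         ORDER BY confirmed_flush_lsn
         LIMIT 1""")
    row = rcur.fetchone()
    replica.close()
    if row is None:
        return  # no logical slots on the replica yet, nothing to hold back

    master = psycopg2.connect(master_dsn)
    master.autocommit = True
    mcur = master.cursor()
    # Replay and discard the dummy slot's changes up to that LSN; this moves
    # its confirmed position forward and lets its catalog_xmin advance too.
    mcur.execute("SELECT count(*) FROM pg_logical_slot_get_changes(%s, %s, NULL)",
                 (DUMMY_SLOT, row[0]))
    master.close()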

If it sounds like a bit of a pile of hacks, that's because the failover support part is. But unlike failover slots it will bring us closer to being able to do logical decoding from a replica, which is nice. It can be made a lot less ugly if the walsender's help can be enlisted to report the master's slot state, so we don't have to use normal libpq. (The reason I wouldn't use a bgworker on the master to write it to a table and then another worker to apply changes from that table on the replica is mainly that then we can't have failover support for cascading replicas, which can't write WAL.)
 
> Well, we can agree to disagree on this.  I don't think that it's all
> that difficult to figure out how to change your schema in a
> step-by-step way that allows logical replication to keep working while
> the nodes are out of sync, but you don't have to agree and that's
> fine.  I'm not objecting to eventually adding that feature to core.  I
> do think it's a bad idea to be polishing that sort of thing before
> getting some more basic facility into core.

That much I agree on - I certainly don't want to block this on DDL replication.
 
> While I acknowledge that a logical output plugin has applications
> beyond replication, I think replication is the preeminent one by a
> considerable margin.  Some people want it for other purposes, and we
> should facilitate that.  But that number of people is dwarfed by the
> number who would use a seamless logical replication facility if we had
> one available.  And so that's the thing I think we should focus on
> making work.

Agreed.

Really I think we'll want a separate json output plugin for most of those other uses anyway. Though some of the facilities in pglogical_output will need to be extracted, added into logical decoding itself, and shared.
 
> If I were doing that, I think I would attack it from a considerably
> different direction than what has so far been proposed.  I would try
> to put the stuff in core, not contrib, and I would arrange to control
> it using DDL, not function calls.  For version one, I would cut all of
> the stuff that allows data to be sent in any format other than text,
> and just use in/outfuncs all the time.

I'm very hesitant to try to do this with new DDL. Partly for complexity, partly because I'd really like to be able to carry a backport for 9.4 / 9.5 / 9.6 so people can use it within the next couple of years.
 
> I do generally think that logical decoding relies too much on trying
> to set up situations where it will never fail, and I've said from the
> beginning that it should make more provision to cope with failure
> rather than just trying to avoid it.  If logical decoding never
> breaks, great.  But the approach I would favor is to set things up so
> that it automatically reclones if there is a replication break, and
> then as an optimization project, try to eliminate those cases one by
> one.

I really can't agree there.

For one thing, sometimes those clones are *massive*. How do you tell someone who's replicating a 10 TiB database that they've got to let the whole thing re-sync, and by the way all replication will completely halt until it does?

It's bad enough with physical standbys, though at least there rsync helps a bit and pg_rewind has made a huge difference. Let's not create the same problem again in logical replication.

Then, as Petr points out, there are applications where you can't re-clone, at least not directly. You're using the decoding stream with a transform downstream to insert incoming data into fact tables. You're feeding it into a messaging system. You're merging data from multiple upstreams into a single downstream. Many of the interesting, exciting things we can make possible with logical replication, things that simply aren't possible with physical replication, really need it not to randomly break.

Also, from an operational experience point of view, BDR has places where it does just break if you do something wrong. Experience with this has left me absolutely adamant that we must not have such booby-traps in core logical replication, at least in the medium term. It doesn't matter how many times you tell users "don't do that" ... they'll do it. Then get angry when it breaks. Not to mention how hard it can be to discover why something broke. You have to look at the logs. Obvious to you or me, but I spend a lot of time answering questions about BDR and pglogical to the effect of "not working? nothing happening? LOOK AT THE LOG FILES."

I think you're really over-estimating the average user when it comes to analysing and understanding the consequences of specific schema changes, etc. Sure, I think it's fine not to support some things we do support for physical replication, but as much as possible we should stop the user from doing those things when logical replication is enabled, and where that's not possible it needs to be really, really obvious what broke and why.
 
> That isn't really my perception of how things have gone, but I admit I
> may not have been following it as closely as I should have done.  I'd
> be happy to talk with you about this in person if you're going to be
> at PGCONF.US or PGCon in Ottawa; or there's always Skype.  I don't see
> any reason why we can't all put our heads together and agree on a
> direction that we can all live with.

Yeah. I'd like to see us able to work on a shared dev tree rather than mailing patches around, at least. 

I'm not going to make it to PgConf.US. I don't know about Ottawa yet, but I doubt it given that I did go to Brussels. Perth is absurdly far from almost everywhere, unfortunately.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
