Re: Timeline following for logical slots - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: Timeline following for logical slots
Date
Msg-id CAMsr+YGOVpFdh1Xr4j8pAUncLbHR1bHkDX4Q+NnODj-h4hh8kQ@mail.gmail.com
In response to Re: Timeline following for logical slots  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 31 March 2016 at 16:09, Andres Freund <andres@anarazel.de> wrote:
 
>> This gives us an option for failover of logical replication in 9.6, even if
>> it's a bit cumbersome and complex for the client, in case failover slots
>> don't make the cut. And, of course, it's a pre-req for failover slots,
>> which I'll rebase on top of it shortly.
>
> FWIW, I think it's dangerous to use this that way. If people manipulate
> slots that way we'll have hellish issues to debug.

I think it's pretty unsafe from SQL, to be sure.

Unless failover slots get into 9.6 we'll need to do exactly that from internal C code in pglogical to support following physical failover, though. I'm not thrilled by that, especially since we have no way (in 9.6) to insert generic WAL records into xlog to let the slot updates accurately pace replay. I don't mean logical WAL messages here; they wouldn't help. It'd actually be pluggable generic WAL that we'd need to sync slots fully consistently in the absence of failover slots.

However: it's safe for the slot state on the replica to be somewhat behind the replica's local replay of the master's WAL, i.e. older than it would've been at the equivalent point in time on the master, so long as it isn't so far behind that the replica has already replayed vacuum activity that removes rows still required by the slot's declared catalog_xmin. Even then it should be OK in practice, because the failing-over client will always have replayed past that point on the master (otherwise catalog_xmin couldn't have advanced on the master), so it'll always ask to start replay at or after the LSN where catalog_xmin was advanced to its most recent value on the old master before failover. It's only safe for walsender-based decoding, though, since there's no way to specify the start point for SQL-level decoding.
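
For concreteness, the client side of that looks roughly like the sketch below: open a replication connection to the promoted replica and ask to start at the last LSN the client had confirmed on the old master. The host, slot name and LSN are invented and the streaming/feedback loop is omitted, so treat it as illustrative only.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn;
    PGresult   *res;

    /* replication=database selects the logical replication protocol */
    conn = PQconnectdb("host=promoted-replica dbname=appdb replication=database");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Resume decoding at the client's last confirmed position. */
    res = PQexec(conn, "START_REPLICATION SLOT \"my_slot\" LOGICAL 2/A0000028");
    if (PQresultStatus(res) != PGRES_COPY_BOTH)
        fprintf(stderr, "could not start replication: %s", PQerrorMessage(conn));

    /* ... stream changes with PQgetCopyData() and send standby feedback ... */

    PQclear(res);
    PQfinish(conn);
    return 0;
}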

My only real worry there is whether invalidations are a concern: if internal replay starts at the restart_lsn and passes over a stretch of history whose catalog entries have been vacuumed before it starts collecting rows to return to the decoding client at the requested start point, can that cause problems with invalidation processing? I haven't got my head around what, exactly, logical decoding does with its invalidations yet. I didn't find any problems in testing, but I wasn't sure how to create the conditions under which failures might occur.

It's also OK for the slot state to be *ahead* of the replica's replay position, i.e. from the future. restart_lsn and catalog_xmin will be higher than they were on the master at the point the replica has replayed to, but that doesn't matter at all, since the client will always request to start from that point or later, having replayed up to there on the old master before failover.

It's even possible for the restart_lsn and catalog_xmin to be in the replica's future, i.e. the restart_lsn greater than the current replay LSN and the catalog_xmin greater than the next xid. In that case we know the logical client replayed state from the old master that hadn't been applied to the replica by the time it was promoted. If this state of affairs exists when we promote a replica to a master we should really kill the slot, since the client has obviously replayed from the old master past the point of divergence with the promoted replica, and there's potentially an unresolvable timeline fork. I'd like to add this if I can manage it, probably by emitting a WARNING and then dropping the slot.
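
A sketch of what that promotion-time check might look like is below. It's not from the patch; locking around the slot array is hand-waved and the names are just the obvious candidates from slot.h/xlog.h, so take it as an illustration of the idea rather than a proposal.

#include "postgres.h"
#include "access/transam.h"
#include "access/xlog.h"
#include "replication/slot.h"

/*
 * At promotion, drop any logical slot whose state is from the old
 * master's future relative to what this replica ever replayed.
 */
static void
DropSlotsFromFuture(void)
{
    XLogRecPtr    replayed = GetXLogReplayRecPtr(NULL);
    TransactionId next_xid = ReadNewTransactionId();
    int           i;

    /* Real code would hold ReplicationSlotControlLock here. */
    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

        if (!s->in_use || !SlotIsLogical(s))
            continue;

        if (s->data.restart_lsn > replayed ||
            (TransactionIdIsValid(s->data.catalog_xmin) &&
             TransactionIdFollows(s->data.catalog_xmin, next_xid)))
        {
            ereport(WARNING,
                    (errmsg("dropping slot \"%s\": its state is ahead of the promoted replica",
                            NameStr(s->data.name))));
            ReplicationSlotDrop(NameStr(s->data.name));
        }
    }
}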

Needless to say I'd really rather just have failover slots, where we know the slot will be consistent with the DB state and updated in sync with it. This is a fallback plan and I don't like it too much.
 
> The test code needs a big disclaimer to never ever be used in production,
> and we should "disclaim any warranty" if somebody does that. To the point
> of not fixing issues around it in back branches.

I agree. The README has just that, and there are comments above the individual functions like:

 * Note that this is test harness code. You shouldn't expose slot internals
 * to SQL like this for any real world usage. See the README.

>> Andres, I tried to address your comments as best I could. The main one that
>> I think stayed open was about the loop that finds the last timeline on a
>> segment. If you think that's better done by directly scanning the List* of
>> timeline history entries I'm happy to prep a follow-up.

> Have to look again.

> +        * We start reading xlog from the restart lsn, even though in
> +        * CreateDecodingContext we set the snapshot builder up using the
> +        * slot's confirmed_flush. This means we might read xlog we don't
> +        * actually decode rows from, but the snapshot builder might need it
> +        * to get to a consistent point. The point we start returning data to
> +        * *users* at is the confirmed_flush lsn set up in the decoding
> +        * context.
> +        */
>
> still seems pretty misleading - and pretty much unrelated to the
> callsite.


I think it's pretty confusing that the function uses the slot's restart_lsn and never refers to confirmed_flush, which is where it actually starts decoding from as far as the user's concerned. It took me a while to figure out what was going on there.

I think the comment would be less necessary, and could be moved up to the CreateDecodingContext call, if we passed the slot's confirmed_flush explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr and having the CreateDecodingContext call infer the start point. That way what's going on would be a lot clearer when reading the code.
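
Concretely, I mean something along these lines, using the SQL-level callsite in logicalfuncs.c as the example (the exact callsite and surrounding code in the patch will differ, so this is just a sketch of the idea):

/*
 * Rather than passing InvalidXLogRecPtr and letting
 * CreateDecodingContext() fall back to the slot's confirmed_flush,
 * name the start point explicitly; the slot is already acquired
 * here, so MyReplicationSlot is valid.
 */
ctx = CreateDecodingContext(MyReplicationSlot->data.confirmed_flush,
                            options,
                            logical_read_local_xlog_page,
                            LogicalOutputPrepareWrite,
                            LogicalOutputWrite);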
 
You're thinking from the perspective of someone who knows this code intimately. To you it's obvious that restart_lsn controls where the xlogreader starts processing and feeding into the snapshot builder; similarly, it's obvious that the decoding context defaults to the confirmed_flush point, which is where we start returning decoded rows to users. That really will NOT be obvious to most readers, though, and I think we could stand to make the logical decoding code easier to understand as it gets more and more use. I'm trying to help with that, and while I suffer from not knowing the code as well as you do, that also means I can still see some of the things that might confuse newer readers. And then get them wrong when explaining them ;)

Speaking of which, did you see the proposed README I sent for src/backend/replication/logical ?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
