Re: Timeline following for logical slots - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Timeline following for logical slots
Date
Msg-id 20160401064601.GA9074@awork2.anarazel.de
Whole thread Raw
In response to Re: Timeline following for logical slots  (Craig Ringer <craig@2ndquadrant.com>)
Responses Re: Timeline following for logical slots
Re: Timeline following for logical slots
Re: Timeline following for logical slots
List pgsql-hackers
Hi,

On 2016-04-01 13:16:18 +0800, Craig Ringer wrote:
> I think it's pretty unsafe from SQL, to be sure.
> 
> Unless failover slots get in to 9.6 we'll need to do exactly that from
> internal C stuff in pglogical to support following physical failover,

I know. And this makes me scared shitless.


I'll refuse to debug anything related to decoding, when this "feature"
has been used. And I think there'll be plenty; if you notice the issues
that is.


The more I think about it, the more I think we should rip this out
again, until we have the actual feature is in. With proper review.


> However: It's safe for the slot state on the replica to be somewhat behind
> the local replay from the master, so the slot state on the replica is older
> than what it would've been at an equivalent time on the master... so long
> as it's not so far behind that the replica has replayed vacuum activity
> that removes rows still required by the slot's declared catalog_xmin. Even
> then it should actually be OK in practice because the failing-over client
> will always have replayed past that point on the master (otherwise
> catalog_xmin couldn't have advanced on the master), so it'll always ask to
> start replay at a point at or after the lsn where the catalog_xmin was
> advanced to its most recent value on the old master before failover. It's
> only safe for walsender based decoding though, since there's no way to
> specify the startpoint for sql-level decoding.

This whole logic fails entirely flats on its face by the fact that even
if you specify a startpoint, we read older WAL and process the
records. The startpoint filters every transaction that commits earlier
than the "startpoint". But with a long-running transaction you still
will have old changes being processed, which need the corresponding
catalog state.


> My only real worry there is whether invalidations are a concern - if
> internal replay starts at the restart_lsn and replays over part of history
> where the catalog entries have been vacuumed before it starts collecting
> rows for return to the decoding client at the requested decoding
> startpoint, can that cause problems with invalidation processing?

Yes.


> It's also OK for the slot state to be *ahead* of the replica's replay
> position, i.e. from the future. restart_lsn and catalog_xmin will be higher
> than they were on the master at the same point in time, but that doesn't
> matter at all, since the client will always be requesting to start from
> that point or later, having replayed up to there on the old master before
> failover.

I don't think that's true. See my point above about startpoint.



> > Andres, I tried to address your comments as best I could. The main one
> > that
> > > I think stayed open was about the loop that finds the last timeline on a
> > > segment. If you think that's better done by directly scanning the List*
> > of
> > > timeline history entries I'm happy to prep a follow-up.
> >
> > Have to look again.
> >
> > +        * We start reading xlog from the restart lsn, even though in
> > +        * CreateDecodingContext we set the snapshot builder up using the
> > +        * slot's confirmed_flush. This means we might read xlog we don't
> > +        * actually decode rows from, but the snapshot builder might need
> > it
> > +        * to get to a consistent point. The point we start returning data
> > to
> > +        * *users* at is the confirmed_flush lsn set up in the decoding
> > +        * context.
> > +        */
> > still seems pretty misleading - and pretty much unrelated to the
> > callsite.
> >
> 
> 
> I think it's pretty confusing that the function uses's the slot's
> restart_lsn and never refers to confirmed_flush which is where it actually
> starts decoding from as far as the user's concerned. It took me a while to
> figure out what was going on there.

That's a fundamental misunderstanding on your part (perhaps created by
imprecise docs). The startpoint isn't about where to start
decoding. It's about skip calling the output plugin of a transaction if
it *commits* before the startpoint. We almost always will *decode* rows
from before the startpoint.  And that's pretty much unavoidable: The
consumer of a decoding slot only really can do anything with commit LSNs
wrt replay progress: But for a remembered progress LSN there can be a
lot of in-progress xacts (about which said client doesn't know
anything); and they start *earlier* than the commit LSN of the just
replayed xact. So we have to be able to re-process their state and send
it out.


> I think the comment would be less necessary, and could be moved up to the
> CreateDecodingContext call, if we passed the slot's confirmed_flush
> explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr
> and having the CreateDecodingContext call infer the start point. That way
> what's going on would be a lot clearer when reading the code.

How would that solve anything? We still need to process the old records,
hence the xlogreader needs to instructed to read them. Hence
logicalfuncs.c needs to know about it.


> You're thinking from the perspective of someone who knows this code
> intimately.

Maybe. But that's not addressed by adding half-true comments to places
they only belong to peripherally. And not to all the relevant places,
since you've not added the same comment to StartLogicalReplication().


> Speaking of which, did you see the proposed README I sent for
> src/backend/replication/logical ?

I skimmed it. But given we have a CF full of patches, some submitted
over a year ago, it seems unfair to spend time on a patch submitted a
few days ago.


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Kouhei Kaigai
Date:
Subject: EPQ recheck across HashJoin, what does it actuall check?
Next
From: Andres Freund
Date:
Subject: Re: Timeline following for logical slots