Re: Timeline following for logical slots - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Timeline following for logical slots |
Date | |
Msg-id | 20160401064601.GA9074@awork2.anarazel.de |
In response to | Re: Timeline following for logical slots (Craig Ringer <craig@2ndquadrant.com>) |
Responses | Re: Timeline following for logical slots |
List | pgsql-hackers |
Hi,

On 2016-04-01 13:16:18 +0800, Craig Ringer wrote:
> I think it's pretty unsafe from SQL, to be sure.
>
> Unless failover slots get into 9.6 we'll need to do exactly that from
> internal C stuff in pglogical to support following physical failover,

I know. And this makes me scared shitless. I'll refuse to debug anything
related to decoding when this "feature" has been used. And I think
there'll be plenty; if you notice the issues, that is.

The more I think about it, the more I think we should rip this out
again, until the actual feature is in. With proper review.

> However: It's safe for the slot state on the replica to be somewhat behind
> the local replay from the master, so the slot state on the replica is older
> than what it would've been at an equivalent time on the master... so long
> as it's not so far behind that the replica has replayed vacuum activity
> that removes rows still required by the slot's declared catalog_xmin. Even
> then it should actually be OK in practice because the failing-over client
> will always have replayed past that point on the master (otherwise
> catalog_xmin couldn't have advanced on the master), so it'll always ask to
> start replay at a point at or after the lsn where the catalog_xmin was
> advanced to its most recent value on the old master before failover. It's
> only safe for walsender based decoding though, since there's no way to
> specify the startpoint for sql-level decoding.

This whole logic falls entirely flat on its face given that, even if you
specify a startpoint, we read older WAL and process the records. The
startpoint only filters out transactions that *commit* earlier than the
startpoint. But with a long-running transaction you will still have old
changes being processed, and those need the corresponding catalog state.
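To illustrate the filtering just described, here is a toy sketch in plain C. It is not PostgreSQL source, and every name in it (ToyRecord, toy_decode, TOY_STARTPOINT, ...) is invented; it only models the behaviour: every record from restart_lsn onwards is read and buffered, and the "output plugin" is invoked only for transactions whose commit LSN is at or after the startpoint.

/*
 * Toy illustration only - not PostgreSQL code.
 */
#include <stdio.h>
#include <stdbool.h>

typedef unsigned long ToyLsn;

typedef struct ToyRecord
{
    unsigned    xid;
    ToyLsn      lsn;
    bool        is_commit;
    const char *change;         /* NULL for commit records */
} ToyRecord;

#define TOY_STARTPOINT  ((ToyLsn) 200)

static void
toy_decode(const ToyRecord *wal, int n, ToyLsn startpoint)
{
    for (int i = 0; i < n; i++)
    {
        const ToyRecord *r = &wal[i];

        if (!r->is_commit)
        {
            /* decoded and buffered even when r->lsn < startpoint */
            printf("buffer xid %u change \"%s\" at %lu\n",
                   r->xid, r->change, r->lsn);
        }
        else if (r->lsn < startpoint)
        {
            /* filtered at commit time, not at read time */
            printf("skip output for xid %u (commit %lu < startpoint %lu)\n",
                   r->xid, r->lsn, startpoint);
        }
        else
        {
            printf("send xid %u to the output plugin (commit %lu)\n",
                   r->xid, r->lsn);
        }
    }
}

int
main(void)
{
    /* xid 10 is long-running: starts before the startpoint, commits after */
    const ToyRecord wal[] = {
        {10, 100, false, "INSERT a"},
        {11, 110, false, "INSERT b"},
        {11, 120, true, NULL},      /* commits before startpoint: skipped */
        {10, 300, true, NULL},      /* commits after: sent, incl. change at 100 */
    };

    toy_decode(wal, 4, TOY_STARTPOINT);
    return 0;
}

Running it shows xid 10's change at LSN 100 being buffered even though 100 is before the startpoint of 200, because xid 10 only commits at 300 - which is why the old WAL, and the catalog state it needs, cannot be skipped.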
> My only real worry there is whether invalidations are a concern - if
> internal replay starts at the restart_lsn and replays over part of history
> where the catalog entries have been vacuumed before it starts collecting
> rows for return to the decoding client at the requested decoding
> startpoint, can that cause problems with invalidation processing?

Yes.

> It's also OK for the slot state to be *ahead* of the replica's replay
> position, i.e. from the future. restart_lsn and catalog_xmin will be higher
> than they were on the master at the same point in time, but that doesn't
> matter at all, since the client will always be requesting to start from
> that point or later, having replayed up to there on the old master before
> failover.

I don't think that's true. See my point above about the startpoint.

> > > Andres, I tried to address your comments as best I could. The main one
> > > that I think stayed open was about the loop that finds the last timeline
> > > on a segment. If you think that's better done by directly scanning the
> > > List* of timeline history entries I'm happy to prep a follow-up.
> >
> > Have to look again.
> >
> > +  * We start reading xlog from the restart lsn, even though in
> > +  * CreateDecodingContext we set the snapshot builder up using the
> > +  * slot's confirmed_flush. This means we might read xlog we don't
> > +  * actually decode rows from, but the snapshot builder might need it
> > +  * to get to a consistent point. The point we start returning data to
> > +  * *users* at is the confirmed_flush lsn set up in the decoding
> > +  * context.
> > +  */
> > still seems pretty misleading - and pretty much unrelated to the
> > callsite.
>
> I think it's pretty confusing that the function uses the slot's
> restart_lsn and never refers to confirmed_flush, which is where it actually
> starts decoding from as far as the user's concerned. It took me a while to
> figure out what was going on there.

That's a fundamental misunderstanding on your part (perhaps created by
imprecise docs). The startpoint isn't about where to start decoding. It's
about skipping the output-plugin calls for a transaction if it *commits*
before the startpoint. We almost always will *decode* rows from before
the startpoint, and that's pretty much unavoidable: the consumer of a
decoding slot can really only track replay progress in terms of commit
LSNs. But at any remembered progress LSN there can be a lot of
in-progress xacts (about which said client doesn't know anything), and
they start *earlier* than the commit LSN of the just-replayed xact. So we
have to be able to re-process their state and send it out.

> I think the comment would be less necessary, and could be moved up to the
> CreateDecodingContext call, if we passed the slot's confirmed_flush
> explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr
> and having the CreateDecodingContext call infer the start point. That way
> what's going on would be a lot clearer when reading the code.

How would that solve anything? We still need to process the old records,
hence the xlogreader needs to be instructed to read them. Hence
logicalfuncs.c needs to know about it.

> You're thinking from the perspective of someone who knows this code
> intimately.

Maybe. But that's not addressed by adding half-true comments to places
they only belong to peripherally - and not to all the relevant places,
since you've not added the same comment to StartLogicalReplication().

> Speaking of which, did you see the proposed README I sent for
> src/backend/replication/logical ?

I skimmed it. But given we have a CF full of patches, some submitted
over a year ago, it seems unfair to spend time on a patch submitted a
few days ago.

Greetings,

Andres Freund
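To make the CreateDecodingContext point above concrete, here is a second toy sketch. Again, it is not the real CreateDecodingContext or logicalfuncs.c code; ToySlot, ToyContext and toy_create_context are invented names. It only shows the shape of the argument: whether the caller passes the slot's confirmed_flush or an "invalid" LSN, the WAL read position is the slot's restart_lsn either way, because the older records are still needed; the startpoint only sets the commit-filter threshold, which defaults to confirmed_flush anyway.

/*
 * Toy illustration only - not PostgreSQL code.
 */
#include <stdio.h>

typedef unsigned long ToyLsn;
#define TOY_INVALID_LSN ((ToyLsn) 0)

typedef struct ToySlot
{
    ToyLsn  restart_lsn;        /* where WAL reading has to begin */
    ToyLsn  confirmed_flush;    /* what the client has confirmed as replayed */
} ToySlot;

typedef struct ToyContext
{
    ToyLsn  read_from;          /* reader start position */
    ToyLsn  skip_commits_before;/* output-plugin filter threshold */
} ToyContext;

static ToyContext
toy_create_context(const ToySlot *slot, ToyLsn startpoint)
{
    ToyContext ctx;

    /* an "invalid" startpoint defaults to the slot's confirmed_flush */
    ctx.skip_commits_before =
        (startpoint == TOY_INVALID_LSN) ? slot->confirmed_flush : startpoint;

    /* reading always starts at restart_lsn, regardless of the startpoint */
    ctx.read_from = slot->restart_lsn;

    return ctx;
}

int
main(void)
{
    ToySlot     slot = {.restart_lsn = 100, .confirmed_flush = 250};
    ToyContext  a = toy_create_context(&slot, TOY_INVALID_LSN);
    ToyContext  b = toy_create_context(&slot, slot.confirmed_flush);

    /* both variants read from 100 and filter commits before 250 */
    printf("implicit: read_from=%lu filter=%lu\n", a.read_from, a.skip_commits_before);
    printf("explicit: read_from=%lu filter=%lu\n", b.read_from, b.skip_commits_before);
    return 0;
}

Both calls produce the same context, which is the sense in which passing confirmed_flush explicitly would not change what WAL gets read or processed.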