Thread: Timeline following for logical slots
Hi all
Per discussion on the failover slots thread (https://commitfest.postgresql.org/9/488/) I'm splitting timeline following for logical slots into its own separate patch.
The attached patch fixes an issue I found while testing the prior revision: it would read WAL from WAL segments on the old timeline up until the timeline switch boundary, but this doesn't work if the last WAL segment on the timeline has been renamed to append the .partial suffix.
Instead it's necessary to eagerly switch to reading the WAL segment from the newest timeline on that segment. We'll still be reading WAL records from the correct timeline since the partial WAL segment from the old timeline gets copied to a new name on promotion, but we're reading it from the newest copy of that segment, which is either complete and archived or is still being written to by the current timeline.
For example, if the old master was on timeline 1 and writing to 000000010000000000000003 when it dies and we promote a streaming replica, the replica will copy 000000010000000000000003 to 000000020000000000000003 and append its recovery checkpoint to the copy. It renames 000000010000000000000003 to 000000010000000000000003.partial, which means the xlogreader won't find it. If we're reading the record at 0/3000000 then even though 0/3000000 is on timeline 1, we have to read it from the segment on timeline 2.
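To make the naming concrete, here's a tiny standalone sketch (not part of the patch; it just mimics how segment file names are composed, assuming the default 16MB segment size) showing that 0/3000000 maps to the same segment number on both timelines, with only the timeline part of the name changing:

#include <stdio.h>
#include <stdint.h>

#define XLOG_SEG_SIZE   (16 * 1024 * 1024)                 /* assumed default segment size */
#define SEGS_PER_XLOGID (UINT64_C(0x100000000) / XLOG_SEG_SIZE)

static void
print_segment_name(uint32_t tli, uint64_t lsn)
{
    uint64_t    segno = lsn / XLOG_SEG_SIZE;

    /* WAL segment names are TLI, "log" and "seg" as three 8-digit hex fields */
    printf("%08X%08X%08X\n", tli,
           (uint32_t) (segno / SEGS_PER_XLOGID),
           (uint32_t) (segno % SEGS_PER_XLOGID));
}

int
main(void)
{
    print_segment_name(1, UINT64_C(0x3000000)); /* 000000010000000000000003, renamed to .partial */
    print_segment_name(2, UINT64_C(0x3000000)); /* 000000020000000000000003, the copy we must read */
    return 0;
}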
Fun, eh?
(I'm going to write a README.timelines to document some of this stuff soon, since it has some pretty hairy corners and some of the code paths are a bit special.)
I've written some initial TAP tests for timeline following that exploit the fact that replication slots are preserved on a replica if the replica is created with a filesystem level copy that includes pg_replslot, rather than using pg_basebackup. They are not included here because they rely on TAP support improvements (filesystem backup support, psql enhancements, etc) that I'll submit separately, but they're how I found the .partial issue.
A subsequent patch can add testing of slot creation and advance on replicas using a C test extension to prove that this approach can be used to achieve practical logical failover for extensions.
I think this is ready to go as-is.
I don't want to hold it up waiting for test framework enhancements unless those can be committed fairly easily because I think we need this in 9.6 and the tests demonstrate that it works when run separately.
See for a git tree containing the timeline following patch, TAP enhancements and the tests for timeline following.
Attachment
Here are the tests.
They depend on https://commitfest.postgresql.org/9/569/#
I've also rebased the git tree for failover slots on top of the tree for TAP test improvements.
Attachment
On 1 March 2016 at 21:00, Craig Ringer <craig@2ndquadrant.com> wrote:
Hi all

Per discussion on the failover slots thread (https://commitfest.postgresql.org/9/488/) I'm splitting timeline following for logical slots into its own separate patch.
I've updated the logical decoding timeline following patch to fix a bug found during test development, related to how Pg renames the last WAL segment on the old timeline to add a .partial suffix on promotion. The xlogreader must switch to reading the newest-timeline version of a given segment eagerly, from the first page of the segment, since that copy is the only one guaranteed to actually exist.
I'd really appreciate some review of the logic there by people who know timelines well and preferably know the xlogreader. It's really just one function and 2/3 comments; the code is simple but the reasoning leading to it is not.
I've also attached an updated version of the tests posted a few days ago. The tests depend on the remaining patches from the TAP enhancements tree so it's easiest to just get the whole tree from https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following (subject to regular rebases and force pushes, do not use as a base).
The tests now include a test module that exposes some slots guts to SQL to allow the client to sync slot state from master to replica(s) without needing failover slots and the use of extra WAL as transport. It's very much for-testing-only.
The new test module is used by a second round of tests to demonstrate the practicality of failover of a logical replication client to a physical replica using a base backup taken by pg_basebackup and without the presence of failover slots. I won't pretend it's pretty.
This proves that the approach works barring unforeseen showstoppers. It also proves it's pretty ugly - failover slots provide a much, MUCH simpler and safer way for clients to achieve this with way less custom code needed by each client to sync slot state.
I've got a bit of cleanup to do in the test suite and a few more tests to write for cases where the slot on the replica is allowed to fall behind the slot on the master but this is mostly waiting on the remaining two TAP test patches before it can be evaluated for possible push.
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 04/03/16 17:08, Craig Ringer wrote:
> I'd really appreciate some review of the logic there by people who know
> timelines well and preferably know the xlogreader. It's really just one
> function and 2/3 comments; the code is simple but the reasoning leading
> to it is not.
>

I think this will have to be work for committer at this point. I can't find any flaws in the logic myself so I unless somebody protests I am going to mark this as ready for committer.

> I've also attached an updated version of the tests posted a few days
> ago. The tests depend on the remaining patches from the TAP
> enhancements tree so it's easiest to just get the whole tree from
> https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
> (subject to regular rebases and force pushes, do not use as a base).
>
> The tests now include a test module that exposes some slots guts to SQL
> to allow the client to sync slot state from master to replica(s) without
> needing failover slots and the use of extra WAL as transport. It's very
> much for-testing-only.
>
> The new test module is used by a second round of tests to demonstrate
> the practicality of failover of a logical replication client to a
> physical replica using a base backup taken by pg_basebackup and without
> the presence of failover slots. I won't pretend it's pretty.
>

Well for testing purposes it's quite fine I think. The TAP framework enhancements needed for this are now in and it works correctly against current master.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek wrote:
> On 04/03/16 17:08, Craig Ringer wrote:
> >I'd really appreciate some review of the logic there by people who know
> >timelines well and preferably know the xlogreader. It's really just one
> >function and 2/3 comments; the code is simple but the reasoning leading
> >to it is not.
>
> I think this will have to be work for committer at this point. I can't find
> any flaws in the logic myself so I unless somebody protests I am going to
> mark this as ready for committer.

Great, thanks. I've studied this to the point where I'm confident that it makes sense, so I'm about to push it. I didn't change any logic, only updated comments here and there, both in the patch and in some preexisting code. I also changed the "List *timelineHistory" to be #ifndef FRONTEND, which seems cleaner than having it and insist that it couldn't be used.

Also, hopefully you don't have any future callers for the functions I marked static (I checked the failover slots series and couldn't see anything). I will push this early tomorrow.

One thing I'm not quite sure about is why this is said to apply to "logical slots" and not just to replication slots in general. In what sense does it not apply to physical replication slots?

(I noticed that we have this new line,

- randAccess = true;
+ randAccess = true; /* allow readPageTLI to go backwards */

in which now the comment is an identical copy of an identical line elsewhere; however, randAccess doesn't actually affect readPageTLI in any way, so AFAICS both comments are now wrong.)

> Well for testing purposes it's quite fine I think. The TAP framework
> enhancements needed for this are now in and it works correctly against
> current master.

Certainly the new src/test/recovery/t/006whatever.pl file is going to be part of the commit. What I'm not so sure about is src/test/modules/decoding_failover: is there any reason we don't just put the new functions in contrib/test_decoding?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
> Great, thanks. I've studied this to the point where I'm confident that
> it makes sense, so I'm about to push it. I didn't change any logic,
> only updated comments here and there, both in the patch and in some
> preexisting code. I also changed the "List *timelineHistory" to be
> #ifndef FRONTEND, which seems cleaner than having it and insist that it
> couldn't be used.

Could you perhaps wait till tomorrow? I'd like to do a pass over it.

Thanks

Andres
On 15 March 2016 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Petr Jelinek wrote:
> On 04/03/16 17:08, Craig Ringer wrote:
> >I'd really appreciate some review of the logic there by people who know
> >timelines well and preferably know the xlogreader. It's really just one
> >function and 2/3 comments; the code is simple but the reasoning leading
> >to it is not.
>
> I think this will have to be work for committer at this point. I can't find
> any flaws in the logic myself so I unless somebody protests I am going to
> mark this as ready for committer.
Great, thanks. I've studied this to the point where I'm confident that
it makes sense, so I'm about to push it.
Thanks, though I'm happy enough to wait for Andres's input as he requested.
Also, hopefully you don't have any future callers for the functions I
marked static (I checked the failover slots series and couldn't see
anything). I will push this early tomorrow.
I don't see it being overly useful for an extension, so that sounds fine.
One thing I'm not quite sure about is why this is said to apply to
"logical slots" and not just to replication slots in general. In what
sense does it not apply to physical replication slots?
It's only useful for logical slots currently because physical slot timeline following is done client side by the walreceiver. When the walsender hits the end of the timeline it sends CopyDone and returns to command mode. The client is expected to request the timeline history file, parse it, work out which timeline to request next and specify that in its next START_REPLICATION call.
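For comparison, the client-side exchange for physical replication looks roughly like this when expressed as libpq calls (a simplified sketch, not walreceiver's actual code; the command strings are real replication-protocol commands, but the function name, LSN and error handling are placeholders):

#include <stdio.h>
#include "libpq-fe.h"

/* Hypothetical helper: what a walreceiver-like client does after the server
 * ends COPY at the end of timeline old_tli on a replication connection. */
static void
follow_timeline_switch(PGconn *conn, unsigned old_tli)
{
    char        cmd[128];
    PGresult   *res;

    /* Fetch the new timeline's history file; the client parses it to find
     * the switch LSN and the next timeline to request. */
    snprintf(cmd, sizeof(cmd), "TIMELINE_HISTORY %u", old_tli + 1);
    res = PQexec(conn, cmd);
    PQclear(res);

    /* Resume streaming on the new timeline from the switch point
     * (hardcoded here for the sake of the example). */
    snprintf(cmd, sizeof(cmd), "START_REPLICATION 0/3000000 TIMELINE %u", old_tli + 1);
    res = PQexec(conn, cmd);    /* PGRES_COPY_BOTH on success */
    PQclear(res);
}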
None of that happens for logical slots. Andres was quite deliberate about keeping timelines out of the interface for logical slots and logical replication. I tried to retain that when adding the ability to follow physical failover, keeping things simple for the client. Given that logical replication slots are intended to be consumed by more than just a postgres downstream, it IMO makes sense to keep as much of the internal complexity hidden as possible, especially when dealing with something as surprisingly convoluted as timelines.
This does mean that we can't tell if we replayed past the timeline switch point on the old master before we switched to the new master, but I don't think sending a timeline history file is the solution there. I'd rather expose a walsender command and SQL function to let the client say, after connecting, "I last replayed from timeline <x> up to point <y>, if I start replaying from you will I get a consistent stream?". That can be done separately to this patch and IMO isn't crucial since clients can currently work it out, just with more difficulty.
Maybe physical rep can use the same logic, but I'm really not convinced. It already knows how to follow timelines client-side and handles that as part of walreceiver and archive recovery. Because recovery handles following the timeline switches and walreceiver just fetches the WAL as an alternative to archive fetches I'm not sure it makes sense; we still have to have all that logic for archive recovery from restore_command, so doing it differently for streaming just makes things more complicated for no clear benefit.
I certainly didn't want to address it in the same patch, much like I intentionally avoided trying to teach pg_xlogdump to be able to follow timelines. Partly to keep it simpler and focused, partly because it'd require timeline.c to be made frontend-clean which means no List, no elog, etc...
(I noticed that we have this new line,
- randAccess = true;
+ randAccess = true; /* allow readPageTLI to go backwards */
in which now the comment is an identical copy of an identical line
elsewhere; however, randAccess doesn't actually affect readPageTLI in
any way, so AFAICS both comments are now wrong.)
Yeah, that's completely bogus. I have a clear image in my head of randAccess being tested by ValidXLogPageHeader where it sanity checks state->latestPageTLI, the TLI of the *prior* page, against the TLI of the most recent page read, to make sure the TLI advances. But nope, I'm apparently imagining things 'cos it just isn't there, it just tests to see if we read a LSN later than the prior page instead.
> Well for testing purposes it's quite fine I think. The TAP framework
> enhancements needed for this are now in and it works correctly against
> current master.
Certainly the new src/test/recovery/t/006whatever.pl file is going to be
part of the commit. What I'm not so sure about is
src/test/modules/decoding_failover: is there any reason we don't just
put the new functions in contrib/test_decoding?
Because contrib/test_decoding isn't an extension. It is a sample logical decoding output plugin. I didn't want to mess it up by adding an extension control file, extension script and some extension functions. Sure, you *can* use the same shared library as an extension and as a logical decoding output plugin but I think that makes it rather more confusing as an example.
Also, because I don't want people reading test_decoding to find those functions and think it's a good idea to actually use them. It's hidden away in src/test/modules for a reason ;)
I have no particular objection to stripping the test module and the parts of the t/ script associated with it if you don't think it's worthwhile. I'd be happy to have it as an ongoing test showing that "hot" timeline following on a logical slot does work, but the tests that use only base backups are probably sufficient to detect obvious breakage/regressions.
If Michael Paquier's decoder_raw and receiver_raw were in contrib/ I'd add support for doing it in receiver_raw without exposing SQL functions. They aren't in core and can't form part of the core test suite, so I needed something minimalist to prove the concept, and figured it might as well serve as a regression test as well.
Hi,

On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
> diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
> index fcb0872..7b60b8f 100644
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -10,9 +10,11 @@
> *
> * NOTES
> * See xlogreader.h for more notes on this facility.
> + *
> + * This file is compiled as both front-end and backend code, so it
> + * may not use ereport, server-defined static variables, etc.
> *-------------------------------------------------------------------------
> */
> -

Huh?

> #include "postgres.h"
>
> #include "access/transam.h"
> @@ -116,6 +118,11 @@ XLogReaderAllocate(XLogPageReadCB pagereadfunc, void *private_data)
> return NULL;
> }
>
> +#ifndef FRONTEND
> + /* Will be loaded on first read */
> + state->timelineHistory = NIL;
> +#endif
> +
> return state;
> }
>
> @@ -135,6 +142,10 @@ XLogReaderFree(XLogReaderState *state)
> pfree(state->errormsg_buf);
> if (state->readRecordBuf)
> pfree(state->readRecordBuf);
> +#ifndef FRONTEND
> + if (state->timelineHistory)
> + list_free_deep(state->timelineHistory);
> +#endif

Hm. So we don't support timelines following for frontend code, although
it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.

> pfree(state->readBuf);
> pfree(state);
> }
> @@ -208,10 +219,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
>
> if (RecPtr == InvalidXLogRecPtr)
> {
> + /* No explicit start point; read the record after the one we just read */
> RecPtr = state->EndRecPtr;
>
> if (state->ReadRecPtr == InvalidXLogRecPtr)
> - randAccess = true;
> + randAccess = true; /* allow readPageTLI to go backwards */

randAccess is doing more than that, so I'm doubtful that comment is an
improvment.

> /*
> * RecPtr is pointing to end+1 of the previous WAL record. If we're
> @@ -223,6 +235,8 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
> else
> {
> /*
> + * Caller supplied a position to start at.
> + *
> * In this case, the passed-in record pointer should already be
> * pointing to a valid record starting position.
> */
> @@ -309,8 +323,10 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
> /* XXX: more validation should be done here */
> if (total_len < SizeOfXLogRecord)
> {
> - report_invalid_record(state, "invalid record length at %X/%X",
> - (uint32) (RecPtr >> 32), (uint32) RecPtr);
> + report_invalid_record(state,
> + "invalid record length at %X/%X: wanted %lu, got %u",
> + (uint32) (RecPtr >> 32), (uint32) RecPtr,
> + SizeOfXLogRecord, total_len);
> goto err;
> }
> gotheader = false;
> @@ -466,9 +482,7 @@ err:
> * Invalidate the xlog page we've cached. We might read from a different
> * source after failure.
> */
> - state->readSegNo = 0;
> - state->readOff = 0;
> - state->readLen = 0;
> + XLogReaderInvalCache(state);

I don't think that "cache" is the right way to describe this.

> #include <unistd.h>
>
> -#include "miscadmin.h"
> -

spurious change imo.

> /*
> - * TODO: This is duplicate code with pg_xlogdump, similar to walsender.c, but
> - * we currently don't have the infrastructure (elog!) to share it.
> + * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
> + * in timeline 'tli'.
> + *
> + * Will open, and keep open, one WAL segment stored in the static file
> + * descriptor 'sendFile'. This means if XLogRead is used once, there will
> + * always be one descriptor left open until the process ends, but never
> + * more than one.
> + *
> + * XXX This is very similar to pg_xlogdump's XLogDumpXLogRead and to XLogRead
> + * in walsender.c but for small differences (such as lack of elog() in
> + * frontend). Probably these should be merged at some point.
> */
> static void
> XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
> @@ -648,8 +657,12 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
> XLogRecPtr recptr;
> Size nbytes;
>
> + /*
> + * Cached state across calls.
> + */

One line?

> static int sendFile = -1;
> static XLogSegNo sendSegNo = 0;
> + static TimeLineID sendTLI = 0;
> static uint32 sendOff = 0;
>
> p = buf;
> @@ -664,11 +677,12 @@ XLogRead(char *buf, TimeLineID tli, XLogRecPtr startptr, Size count)
>
> startoff = recptr % XLogSegSize;
>
> - if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
> + /* Do we need to open a new xlog segment? */
> + if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
> + sendTLI != tli)
> {

s/open a new/open a different/? New imo has connotations that we don't
really want here.

> char path[MAXPGPATH];
>
> - /* Switch to another logfile segment */
> if (sendFile >= 0)
> close(sendFile);

E.g. you could just have moved the above comment.

> /* Need to seek in the file? */
> if (sendOff != startoff)
> {
> if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
> - {
> - char path[MAXPGPATH];
> -
> - XLogFilePath(path, tli, sendSegNo);
> -
> ereport(ERROR,
> (errcode_for_file_access(),
> errmsg("could not seek in log segment %s to offset %u: %m",
> - path, startoff)));
> - }
> + XLogFileNameP(tli, sendSegNo), startoff)));
> sendOff = startoff;
> }

Not a serious issue, more a general remark: I'm doubtful that going for
palloc in error situations is good practice. This will be allocated in
the current memory context; without access to the emergency error
reserves.

I'm also getting the feeling that the patch is bordering on doing some
relatively random cleanups mixed in with architectural changes. Makes
things a bit harder to review.

> +static void
> +XLogReadDetermineTimeline(XLogReaderState *state)
> +{
> + /* Read the history on first time through */
> + if (state->timelineHistory == NIL)
> + state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
> +
> + /*
> + * Are we reading the record immediately following the one we read last
> + * time? If not, then don't use the cached timeline info.
> + */
> + if (state->currRecPtr != state->EndRecPtr)
> + {
> + state->currTLI = 0;
> + state->currTLIValidUntil = InvalidXLogRecPtr;
> + }

Hm. So we grow essentially a second version of the last end position and
the randAccess stuff in XLogReadRecord().

> + if (state->currTLI == 0)
> + {
> + /*
> + * Something changed; work out what timeline this record is on. We
> + * might read it from the segment on this TLI or, if the segment is
> + * also contained by newer timelines, the copy from a newer TLI.
> + */
> + state->currTLI = tliOfPointInHistory(state->currRecPtr,
> + state->timelineHistory);
> +
> + /*
> + * Look for the most recent timeline that's on the same xlog segment
> + * as this record, since that's the only one we can assume is still
> + * readable.
> + */
> + while (state->currTLI != ThisTimeLineID &&
> + state->currTLIValidUntil == InvalidXLogRecPtr)
> + {
> + XLogRecPtr tliSwitch;
> + TimeLineID nextTLI;
> +
> + tliSwitch = tliSwitchPoint(state->currTLI, state->timelineHistory,
> + &nextTLI);
> +
> + /* round ValidUntil down to start of seg containing the switch */
> + state->currTLIValidUntil =
> + ((tliSwitch / XLogSegSize) * XLogSegSize);
> +
> + if (state->currRecPtr >= state->currTLIValidUntil)
> + {
> + /*
> + * The new currTLI ends on this WAL segment so check the next
> + * TLI to see if it's the last one on the segment.
> + *
> + * If that's the current TLI we'll stop searching.

I don't really understand how we're stopping searching here?

> + */
> + state->currTLI = nextTLI;
> + state->currTLIValidUntil = InvalidXLogRecPtr;
> + }
> + }
> +}

XLogReadDetermineTimeline() doesn't sit quite right with me, I do wonder
whether there's not a simpler way to write this.

> +/*
> + * XLogPageReadCB callback for reading local xlog files
> *
> * Public because it would likely be very helpful for someone writing another
> * output method outside walsender, e.g. in a bgworker.
> *
> - * TODO: The walsender has it's own version of this, but it relies on the
> + * TODO: The walsender has its own version of this, but it relies on the
> * walsender's latch being set whenever WAL is flushed. No such infrastructure
> * exists for normal backends, so we have to do a check/sleep/repeat style of
> * loop for now.
> @@ -754,46 +897,88 @@ int
> read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
> int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
> {
> - XLogRecPtr flushptr,
> + XLogRecPtr read_upto,
> loc;
> int count;
>
> loc = targetPagePtr + reqLen;
> +
> + /* Make sure enough xlog is available... */
> while (1)
> {
> /*
> - * TODO: we're going to have to do something more intelligent about
> - * timelines on standbys. Use readTimeLineHistory() and
> - * tliOfPointInHistory() to get the proper LSN? For now we'll catch
> - * that case earlier, but the code and TODO is left in here for when
> - * that changes.
> + * Check which timeline to get the record from.
> + *
> + * We have to do it each time through the loop because if we're in
> + * recovery as a cascading standby, the current timeline might've
> + * become historical.
> */
> - if (!RecoveryInProgress())
> + XLogReadDetermineTimeline(state);
> +
> + if (state->currTLI == ThisTimeLineID)
> {
> - *pageTLI = ThisTimeLineID;
> - flushptr = GetFlushRecPtr();
> + /*
> + * We're reading from the current timeline so we might have to
> + * wait for the desired record to be generated (or, for a standby,
> + * received & replayed)
> + */
> + if (!RecoveryInProgress())
> + {
> + *pageTLI = ThisTimeLineID;
> + read_upto = GetFlushRecPtr();
> + }
> + else
> + read_upto = GetXLogReplayRecPtr(pageTLI);
> +
> + if (loc <= read_upto)
> + break;
> +
> + CHECK_FOR_INTERRUPTS();
> + pg_usleep(1000L);
> }
> else
> - flushptr = GetXLogReplayRecPtr(pageTLI);
> + {
> + /*
> + * We're on a historical timeline, so limit reading to the switch
> + * point where we moved to the next timeline.
> + */
> + read_upto = state->currTLIValidUntil;

Hm. Is it ok to not check GetFlushRecPtr/GetXLogReplayRecPtr() here? If
so, how come?

> - if (loc <= flushptr)
> + /*
> + * Setting pageTLI to our wanted record's TLI is slightly wrong;
> + * the page might begin on an older timeline if it contains a
> + * timeline switch, since its xlog segment will have been copied
> + * from the prior timeline. This is pretty harmless though, as
> + * nothing cares so long as the timeline doesn't go backwards. We
> + * should read the page header instead; FIXME someday.
> + */
> + *pageTLI = state->currTLI;
> +
> + /* No need to wait on a historical timeline */
> break;
> -
> - CHECK_FOR_INTERRUPTS();
> - pg_usleep(1000L);
> + }
> }
>
> - /* more than one block available */
> - if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
> + if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
> + {
> + /*
> + * more than one block available; read only that block, have caller
> + * come back if they need more.
> + */
> count = XLOG_BLCKSZ;
> - /* not enough data there */
> - else if (targetPagePtr + reqLen > flushptr)
> + }
> + else if (targetPagePtr + reqLen > read_upto)
> + {
> + /* not enough data there */
> return -1;
> - /* part of the page available */
> + }
> else
> - count = flushptr - targetPagePtr;
> + {
> + /* enough bytes available to satisfy the request */
> + count = read_upto - targetPagePtr;
> + }
>
> - XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
> + XLogRead(cur_page, *pageTLI, targetPagePtr, count);

When are we reading less than a page? That should afaik never be required.

> + /*
> + * We start reading xlog from the restart lsn, even though in
> + * CreateDecodingContext we set the snapshot builder up using the
> + * slot's candidate_restart_lsn. This means we might read xlog we
> + * don't actually decode rows from, but the snapshot builder might
> + * need it to get to a consistent point. The point we start returning
> + * data to *users* at is the candidate restart lsn from the decoding
> + * context.
> + */

Uh? Where are we using candidate_restart_lsn that way? I seriously doubt
it is - candidate_restart_lsn is about a potential future restart_lsn,
which we can set once we get reception confirmation from the client.

> @@ -299,6 +312,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
> CHECK_FOR_INTERRUPTS();
> }
>
> + /* Make sure timeline lookups use the start of the next record */
> + startptr = ctx->reader->EndRecPtr;

Huh? startptr isn't used after this, so I'm not sure what this even
means?

> + /*
> + * The XLogReader will read a page past the valid end of WAL because
> + * it doesn't know about timelines. When we switch timelines and ask
> + * it for the first page on the new timeline it will think it has it
> + * cached, but it'll have the old partial page and say it can't find
> + * the next record. So flush the cache.
> + */
> + XLogReaderInvalCache(ctx->reader);
> +

dito.

> diff --git a/src/test/modules/decoding_failover/decoding_failover.c b/src/test/modules/decoding_failover/decoding_failover.c
> new file mode 100644
> index 0000000..669e6c4
> --- /dev/null
> +++ b/src/test/modules/decoding_failover/decoding_failover.c
> +
> +/*
> + * Create a new logical slot, with invalid LSN and xid, directly. This does not
> + * use the snapshot builder or logical decoding machinery. It's only intended
> + * for creating a slot on a replica that mirrors the state of a slot on an
> + * upstream master.
> + *
> + * You should immediately decoding_failover_advance_logical_slot(...) it
> + * after creation.
> + */

Uh. I doubt we want this, even if it's formally located in
src/test/modules. These comments make it appear not to be only intended
for that, and I have serious doubts about the validity of the concept as
is.

This seems to need some more polishing.

Greetings,

Andres Freund
Andres Freund wrote:
> Could you perhaps wait till tomorrow? I'd like to do a pass over it.

Sure thing. I'll wait for your review.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 15 March 2016 at 17:12, Andres Freund <andres@anarazel.de> wrote:
Hi
Thanks very much for the review.
This patch was split out from failover slots, which itself underwent quite a few revisions, so I'm really happy to have fresh eyes on it. Especially more experienced ones.
On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
> diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
> index fcb0872..7b60b8f 100644
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -10,9 +10,11 @@
> *
> * NOTES
> * See xlogreader.h for more notes on this facility.
> + *
> + * This file is compiled as both front-end and backend code, so it
> + * may not use ereport, server-defined static variables, etc.
> *-------------------------------------------------------------------------
> */
> -
Huh?
I'm not sure what the concern here is. xlogreader *is* compiled as frontend code - the file gets linked into the tree for pg_xlogdump and pg_rewind, at least.
I found that really confusing when working on it and thought it merited a comment.
> @@ -135,6 +142,10 @@ XLogReaderFree(XLogReaderState *state)
> pfree(state->errormsg_buf);
> if (state->readRecordBuf)
> pfree(state->readRecordBuf);
> +#ifndef FRONTEND
> + if (state->timelineHistory)
> + list_free_deep(state->timelineHistory);
> +#endif
Hm. So we don't support timelines following for frontend code, although
it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.
Yes, it would. I don't want to address that in the same patch though. It'd require making timeline.c frontend-clean, dealing with the absence of List on the frontend, etc, and I don't want to complicate this patch with that.
I've intentionally written the timeline logic so it can pretty easily be moved into xlogreader.c as a self-contained unit and used for those utilities once timeline.c can be compiled for frontend too.
> pfree(state->readBuf);
> pfree(state);
> }
> @@ -208,10 +219,11 @@ XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
>
> if (RecPtr == InvalidXLogRecPtr)
> {
> + /* No explicit start point; read the record after the one we just read */
> RecPtr = state->EndRecPtr;
>
> if (state->ReadRecPtr == InvalidXLogRecPtr)
> - randAccess = true;
> + randAccess = true; /* allow readPageTLI to go backwards */
randAccess is doing more than that, so I'm doubtful that comment is an
improvment.
Yeah, I have no idea what I was on about there, per response to Álvaro's post.
> @@ -466,9 +482,7 @@ err:
> * Invalidate the xlog page we've cached. We might read from a different
> * source after failure.
> */
> - state->readSegNo = 0;
> - state->readOff = 0;
> - state->readLen = 0;
> + XLogReaderInvalCache(state);
I don't think that "cache" is the right way to describe this.
Isn't that what it is? It reads a page, caches it, and reuses it for subsequent requests on the same page. The pre-existing comment even calls it a cache above.
I don't mind changing it, but don't have any better ideas.
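For reference, what I had in mind is just a thin wrapper over the three resets in that hunk, something like (a sketch of the idea based on the hunk above, not a verbatim copy of the committed code):

static void
XLogReaderInvalCache(XLogReaderState *state)
{
    /* Forget the cached page so the next read re-fetches it from disk */
    state->readSegNo = 0;
    state->readOff = 0;
    state->readLen = 0;
}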
> #include <unistd.h>
>
> -#include "miscadmin.h"
> -
spurious change imo.
Added in Álvaro's rev; it puts the header in the correct sort order, but I'm not sure it should be bundled with this patch.
> - if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo))
> + /* Do we need to open a new xlog segment? */
> + if (sendFile < 0 || !XLByteInSeg(recptr, sendSegNo) ||
> + sendTLI != tli)
> {
s/open a new/open a different/? New imo has connotations that we don't
really want here.
In my original patch this was:
/* Do we need to switch to a new xlog segment? */
but yeah, "open a different" is better than either.
> /* Need to seek in the file? */
> if (sendOff != startoff)
> {
> if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
> - {
> - char path[MAXPGPATH];
> -
> - XLogFilePath(path, tli, sendSegNo);
> -
> ereport(ERROR,
> (errcode_for_file_access(),
> errmsg("could not seek in log segment %s to offset %u: %m",
> - path, startoff)));
> - }
> + XLogFileNameP(tli, sendSegNo), startoff)));
> sendOff = startoff;
> }
Not a serious issue, more a general remark: I'm doubtful that going for
palloc in error situations is good practice. This will be allocated in
the current memory context; without access to the emergency error
reserves.
I was getting pretty confused, since I was sure I didn't write that. My memory's bad enough that sometimes I go "huh, ok, guess I did"... but in this case it wasn't in my patch so I think Álvaro's added it.
Agree that it's unrelated and probably better how it was.
> +static void
> +XLogReadDetermineTimeline(XLogReaderState *state)
> +{
> + /* Read the history on first time through */
> + if (state->timelineHistory == NIL)
> + state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
> +
> + /*
> + * Are we reading the record immediately following the one we read last
> + * time? If not, then don't use the cached timeline info.
> + */
> + if (state->currRecPtr != state->EndRecPtr)
> + {
> + state->currTLI = 0;
> + state->currTLIValidUntil = InvalidXLogRecPtr;
> + }
Hm. So we grow essentially a second version of the last end position and
the randAccess stuff in XLogReadRecord().
Yeah, and in an earlier version of this patch that's where it lived.
I landed up moving it into its own self-contained function mainly because the xlog read callback needs to be able to re-read the timeline info and re-check the timeline end if the current timeline becomes historical while it's waiting for new WAL, which could happen if it's a cascading standby and its parent got promoted. Hence the call to XLogReadDetermineTimeline from within read_local_xlog_page(...). That can't actually happen right now since logical decoding can't be done on a standby yet, but I didn't want to introduce new problems to fix for that when adding timeline following support.
XLogReadRecord(...) could clear this info instead of doing it in XLogReadDetermineTimeline(...), but I thought it made more sense to keep use of that state local to XLogReadDetermineTimeline(...) rather than scatter it.
> + if (state->currTLI == 0)
> + {
> + /*
> + * Something changed; work out what timeline this record is on. We
> + * might read it from the segment on this TLI or, if the segment is
> + * also contained by newer timelines, the copy from a newer TLI.
> + */
> + state->currTLI = tliOfPointInHistory(state->currRecPtr,
> + state->timelineHistory);
> +
> + /*
> + * Look for the most recent timeline that's on the same xlog segment
> + * as this record, since that's the only one we can assume is still
> + * readable.
> + */
> + while (state->currTLI != ThisTimeLineID &&
> + state->currTLIValidUntil == InvalidXLogRecPtr)
> + {
> + XLogRecPtr tliSwitch;
> + TimeLineID nextTLI;
> +
> + tliSwitch = tliSwitchPoint(state->currTLI, state->timelineHistory,
> + &nextTLI);
> +
> + /* round ValidUntil down to start of seg containing the switch */
> + state->currTLIValidUntil =
> + ((tliSwitch / XLogSegSize) * XLogSegSize);
> +
> + if (state->currRecPtr >= state->currTLIValidUntil)
> + {
> + /*
> + * The new currTLI ends on this WAL segment so check the next
> + * TLI to see if it's the last one on the segment.
> + *
> + * If that's the current TLI we'll stop searching.
I don't really understand how we're stopping searching here?
What I'm doing here is looking for the newest timeline that exists in this WAL segment. Each time through, we advance currTLI if we find that there's a newer timeline on this segment, then look again. The loop ends when we find that the newest timeline on the segment is the current timeline being written/replayed by the server (in which case we know that segment must still exist and not have been renamed to .partial), or when we find that currTLI's validity extends past this segment.
There is some explanation for this in the comments at the start of the function since it affects the overall logic.
On reading it again I think that testing against state->currRecPtr is confusing here. It relies on the fact that currTLIValidUntil has been rounded down to the LSN of the start of the segment, so if the current record pointer is greater than it we know the timeline ends somewhere on this segment.
I guess it could be clearer (but verbose) to define an XLogSegNoFromLSN macro then
if ( XLogSegNoFromLSN(state->currRecPtr) >= XLogSegNoFromLSN(state->currTLIValidUntil))
but ... eh.
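Spelled out, that would be something like the following (XLogSegSize is the existing constant; XLogSegNoFromLSN is made up here, it's just the suggestion above):

/* Hypothetical macro: which segment does this LSN fall in? */
#define XLogSegNoFromLSN(lsn)   ((lsn) / XLogSegSize)

    /* ...and the test above would then read: */
    if (XLogSegNoFromLSN(state->currRecPtr) >=
        XLogSegNoFromLSN(state->currTLIValidUntil))
    {
        /* the timeline we're on ends somewhere in the segment being read */
    }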
The alternative seems to be to search the timeline history info directly to find the most recent timeline in the segment, starting from the current timeline being read.
> + */
> + state->currTLI = nextTLI;
> + state->currTLIValidUntil = InvalidXLogRecPtr;
> + }
> + }
> +}
XLogReadDetermineTimeline() doesn't sit quite right with me, I do wonder
whether there's not a simpler way to write this.
If there is I'd be quite happy. It took me some time to figure out the wrinkles here.
You can't do this where you'd expect, in XLogReadPage. Partly because it gets used by frontend code too, as discussed above. Partly because the xlogreader callback is responsible for waiting for new WAL rather than XLogReadPage, and as noted above that would cause issues when a cascading standby gets promoted while we're waiting for WAL. There's no way to pass the wanted timeline to the callback anyway, but if that were the only issue I'd have just added it (and did, in earlier versions of this patch).
Additionally, xlogreader and XLogReadPage is used by the physical replication walsender which neither needs nor wants timeline following - it expects to return failure when it runs out of data on the timeline instead, so the client can manage the timeline switch after the CopyBoth data stream ends.
It's not desirable to do the timeline switch for logical decoding at a higher level, before calling into the walsender, because then it has to be done separately in the SQL interface and walsender interface for logical decoding, similarly to how the client does it for physical replication. I actually did it this way in the proof of concept version and it works fine, it's just more intrusive and ugly and duplicates logic in multiple places.
There's more to it than that but I'm tired after a long day. I'll try to write that timelines readme after I review my notes so I can explain better.
As for the actual mechanism by which XLogReadDetermineTimeline operates, the main thing I wonder is whether it can be usefully simplified by having it directly scan the loaded timeline history and determine the last timeline for a segment. I'm not convinced there's much of a way around needing the rest of the logic.
> +/*
> + * XLogPageReadCB callback for reading local xlog files
> *
> * Public because it would likely be very helpful for someone writing another
> * output method outside walsender, e.g. in a bgworker.
> *
> - * TODO: The walsender has it's own version of this, but it relies on the
> + * TODO: The walsender has its own version of this, but it relies on the
> * walsender's latch being set whenever WAL is flushed. No such infrastructure
> * exists for normal backends, so we have to do a check/sleep/repeat style of
> * loop for now.
> @@ -754,46 +897,88 @@ int
> read_local_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr,
> int reqLen, XLogRecPtr targetRecPtr, char *cur_page, TimeLineID *pageTLI)
> {
> - XLogRecPtr flushptr,
> + XLogRecPtr read_upto,
> loc;
> int count;
>
> loc = targetPagePtr + reqLen;
> +
> + /* Make sure enough xlog is available... */
> while (1)
> {
> /*
> - * TODO: we're going to have to do something more intelligent about
> - * timelines on standbys. Use readTimeLineHistory() and
> - * tliOfPointInHistory() to get the proper LSN? For now we'll catch
> - * that case earlier, but the code and TODO is left in here for when
> - * that changes.
> + * Check which timeline to get the record from.
> + *
> + * We have to do it each time through the loop because if we're in
> + * recovery as a cascading standby, the current timeline might've
> + * become historical.
> */
> - if (!RecoveryInProgress())
> + XLogReadDetermineTimeline(state);
> +
> + if (state->currTLI == ThisTimeLineID)
> {
> - *pageTLI = ThisTimeLineID;
> - flushptr = GetFlushRecPtr();
> + /*
> + * We're reading from the current timeline so we might have to
> + * wait for the desired record to be generated (or, for a standby,
> + * received & replayed)
> + */
> + if (!RecoveryInProgress())
> + {
> + *pageTLI = ThisTimeLineID;
> + read_upto = GetFlushRecPtr();
> + }
> + else
> + read_upto = GetXLogReplayRecPtr(pageTLI);
> +
> + if (loc <= read_upto)
> + break;
> +
> + CHECK_FOR_INTERRUPTS();
> + pg_usleep(1000L);
> }
> else
> - flushptr = GetXLogReplayRecPtr(pageTLI);
> + {
> + /*
> + * We're on a historical timeline, so limit reading to the switch
> + * point where we moved to the next timeline.
> + */
> + read_upto = state->currTLIValidUntil;
Hm. Is it ok to not check GetFlushRecPtr/GetXLogReplayRecPtr() here? If
so, how come?
We're reading from a segment where the newest timeline on that segment is historical, i.e. the server has since replayed WAL from a newer timeline on a more recent segment. We therefore know that there must be a complete segment and we can't ever need to wait for new WAL.
An assertion that loc <= GetFlushRecPtr() wouldn't hurt.
> - /* more than one block available */
> - if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
> + if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
> + {
> + /*
> + * more than one block available; read only that block, have caller
> + * come back if they need more.
> + */
> count = XLOG_BLCKSZ;
> - /* not enough data there */
> - else if (targetPagePtr + reqLen > flushptr)
> + }
> + else if (targetPagePtr + reqLen > read_upto)
> + {
> + /* not enough data there */
> return -1;
> - /* part of the page available */
> + }
> else
> - count = flushptr - targetPagePtr;
> + {
> + /* enough bytes available to satisfy the request */
> + count = read_upto - targetPagePtr;
> + }
>
> - XLogRead(cur_page, *pageTLI, targetPagePtr, XLOG_BLCKSZ);
> + XLogRead(cur_page, *pageTLI, targetPagePtr, count);
When are we reading less than a page? That should afaik never be required.
Because pages are pre-allocated and zeroed it's safe to read from the last page past the point we've actually written WAL to. But in that case why do we bother determining 'count' in the first place, only to ignore it?
This is largely cosmetic TBH.
> + /*
> + * We start reading xlog from the restart lsn, even though in
> + * CreateDecodingContext we set the snapshot builder up using the
> + * slot's candidate_restart_lsn. This means we might read xlog we
> + * don't actually decode rows from, but the snapshot builder might
> + * need it to get to a consistent point. The point we start returning
> + * data to *users* at is the candidate restart lsn from the decoding
> + * context.
> + */
Uh? Where are we using candidate_restart_lsn that way?
That's a major brain-fart - it should've been confirmed_flush. I'm glad you caught that, since it took me some time to figure out why logical decoding was trying to read WAL I hadn't asked for, but adding an explanatory comment that's *wrong* sure doesn't help the next person.
I'm quite impressed and disturbed that I managed to write out as "candidate restart lsn" instead of "confirmed lsn" in full as well. (Now, where were my glasses, I swear I had them a minute ago... oh, they're on my face!)
The important bit is that where we start reading xlog is the restart_lsn of the slot, even though we won't return anything we decoded to the user until we reach confirmed_flush, which is looked up by CreateDecodingContext completely independently of what pg_logical_slot_get_changes_guts does, because we passed InvalidXLogRecPtr to CreateDecodingContext to tell it to automagically look up the confirmed_lsn. Maybe it'd be clearer to just pass the confirmed_lsn to CreateDecodingContext from pg_logical_slot_get_changes_guts. When I was debugging this stuff I found it pretty confusing that the LSN we started reading xlog at was different to the LSN the user wants decoded changes from, hence the comment.
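To spell out the flow I mean, here's a condensed sketch of pg_logical_slot_get_changes_guts() as I read it (argument lists abbreviated, the callback names are placeholders, and it is not verbatim source):

    /* Passing InvalidXLogRecPtr means CreateDecodingContext looks up the
     * slot's confirmed_flush itself and won't emit rows before that point */
    ctx = CreateDecodingContext(InvalidXLogRecPtr, options,
                                read_page_cb, prepare_write_cb, write_cb);

    /* ...but physical reading starts earlier, at restart_lsn, so the snapshot
     * builder can reach a consistent point before confirmed_flush */
    startptr = MyReplicationSlot->data.restart_lsn;

    while (/* more WAL available */ true)
    {
        record = XLogReadRecord(ctx->reader, startptr, &errm);
        startptr = InvalidXLogRecPtr;   /* subsequent reads continue from EndRecPtr */

        if (record != NULL)
            LogicalDecodingProcessRecord(ctx, ctx->reader);
        /* output plugin callbacks only produce rows once past confirmed_flush */
    }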
> @@ -299,6 +312,18 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
> CHECK_FOR_INTERRUPTS();
> }
>
> + /* Make sure timeline lookups use the start of the next record */
> + startptr = ctx->reader->EndRecPtr;
Huh? startptr isn't used after this, so I'm not sure what this even
means?
That snuck in from a prior revision where timeline following was done in pg_logical_slot_get_changes_guts, before I moved it down into the xlogreader callback so it could be shared between the walsender and SQL interfaces for logical decoding. It's bogus.
> + /*
> + * The XLogReader will read a page past the valid end of WAL because
> + * it doesn't know about timelines. When we switch timelines and ask
> + * it for the first page on the new timeline it will think it has it
> + * cached, but it'll have the old partial page and say it can't find
> + * the next record. So flush the cache.
> + */
> + XLogReaderInvalCache(ctx->reader);
> +
dito.
Ditto ;)
I really should've caught that. The problem is I've spent way, way too long staring at this code over too many revisions as I figured out what T.F. was going on with timeline following and slots. That's exactly why I really appreciate the fresh eyes.
> + * Create a new logical slot, with invalid LSN and xid, directly. This does not
> + * use the snapshot builder or logical decoding machinery. It's only intended
> + * for creating a slot on a replica that mirrors the state of a slot on an
> + * upstream master.
> + *
> + * You should immediately decoding_failover_advance_logical_slot(...) it
> + * after creation.
> + */
Uh. I doubt we want this, even if it's formally located in
src/test/modules. These comments make it appear not to be only intended
for that, and I have serious doubts about the validity of the concept as
is.
As I expressed to Álvaro, I don't really mind if this test module and the associated round of t/ tests that rely on it get cut, so long as the first round of t/ tests that use a filesystem level copy to clone the slot are kept so there's some test coverage for timeline following in logical slots.
I wrote it as a PoC to show that it worked, and the best way to do that was to write it as a new test suite component. Since it seemed useful to test logical decoding timeline following for a slot created after a base backup is made or cloned after pg_basebackup discarded the slots, I included it.
The approach isn't ideal. It's what I plan to do for pglogical if failover slots doesn't make the cut for 9.6, though. The main problem is that because the slot updates don't come through WAL they can lag behind the catalog_xmin the master's using to make decisions about vacuuming the catalogs. If the master advances catalog_xmin on a slot then starts writing the resulting vacuum activity to WAL, and if we replay that WAL *before* we see the slot advance and sync it to the replica, the replica's slot will have an incorrect catalog_xmin that does not reflect the actual on-disk state. Not great. However, if the client is keeping track of its confirmed_lsn (as it must, for correctness) then we know it'll never ask to replay anything older than what it already sent confirmation to the old master for, before failover. Since that's how the slot got advanced and got a new catalog_xmin. That means we won't be attempting to decode anything in the range where the recorded catalog_xmin on the promoted standby after failover would be a problem. The slot will advance to a sensible position when the client specifies the new start lsn.
At least, that's my reading of things, and that's what my tests have shown so far. We do start decoding from restart_lsn, so we'll be decoding WAL in the range where catalog_xmin is lying about what's really in the heap, but I don't see anywhere where we're looking at it. It's just collecting up transaction information at that point, right?
This same issue will occur if we attempt to do failover slots v2 for 9.7 using non-WAL transport to allow decoding from a replica with failover to cascading standbys, as you mentioned wanting earlier. We'd have to have a way to make sure the slot state on the replica was updated before we replayed past the point in WAL where that slot was updated to the same state on the master.
To that end I've thought about proposing a hook to let plugins intercept slot write-out. That way they can take note of the current server LSN and slot state, then make sure they sync that over to the replica before it replays WAL past that LSN. I was really hoping to get failover slots in place so this wouldn't be necessary but it's not looking too promising, and this would provide a somewhat safer way to capture slot advances than just peeking at pg_replication_slots but without having to get the full failover slots stuff in. Having this would at least eliminate the possibility of catalog_xmin being wrong on the replica.
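To illustrate, the hook could be as simple as something like this (entirely hypothetical; none of these names exist in core today):

typedef void (*slot_save_hook_type) (struct ReplicationSlot *slot,
                                     XLogRecPtr current_lsn);

/* hypothetical; NULL unless an extension installs one */
extern PGDLLIMPORT slot_save_hook_type slot_save_hook;

/*
 * Core would invoke it wherever slot state gets flushed to disk, e.g.:
 *
 *     if (slot_save_hook != NULL)
 *         slot_save_hook(slot, GetXLogInsertRecPtr());
 *
 * so an extension can ship "slot S looked like this as of LSN X" to a
 * replica before the replica replays past X.
 */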
Craig already replied to most points. I think we can sum up as "we've done the best we can with what we have, maybe somebody can figure how to do better in the future but for now this works."

Andres Freund wrote:
> On 2016-03-14 20:10:58 -0300, Alvaro Herrera wrote:
> > +#ifndef FRONTEND
> > + if (state->timelineHistory)
> > + list_free_deep(state->timelineHistory);
> > +#endif
>
> Hm. So we don't support timelines following for frontend code, although
> it'd be rather helpful for pg_xlogdump. And possibly pg_rewind.

Exactly. I had the same comment, but the argument that we would need to rewrite a lot of timeline.c won me. That can wait for a future patch; we certainly don't want to be rewriting these things during the last CF.

> > if (state->ReadRecPtr == InvalidXLogRecPtr)
> > - randAccess = true;
> > + randAccess = true; /* allow readPageTLI to go backwards */
>
> randAccess is doing more than that, so I'm doubtful that comment is an
> improvment.

As I said, it's only a copy of what's already a few lines above. I would be happier removing both, and instead adding an explanatory comment about that variable elsewhere.

> > #include <unistd.h>
> >
> > -#include "miscadmin.h"
> > -
>
> spurious change imo.

Yeah, in a way this is cleanup after previous patches that have messed up things somewhat. I wouldn't have posted it unless I was close to committing this ... I think it'd be better to push these cleanup changes separately.

> > /* Need to seek in the file? */
> > if (sendOff != startoff)
> > {
> > if (lseek(sendFile, (off_t) startoff, SEEK_SET) < 0)
> > - {
> > - char path[MAXPGPATH];
> > -
> > - XLogFilePath(path, tli, sendSegNo);
> > -
> > ereport(ERROR,
> > (errcode_for_file_access(),
> > errmsg("could not seek in log segment %s to offset %u: %m",
> > - path, startoff)));
> > - }
> > + XLogFileNameP(tli, sendSegNo), startoff)));
> > sendOff = startoff;
> > }
>
> Not a serious issue, more a general remark: I'm doubtful that going for
> palloc in error situations is good practice. This will be allocated in
> the current memory context; without access to the emergency error
> reserves.

Yeah, this code was copied from elsewhere in 422a55a68784f but there is already another version of this routine in walsender which is using XLogFileNameP instead of the stack-allocated one. I just made this one more similar to the walsender's (which is in backend environment just like this one) instead of continuing to use the frontend-limited version (which is where this one was copied from).

I'm having a hard time getting concerned about using palloc there, considering that every single call of XLogFileNameP occurs in an error message. If we want to reject this one, surely we should change all the others too.

> I'm also getting the feeling that the patch is bordering on doing some
> relatively random cleanups mixed in with architectural changes. Makes
> things a bit harder to review.

Yes.

> > +static void
> > +XLogReadDetermineTimeline(XLogReaderState *state)
> > +{
> > + /* Read the history on first time through */
> > + if (state->timelineHistory == NIL)
> > + state->timelineHistory = readTimeLineHistory(ThisTimeLineID);
> > +
> > + /*
> > + * Are we reading the record immediately following the one we read last
> > + * time? If not, then don't use the cached timeline info.
> > + */
> > + if (state->currRecPtr != state->EndRecPtr)
> > + {
> > + state->currTLI = 0;
> > + state->currTLIValidUntil = InvalidXLogRecPtr;
> > + }
>
> Hm. So we grow essentially a second version of the last end position and
> the randAccess stuff in XLogReadRecord().

Craig's argument against that seems reasonable to me.

> XLogReadDetermineTimeline() doesn't sit quite right with me, I do wonder
> whether there's not a simpler way to write this.

Feel free to suggest something :-)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-03-15 21:04:12 +0800, Craig Ringer wrote: > Thanks very much for the review. Are you planning to update the patch? - Andres
On 22 March 2016 at 19:48, Andres Freund <andres@anarazel.de> wrote:
On 2016-03-15 21:04:12 +0800, Craig Ringer wrote:
> Thanks very much for the review.
Are you planning to update the patch?
Yes. I just spoke to Álvaro earlier. I'll pick up his modified version of my patch, remove the unrelated cleanup stuff he added, amend the other issues and re-post today.
Then I'll try to get an updated pglogical_output up, now that I've knocked off a variety of urgent work elsewhere...
On 22 March 2016 at 19:48, Andres Freund <andres@anarazel.de> wrote:
On 2016-03-15 21:04:12 +0800, Craig Ringer wrote:
> Thanks very much for the review.
Are you planning to update the patch?
Updated patch attached.
It removes the questionable cleanups, fixes the randAccess comment (IMO), changes the "open a new xlog segment" wording and explains why we don't need GetFlushRecPtr/GetXLogReplayRecPtr on replay of a historical timeline. I removed the change that did less than a whole page at the XLogRead call and instead added a comment explaining it. Fixed the brainfart with candidate_restart_lsn, removed the outdated startptr comment and update and the unnecessary XLogReaderInvalCache call after it.
Also renamed src/test/modules/decoding_failover to src/test/modules/test_slot_timelines . Best name I could come up with that wasn't 1000chars long.
Passes main 'check', contrib/test_decoding check, src/test/modules/test_slot_timelines check and src/test/recovery check.
Attachment
Craig Ringer wrote:
> It removes the questionable cleanups, fixes the randAccess comment (IMO),

I pushed cleanup separately, including the addition of a few comments that were documenting the original code. I actually made a mistake in extracting those, so there's one comment that's actually bogus in that commit (the candidate_restart_lsn one); it's fixed in the second commit. Sorry about that.

I also introduced XLogReaderInvalReadState here, called XLogReaderInvalCache originally, since Andres objected to the "cache" terminology (though you will note that the word "cache" was used in the original code too.)

> changes the "open a new xlog segment" wording and explains why we don't
> need GetFlushRecPtr/GetXLogReplayRecPtr on replay of a historical timeline.
> I removed the change that did less than a whole page at the XLogRead call
> and instead added a comment explaining it. Fixed the brainfart with
> candidate_restart_lsn, removed the outdated startptr comment and update and
> the unnecessary XLogReaderInvalCache call after it.
>
> Also renamed src/test/modules/decoding_failover to
> src/test/modules/test_slot_timelines . Best name I could come up with that
> wasn't 1000chars long.
>
> Passes main 'check', contrib/test_decoding check,
> src/test/modules/test_slot_timelines check and src/test/recovery check.
>
> Available attached or at
> https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following

And pushed this too.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 31 March 2016 at 07:15, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Available attached or at
> https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
And pushed this too.
Much appreciated. Marked as committed at https://commitfest.postgresql.org/9/568/ .
This gives us an option for failover of logical replication in 9.6, even if it's a bit cumbersome and complex for the client, in case failover slots don't make the cut. And, of course, it's a pre-req for failover slots, which I'll rebase on top of it shortly.
Andres, I tried to address your comments as best I could. The main one that I think stayed open was about the loop that finds the last timeline on a segment. If you think that's better done by directly scanning the List* of timeline history entries I'm happy to prep a follow-up.
Hi,

On 2016-03-31 08:52:34 +0800, Craig Ringer wrote:
> On 31 March 2016 at 07:15, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > > Available attached or at
> > > https://github.com/2ndQuadrant/postgres/tree/dev/logical-decoding-timeline-following
> >
> > And pushed this too.
>
> Much appreciated. Marked as committed at
> https://commitfest.postgresql.org/9/568/ .
>
> This gives us an option for failover of logical replication in 9.6, even if
> it's a bit cumbersome and complex for the client, in case failover slots
> don't make the cut. And, of course, it's a pre-req for failover slots,
> which I'll rebase on top of it shortly.

FWIW, I think it's dangerous to use this that way. If people manipulate
slots that way we'll have hellish-to-debug issues. The test code needs a
big disclaimer to never ever be used in production, and we should
"disclaim any warranty" if somebody does that. To the point of not fixing
issues around it in back branches.

> Andres, I tried to address your comments as best I could. The main one that
> I think stayed open was about the loop that finds the last timeline on a
> segment. If you think that's better done by directly scanning the List* of
> timeline history entries I'm happy to prep a follow-up.

Have to look again.

+ * We start reading xlog from the restart lsn, even though in
+ * CreateDecodingContext we set the snapshot builder up using the
+ * slot's confirmed_flush. This means we might read xlog we don't
+ * actually decode rows from, but the snapshot builder might need it
+ * to get to a consistent point. The point we start returning data to
+ * *users* at is the confirmed_flush lsn set up in the decoding
+ * context.
+ */

still seems pretty misleading - and pretty much unrelated to the
callsite.

Greetings,

Andres Freund
On 31 March 2016 at 16:09, Andres Freund <andres@anarazel.de> wrote:
> This gives us an option for failover of logical replication in 9.6, even if
> it's a bit cumbersome and complex for the client, in case failover slots
> don't make the cut. And, of course, it's a pre-req for failover slots,
> which I'll rebase on top of it shortly.
FWIW, I think it's dangerous to use this that way. If people manipulate
slots that way we'll have hellish-to-debug issues.
I think it's pretty unsafe from SQL, to be sure.
Unless failover slots get in to 9.6 we'll need to do exactly that from internal C stuff in pglogical to support following physical failover, though. I'm not thrilled by that, especially since we have no way (in 9.6) to insert generic WAL records in xlog to allow the slot updates to accurately pace replay. I don't mean logical wal messages here, they wouldn't help, it'd actually be pluggable generic WAL that we'd need to sync slots fully consistently in the absence of failover slots.
However: It's safe for the slot state on the replica to be somewhat behind the local replay from the master, so the slot state on the replica is older than what it would've been at an equivalent time on the master... so long as it's not so far behind that the replica has replayed vacuum activity that removes rows still required by the slot's declared catalog_xmin. Even then it should actually be OK in practice because the failing-over client will always have replayed past that point on the master (otherwise catalog_xmin couldn't have advanced on the master), so it'll always ask to start replay at a point at or after the lsn where the catalog_xmin was advanced to its most recent value on the old master before failover. It's only safe for walsender based decoding though, since there's no way to specify the startpoint for sql-level decoding.
My only real worry there is whether invalidations are a concern - if internal replay starts at the restart_lsn and replays over part of history where the catalog entries have been vacuumed before it starts collecting rows for return to the decoding client at the requested decoding startpoint, can that cause problems with invalidation processing? I haven't got my head around what, exactly, logical decoding is doing with its invalidations stuff yet. I didn't find any problems in testing, but wasn't sure how to create the conditions under which failures might occur.
It's also OK for the slot state to be *ahead* of the replica's replay position, i.e. from the future. restart_lsn and catalog_xmin will be higher than they were on the master at the same point in time, but that doesn't matter at all, since the client will always be requesting to start from that point or later, having replayed up to there on the old master before failover.
It's even possible for the restart_lsn and catalog_xmin to be in the replica's future, i.e. the lsn > the current replay lsn and the catalog_xmin greater than the next xid. In that case we know the logical client replayed state from the old master that hasn't been applied to the replica by the time it's promoted. If this state of affairs exists when we promote a replica to a master we should really kill the slot, since the client has obviously replayed from the old master past the point of divergence with the promoted replica and there's potentially an unresolvable timeline fork. I'd like to add this if I can manage it, probably by WARNing a complaint then dropping the slot.
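To illustrate the idea, here's a rough sketch of what such a check at promotion might look like. This is purely an illustration, not code that exists anywhere; replayEndPtr and nextXid are stand-ins for the replica's final replay LSN and next XID at promotion.

    /* Hypothetical promotion-time check; none of this exists in core. */
    if (slot->data.restart_lsn > replayEndPtr ||
        TransactionIdFollows(slot->data.catalog_xmin, nextXid))
    {
        ereport(WARNING,
                (errmsg("dropping slot \"%s\": its state is from beyond the point of divergence with the old master",
                        NameStr(slot->data.name))));
        ReplicationSlotDrop(NameStr(slot->data.name));
    }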
Needless to say I'd really rather just have failover slots, where we know the slot will be consistent with the DB state and updated in sync with it. This is a fallback plan and I don't like it too much.
The test code needs
a big disclaimer to never ever be used in production, and we should
"disclaim any warranty" if somebody does that. To the point of not
fixing issues around it in back branches.
I agree. The README has just that, and there are comments above the individual functions like:
* Note that this is test harness code. You shouldn't expose slot internals
* to SQL like this for any real world usage. See the README.
> Andres, I tried to address your comments as best I could. The main one that
> I think stayed open was about the loop that finds the last timeline on a
> segment. If you think that's better done by directly scanning the List* of
> timeline history entries I'm happy to prep a follow-up.
Have to look again.
+ * We start reading xlog from the restart lsn, even though in
+ * CreateDecodingContext we set the snapshot builder up using the
+ * slot's confirmed_flush. This means we might read xlog we don't
+ * actually decode rows from, but the snapshot builder might need it
+ * to get to a consistent point. The point we start returning data to
+ * *users* at is the confirmed_flush lsn set up in the decoding
+ * context.
+ */
still seems pretty misleading - and pretty much unrelated to the
callsite.
I think it's pretty confusing that the function uses the slot's restart_lsn and never refers to confirmed_flush, which is where it actually starts decoding from as far as the user's concerned. It took me a while to figure out what was going on there.
I think the comment would be less necessary, and could be moved up to the CreateDecodingContext call, if we passed the slot's confirmed_flush explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr and having the CreateDecodingContext call infer the start point. That way what's going on would be a lot clearer when reading the code.
You're thinking from the perspective of someone who knows this code intimately. It's obvious to you that restart_lsn controls where the xlogreader starts processing and feeding into the snapshot builder. Similarly, it'll be obvious that the decoding context defaults to the confirmed_flush point, which is where we start returning decoded rows to users. This really will NOT be obvious to most readers though, and I think we could stand to make the logical decoding code easier to understand as it gets more and more use. I'm trying to help with that, and while I suffer from not knowing it as well as you do, that also means I can still see some of the things that might be confusing to newer readers. Then get them wrong when explaining them ;)
Speaking of which, did you see the proposed README I sent for src/backend/replication/logical ?
Hi,

On 2016-04-01 13:16:18 +0800, Craig Ringer wrote:
> I think it's pretty unsafe from SQL, to be sure.
>
> Unless failover slots get in to 9.6 we'll need to do exactly that from
> internal C stuff in pglogical to support following physical failover,

I know. And this makes me scared shitless. I'll refuse to debug anything
related to decoding, when this "feature" has been used. And I think
there'll be plenty; if you notice the issues that is.

The more I think about it, the more I think we should rip this out again,
until the actual feature is in. With proper review.

> However: It's safe for the slot state on the replica to be somewhat behind
> the local replay from the master, so the slot state on the replica is older
> than what it would've been at an equivalent time on the master... so long
> as it's not so far behind that the replica has replayed vacuum activity
> that removes rows still required by the slot's declared catalog_xmin. Even
> then it should actually be OK in practice because the failing-over client
> will always have replayed past that point on the master (otherwise
> catalog_xmin couldn't have advanced on the master), so it'll always ask to
> start replay at a point at or after the lsn where the catalog_xmin was
> advanced to its most recent value on the old master before failover. It's
> only safe for walsender based decoding though, since there's no way to
> specify the startpoint for sql-level decoding.

This whole logic falls entirely flat on its face by the fact that even
if you specify a startpoint, we read older WAL and process the records.
The startpoint filters every transaction that commits earlier than the
"startpoint". But with a long-running transaction you still will have old
changes being processed, which need the corresponding catalog state.

> My only real worry there is whether invalidations are a concern - if
> internal replay starts at the restart_lsn and replays over part of history
> where the catalog entries have been vacuumed before it starts collecting
> rows for return to the decoding client at the requested decoding
> startpoint, can that cause problems with invalidation processing?

Yes.

> It's also OK for the slot state to be *ahead* of the replica's replay
> position, i.e. from the future. restart_lsn and catalog_xmin will be higher
> than they were on the master at the same point in time, but that doesn't
> matter at all, since the client will always be requesting to start from
> that point or later, having replayed up to there on the old master before
> failover.

I don't think that's true. See my point above about startpoint.

> > Andres, I tried to address your comments as best I could. The main one that
> > I think stayed open was about the loop that finds the last timeline on a
> > segment. If you think that's better done by directly scanning the List* of
> > timeline history entries I'm happy to prep a follow-up.
> >
> > Have to look again.
> >
> > + * We start reading xlog from the restart lsn, even though in
> > + * CreateDecodingContext we set the snapshot builder up using the
> > + * slot's confirmed_flush. This means we might read xlog we don't
> > + * actually decode rows from, but the snapshot builder might need it
> > + * to get to a consistent point. The point we start returning data to
> > + * *users* at is the confirmed_flush lsn set up in the decoding
> > + * context.
> > + */
> >
> > still seems pretty misleading - and pretty much unrelated to the
> > callsite.
>
> I think it's pretty confusing that the function uses the slot's
> restart_lsn and never refers to confirmed_flush which is where it actually
> starts decoding from as far as the user's concerned. It took me a while to
> figure out what was going on there.

That's a fundamental misunderstanding on your part (perhaps created by
imprecise docs). The startpoint isn't about where to start decoding. It's
about skipping the output plugin calls for a transaction if it *commits*
before the startpoint. We almost always will *decode* rows from before the
startpoint. And that's pretty much unavoidable: The consumer of a decoding
slot only really can do anything with commit LSNs wrt replay progress. But
for a remembered progress LSN there can be a lot of in-progress xacts
(about which said client doesn't know anything); and they start *earlier*
than the commit LSN of the just replayed xact. So we have to be able to
re-process their state and send it out.

> I think the comment would be less necessary, and could be moved up to the
> CreateDecodingContext call, if we passed the slot's confirmed_flush
> explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr
> and having the CreateDecodingContext call infer the start point. That way
> what's going on would be a lot clearer when reading the code.

How would that solve anything? We still need to process the old records,
hence the xlogreader needs to be instructed to read them. Hence
logicalfuncs.c needs to know about it.

> You're thinking from the perspective of someone who knows this code
> intimately.

Maybe. But that's not addressed by adding half-true comments to places
they only belong to peripherally. And not to all the relevant places,
since you've not added the same comment to StartLogicalReplication().

> Speaking of which, did you see the proposed README I sent for
> src/backend/replication/logical ?

I skimmed it. But given we have a CF full of patches, some submitted
over a year ago, it seems unfair to spend time on a patch submitted a
few days ago.

Greetings,

Andres Freund
Hi,

On 2016-04-01 08:46:01 +0200, Andres Freund wrote:
> That's a fundamental misunderstanding on your part (perhaps created by
> imprecise docs).
> > Speaking of which, did you see the proposed README I sent for
> > src/backend/replication/logical ?
>
> I skimmed it. But given we have a CF full of patches, some submitted
> over a year ago, it seems unfair to spend time on a patch submitted a
> few days ago.

For that purpos

WRT design readme, it might be interesting to look at 0009 in
http://archives.postgresql.org/message-id/20140127162006.GA25670%40awork2.anarazel.de

That's not up2date obviously, but it still might help.

Andres
On 1 April 2016 at 14:46, Andres Freund <andres@anarazel.de> wrote:
> You're thinking from the perspective of someone who knows this code
> intimately.
Maybe. But that's not addressed by adding half-true comments to places
they only belong to peripherally. And not to all the relevant places,
since you've not added the same comment to StartLogicalReplication().
I want to go over the rest of your reply in more detail, but regarding this and the original comment, I'm going by:
    if (start_lsn == InvalidXLogRecPtr)
    {
        /* continue from last position */
        start_lsn = slot->data.confirmed_flush;
    }

    ....

    ctx = StartupDecodingContext(output_plugin_options,
                                 start_lsn, InvalidTransactionId,
                                 read_page, prepare_write, do_write);

... in CreateDecodingContext(), which in turn does ...

    ctx->snapshot_builder =
        AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn);

... in StartupDecodingContext(...).
The snapshot builder controls when it returns SNAPBUILD_CONSISTENT, and therefore when we start returning decoded commits to the client. Right?
We pass InvalidXLogRecPtr as the start_lsn in pg_logical_slot_get_changes_guts(...).
So, when I wrote that:
/*
* We start reading xlog from the restart lsn, even though in
* CreateDecodingContext we set the snapshot builder up using the
* slot's confirmed_flush. This means we might read xlog we don't
* actually decode rows from, but the snapshot builder might need it
* to get to a consistent point. The point we start returning data to
* *users* at is the confirmed_flush lsn set up in the decoding
* context.
*/
I don't see what's wrong there at all.
We do "start reading xlog from the slot's restart_lsn". That's what gets passed to set up the xlogreader. That's where we read the first xlog records from.
In CreateDecodingContext we _do_ "set the snapshot builder up using the slot's confirmed_flush" [unless overridden by explicit argument to CreateDecodingContext, which isn't given here].
This is presumably what you're taking issue with:
* This means we might read xlog we don't
* actually decode rows from, but the snapshot builder might need it
* to get to a consistent point.
and would be better worded as something like:
This means we will read and decode xlog from before the point we actually start returning decoded commits to the client, but the snapshot builder may need this additional xlog to get to a consistent point.
right?
I think
The point we start returning data to
* *users* at is the confirmed_flush lsn set up in the decoding
* context.
is still right.
What I was proposing instead is, in pg_logical_slot_get_changes_guts, changing:
    ctx = CreateDecodingContext(InvalidXLogRecPtr,
                                options,
                                logical_read_local_xlog_page,
                                LogicalOutputPrepareWrite,
                                LogicalOutputWrite);
so that instead of InvalidXLogRecPtr we explicitly pass
MyReplicationSlot->data.confirmed_flush;
thus making it way easier to see what's going on without having to chase way deeper to figure out why data isn't returned from the restart_lsn. Where, y'know, you'd expect decoding to restart from... and where it does restart from, just not where it resumes sending output to the client from.
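For illustration, the call would then read roughly like this (same names as the existing call above; just a sketch of the proposed change, not committed code):

    ctx = CreateDecodingContext(MyReplicationSlot->data.confirmed_flush,
                                options,
                                logical_read_local_xlog_page,
                                LogicalOutputPrepareWrite,
                                LogicalOutputWrite);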
> Speaking of which, did you see the proposed README I sent for
> src/backend/replication/logical ?
I skimmed it. But given we have a CF full of patches, some submitted
over a year ago, it seems unfair to spend time on a patch submitted a
few days ago
Of course. I wasn't expecting to wedge this into the release, nor do I see any value in doing so. I just thought you'd probably rather know about it than not and if it was obviously wrong it'd be good to know. I've found it very hard to get my head around how logical decoding's parts fit together, and now that there are some other folks in 2ndQ getting involved it seemed like a good idea to start working on making that easier for the next guy.
Thanks for linking to your old docs. While a bit outdated as you noted, they'll be simple to update and would be really valuable to have in a README where they're discoverable.
On 1 April 2016 at 14:52, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2016-04-01 08:46:01 +0200, Andres Freund wrote:
> That's a fundamental misunderstanding on your part (perhaps created by
> imprecise docs).
> > Speaking of which, did you see the proposed README I sent for
> > src/backend/replication/logical ?
>
> I skimmed it. But given we have a CF full of patches, some submitted
> over a year ago, it seems unfair to spend time on a patch submitted a
> few days ago.
For that purpos
WRT design readme, it might be interesting to look at 0009 in
http://archives.postgresql.org/message-id/20140127162006.GA25670%40awork2.anarazel.de
That's not up2date obviously, but it still might help.
Thanks, I've been reading it and the posts it references.
Most of it was familiar by this point, but would've been a good reference earlier on. The snapshot builder docs in README.SNAPBUILD.txt are handy and help glue a few separate pieces together better for me, and the invalidations section was brief but informative.
The very last point looks interesting, but only really alludes to what's going on:
+== Restartable Decoding ==
+
+As we want to generate a consistent stream of changes we need to have the
+ability to start from a previously decoded location without waiting possibly
+very long to reach consistency. For that reason we dump the current visibility
+information to disk whenever we read an xl_running_xacts record.
I don't feel like I've grasped this properly yet. I think it's referring to the pg_logical/snapshots/ serialization, the use of which allows us to avoid doing extra work in SnapBuildFindSnapshot(...), but doesn't seem to be crucial for correct function. After all, decoding still restarts at the restart_lsn and feeds relevant xact info into the snapshot builder, accumulates invalidation information, etc.
After 9.6 I'd like to go through that, update it, and get it in as a README for logical decoding. It would've done me a lot of good when getting up to speed.
On 2016-04-04 14:24:52 +0800, Craig Ringer wrote:
> I don't feel like I've grasped this properly yet. I think it's referring to
> the pg_logical/snapshots/ serialization, the use of which allows us to
> avoid doing extra work in SnapBuildFindSnapshot(...), but doesn't seem to
> be crucial for correct function. After all, decoding still restarts at the
> restart_lsn and feeds relevant xact info into the snapshot builder,
> accumulates invalidation information, etc.

restart_lsn is set to the last known point where a) all changes for
ongoing transactions are available b) we can re-build visibility
information when we start reading from there.

As we essentially can only start determining visibility information
whenever processing a xl_running_xacts record. Building visibility
information means that there has to be a xl_running_xacts to start
from. To build full visibility information we also have to wait till we
have seen all in-progress transactions finish. So we dump visibility
information every now and then, so we can re-use the information we'd
already assembled.

Greetings,

Andres Freund
On 4 April 2016 at 14:30, Andres Freund <andres@anarazel.de> wrote:
On 2016-04-04 14:24:52 +0800, Craig Ringer wrote:
> I don't feel like I've grasped this properly yet. I think it's referring to
> the pg_logical/snapshots/ serialization, the use of which allows us to
> avoid doing extra work in SnapBuildFindSnapshot(...), but doesn't seem to
> be crucial for correct function. After all, decoding still restarts at the
> restart_lsn and feeds relevant xact info into the snapshot builder,
> accumulates invalidation information, etc.
restart_lsn is set to the last known point where a) all changes for
ongoing transactions are available b) we can re-build visibility
information when we start reading from there.
As we essentially can only start determining visibility information
whenever processing a xl_running_xacts record. Building visibility
information means that there has to be a xl_running_xacts to start
from. To build full visibility information we also have to wait till we
have seen all in-progress transactions finish. So we dump visibility
information every now and then, so we can re-use the information we'd
already assembled.
OK, makes sense. And we still resume decoding from restart_lsn, despite having that visibility information stashed, because we also have to rebuild the information on invalidations for running xacts, which is not stored persistently anywhere as decoding progresses. So for now at least it's an optimisation to store the visibility info, since we still have to go back and decode for invalidations anyway. Right?
On 2016-04-04 14:18:53 +0800, Craig Ringer wrote:
> I want to go over the rest of your reply in more detail, but regarding this
> and the original comment, I'm going by:
>
>     if (start_lsn == InvalidXLogRecPtr)
>     {
>         /* continue from last position */
>         start_lsn = slot->data.confirmed_flush;
>     }
>     ....
>     ctx = StartupDecodingContext(output_plugin_options,
>                                  start_lsn, InvalidTransactionId,
>                                  read_page, prepare_write, do_write);
>
> ... in CreateDecodingContext(), which in turn does ...
>
>     ctx->snapshot_builder =
>         AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn);
>
> ... in StartupDecodingContext(...).
>
> The snapshot builder controls when the snapshot builder returns
> SNAPBUILD_CONSISTENT, therefore when we start returning decoded commits to
> the client. Right?
>
> We pass InvalidXLogRecPtr as the start_lsn
> in pg_logical_slot_get_changes_guts(...).
>
> So, when I wrote that:
>
> /*
> * We start reading xlog from the restart lsn, even though in
> * CreateDecodingContext we set the snapshot builder up using the
> * slot's confirmed_flush. This means we might read xlog we don't
> * actually decode rows from, but the snapshot builder might need it
> * to get to a consistent point. The point we start returning data to
> * *users* at is the confirmed_flush lsn set up in the decoding
> * context.
> */
>
> I don't see what's wrong there at all.

They're independent. What the snapshot builder gets as the start position
isn't the same as where we start reading WAL.

> We do "start reading xlog from the slot's restart_lsn". That's what gets
> passed to set up the xlogreader. That's where we read the first xlog
> records from.
>
> In CreateDecodingContext we _do_ "set the snapshot builder up using the
> slot's confirmed_flush" [unless overridden by explicit argument to
> CreateDecodingContext, which isn't given here].

Yes. But again, that hasn't really got anything to do with where we read
WAL from. We always have to read older WAL than where confirmed_flush is
set to; to assemble all the changes from transactions that are in-progress
at the point of confirmed_flush.

> This is presumably what you're taking issue with:
>
> * This means we might read xlog we don't
> * actually decode rows from, but the snapshot builder might need it
> * to get to a consistent point.
>
> and would be better worded as something like:
>
> This means we will read and decode xlog from before the point we actually
> start returning decoded commits to the client, but the snapshot builder may
> need this additional xlog to get to a consistent point.
>
> right?

No. The snapshot builder really hasn't gotten that much to do with things.
The fundamental thing we have to do is re-assemble all in-progress
transactions (as in reorderbuffer.c, not snapbuild.c).

> I think
>
> The point we start returning data to
> * *users* at is the confirmed_flush lsn set up in the decoding
> * context.
>
> is still right.
>
> What I was proposing instead is, in pg_logical_slot_get_changes_guts,
> changing:
>
>     ctx = CreateDecodingContext(InvalidXLogRecPtr,
>                                 options,
>                                 logical_read_local_xlog_page,
>                                 LogicalOutputPrepareWrite,
>                                 LogicalOutputWrite);
>
> so that instead of InvalidXLogRecPtr we explicitly pass
>
>     MyReplicationSlot->data.confirmed_flush;
>
> thus making it way easier to see what's going on without having to chase
> way deeper to figure out why data isn't returned from the restart_lsn.
> Where, y'know, you'd expect decoding to restart from... and where it does
> restart from, just not where it resumes sending output to the client from.

I just don't buy this. It's absolutely fundamental to understand that the
commit LSN isn't the point where all the data of transactions is stored.

Andres
On 2016-04-04 14:36:29 +0800, Craig Ringer wrote:
> On 4 April 2016 at 14:30, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-04-04 14:24:52 +0800, Craig Ringer wrote:
> > > I don't feel like I've grasped this properly yet. I think it's referring to
> > > the pg_logical/snapshots/ serialization, the use of which allows us to
> > > avoid doing extra work in SnapBuildFindSnapshot(...), but doesn't seem to
> > > be crucial for correct function. After all, decoding still restarts at the
> > > restart_lsn and feeds relevant xact info into the snapshot builder,
> > > accumulates invalidation information, etc.
> >
> > restart_lsn is set to the last known point where a) all changes for
> > ongoing transactions are available b) we can re-build visibility
> > information when we start reading from there.
> >
> > As we essentially can only start determining visibility information
> > whenever processing a xl_running_xacts record. Building visibility
> > information means that there has to be a xl_running_xacts to start
> > from. To build full visibility information we also have to wait till we
> > have seen all in-progress transactions finish. So we dump visibility
> > information every now and then, so we can re-use the information we'd
> > already assembled.
>
> OK, makes sense. And still resume decoding from restart_lsn, despite having
> that visibility information stashed, because we also have to rebuild the
> information on invalidations for running xacts, which is not stored
> persistently anywhere as decoding progresses. So for now at least it's an
> optimisation to store the visibility info, since we still have to go back
> and decode for invalidations anyway. Right?

Not really no. The important point isn't invalidation or anything. It's
that we don't have the content & metadata of the transactions themselves.
Yes, we also do re-read invalidations, but that's just a side effect.
On 4 April 2016 at 14:43, Andres Freund <andres@anarazel.de> wrote:
> OK, makes sense. And still resume decoding from restart_lsn, despite having
> that visibility information stashed, because we also have to rebuild the
> information on invalidations for running xacts, which is not stored
> persistently anywhere as decoding progresses. So for now at least it's an
> optimisation to store the visibility info, since we still have to go back
> and decode for invalidations anyway. Right?

Not really no. The important point isn't invalidation or anything. It's
that we don't have the content & metadata of the transactions
themselves. Yes, we also do re-read invalidations, but that's just a
side effect.
I've been looking too hard for the details and somehow managed to completely miss the blindingly obvious right in front of me. I knew that the confirmed lsn was the threshold of confirmed commit LSNs, I knew about the reorder buffer needing to accumulate changes, I knew it wasn't persistent, etc. Yet I can't pretend I didn't overlook that when looking over the xlogreader vs decoding startup, per those comments.
I don't think a succinct comment there will be useful given that it'd really just do a better job of explaining what restart_lsn is and why we start decoding there. So I guess a more detailed comment on the slot struct (not on the field, at the start) would be good. Or that README.
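Something along these lines, perhaps (purely illustrative wording summarising the explanation above; not text from any patch):

    /*
     * A logical slot tracks two distinct positions.  data.restart_lsn is
     * where decoding must restart reading WAL so that every transaction
     * still in progress at data.confirmed_flush can be re-assembled in
     * the reorder buffer; data.confirmed_flush is only the point from
     * which decoded *commits* are returned to the client again.
     */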
Thanks for your patience.
--
On 1 April 2016 at 14:46, Andres Freund <andres@anarazel.de> wrote:
> However: It's safe for the slot state on the replica to be somewhat behind
> the local replay from the master, so the slot state on the replica is older
> than what it would've been at an equivalent time on the master... so long
> as it's not so far behind that the replica has replayed vacuum activity
> that removes rows still required by the slot's declared catalog_xmin. Even
> then it should actually be OK in practice because the failing-over client
> will always have replayed past that point on the master (otherwise
> catalog_xmin couldn't have advanced on the master), so it'll always ask to
> start replay at a point at or after the lsn where the catalog_xmin was
> advanced to its most recent value on the old master before failover. It's
> only safe for walsender based decoding though, since there's no way to
> specify the startpoint for sql-level decoding.
This whole logic falls entirely flat on its face by the fact that even
if you specify a startpoint, we read older WAL and process the
records. The startpoint filters every transaction that commits earlier
than the "startpoint". But with a long-running transaction you still
will have old changes being processed, which need the corresponding
catalog state.
Right. Having read over those draft docs I see what you mean.
TL;DR: the only way I can see to do this now is full-on failover slots, or something very similar that uses pluggable WAL + a slot save hook + moving CheckPointReplicationSlots to the start of a checkpoint while WAL is still writeable.
To rephrase per my understanding: The client only specifies the point it wants to start seeing decoded commits. Decoding starts from the slot's restart_lsn, and that's the point from which the accumulation of reorder buffer contents begins, the snapshot building process begins, and where accumulation of relcache invalidation information begins. At restart_lsn no xact that is to be emitted to the client may yet be in progress. Decoding, whether or not the xacts will be fed to the output plugin callbacks, requires access to the system catalogs. Therefore catalog_xmin reported by the slot must be >= the real effective catalog_xmin of the heap and valid at the restart_lsn, not just the confirmed flush point or the point the client specifies to resume fetching changes from.
On the original copy of the slot on the pre-failover master the restart_lsn would've been further ahead, as would the catalog_xmin. So catalog rows have been purged. When we decode from the old restart_lsn on the replica after failover we're not just doing excess work, we'd be accumulating probably-incorrect invalidation information for the xacts that we're decoding (but not sending to the client). Or just failing to find entries we expect to find in the caches and dying.
If the restart_lsn was advanced on the master there must be another safe decoding restart point after the one the old copy of the slot points to. But we don't know where, and we don't have any way to know that we *are* in such a post-failover situation so we can't just scan forward looking for it without looking up invalidation info for xacts we know we'll be throwing away.
So it's necessary to ensure that the slot's restart_lsn and catalog_xmin are advanced in a timely, consistent manner on the replica's copy of the slot at a point where no vacuum changes to the catalog that could remove needed tuples have been replayed.
The only way I can think of to do that really reliably right now, without full failover slots, is to use the newly committed pluggable WAL mechanism and add a hook to SaveSlotToPath() so slot info can be captured, injected in WAL, and replayed on the replica. It'd also be necessary to move CheckPointReplicationSlots() out of CheckPointGuts() to the start of a checkpoint/restartpoint when WAL writing is still permitted, like the failover slots patch does. Basically, failover slots as a plugin using a hook, without the additions to base backup commands and the backup label. I'm not convinced that's at all better than the full failover slots, and I'm not sure I can make a solid argument for the general purpose utility of the hook it'd need, but at least it'd avoid committing core to that approach.
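Just to illustrate the shape of the idea (a hypothetical declaration only; no such hook exists, and this isn't from any patch):

    /* Hypothetical hook fired from SaveSlotToPath(), illustration only. */
    typedef void (*slot_save_hook_type) (ReplicationSlot *slot);
    extern PGDLLIMPORT slot_save_hook_type slot_save_hook;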
I'd really hate 9.6 to go out with - still - no way to use logical decoding in a basic, bog-standard HA/failover environment. It overwhelmingly limits their utility and it's becoming a major drag on practical use of the feature. That's a difficulty given that the failover slots patch isn't especially trivial and you've shown that lazy sync of slot state is not sufficient.
> It's also OK for the slot state to be *ahead* of the replica's replay
> position, i.e. from the future. restart_lsn and catalog_xmin will be higher
> than they were on the master at the same point in time, but that doesn't
> matter at all, since the client will always be requesting to start from
> that point or later, having replayed up to there on the old master before
> failover.
I don't think that's true. See my point above about startpoint.
My explanation is certainly wrong. I suspect the result of doing it is still actually OK, though (though pretty useless, per below).
The restart_lsn from the newer copy of the slot is, as you said, a point where we know we can reconstruct visibility info. Also a point where no xact that we want to decode was already in-progress, since we have to accumulate those in the reorder buffer. The client will be requesting (via its local replication origin progress tracking or whatever) that logical decoding start emitting xact contents at a point that would've been reasonable with the newer restart_lsn from the copied slot, since otherwise it wouldn't have advanced on the master. There's no problem with missing catalog entries; we'll have extras we don't need until vacuum catches up, but that's all.
The client's requested start point should always be reasonable w.r.t. the latest copy of the master's slot, so long as the replica has replayed up to or past the point the client wants to start receiving decoded commits. Nor should any resources needed have been removed from the replica.
So long as the confirmed lsn (and whatever point the client requests replay resume from) is less than the timeline switchpoint to the current timeline it should be OK.
If I'm right about that then syncing slot state over such that the slot data is always the same as or newer than it was on the master at the same point in history should be OK. We could grab the master's xlog insert ptr, snapshot its slots, sync 'em over, and advance the recovery target lsn on the replica to allow it to replay up to that point in the master's history, repeating periodically. However, it'd be hideous and it'd be completely impossible to use for sync rep, so it's pretty useless even if it would work.
> I think the comment would be less necessary, and could be moved up to the
> CreateDecodingContext call, if we passed the slot's confirmed_flush
> explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr
> and having the CreateDecodingContext call infer the start point. That way
> what's going on would be a lot clearer when reading the code.
How would that solve anything? We still need to process the old records,
hence the xlogreader needs to be instructed to read them. Hence
logicalfuncs.c needs to know about it.
Huh?
All that'd do would be to make what already happens more visible. Instead of passing InvalidXLogRecPtr as start_lsn and having it transparently replaced with the confirmed_flush inside CreateDecodingContext, we'd pass slot->data.confirmed_flush directly.
IMO it'd just make it a little easier to follow what's going on.
On 2016-04-04 17:50:02 +0800, Craig Ringer wrote:
> To rephrase per my understanding: The client only specifies the point it
> wants to start seeing decoded commits. Decoding starts from the slot's
> restart_lsn, and that's the point from which the accumulation of reorder
> buffer contents begins, the snapshot building process begins, and where
> accumulation of relcache invalidation information begins. At restart_lsn no
> xact that is to be emitted to the client may yet be in progress. Decoding,

s/yet/already/

> whether or not the xacts will be fed to the output plugin callbacks,
> requires access to the system catalogs. Therefore catalog_xmin reported by
> the slot must be >= the real effective catalog_xmin of the heap and valid
> at the restart_lsn, not just the confirmed flush point or the point the
> client specifies to resume fetching changes from.

Hm. Maybe I'm misunderstanding you here, but doesn't it have to be <=?

> On the original copy of the slot on the pre-failover master the restart_lsn
> would've been further ahead, as would the catalog_xmin. So catalog rows
> have been purged.

+may

> So it's necessary to ensure that the slot's restart_lsn and catalog_xmin
> are advanced in a timely, consistent manner on the replica's copy of the
> slot at a point where no vacuum changes to the catalog that could remove
> needed tuples have been replayed.

Right.

> The only way I can think of to do that really reliably right now, without
> full failover slots, is to use the newly committed pluggable WAL mechanism
> and add a hook to SaveSlotToPath() so slot info can be captured, injected
> in WAL, and replayed on the replica.

I personally think the primary answer is to use separate slots on
different machines. Failover slots can be an extension to that at some
point, but I think they're a secondary goal.

> It'd also be necessary to move
> CheckPointReplicationSlots() out of CheckPointGuts() to the start of a
> checkpoint/restartpoint when WAL writing is still permitted, like the
> failover slots patch does.

Ugh. That makes me rather wary.

> Basically, failover slots as a plugin using a hook, without the
> additions to base backup commands and the backup label.

I'm going to be *VERY* hard to convince that adding a hook inside
checkpointing code is acceptable.

> I'd really hate 9.6 to go out with - still - no way to use logical decoding
> in a basic, bog-standard HA/failover environment. It overwhelmingly limits
> their utility and it's becoming a major drag on practical use of the
> feature. That's a difficulty given that the failover slots patch isn't
> especially trivial and you've shown that lazy sync of slot state is not
> sufficient.

I think the right way to do this is to focus on failover for logical
rep, with separate slots. The whole idea of integrating this with physical
rep imo makes this a *lot* more complex than necessary. Not all that
many people are going to want to do physical rep and logical rep.

> The restart_lsn from the newer copy of the slot is, as you said, a point we
> know we can reconstruct visibility info.

We can on the master. There's absolutely no guarantee that the associated
serialized snapshot is present on the standby.

Andres
On 4 April 2016 at 18:01, Andres Freund <andres@anarazel.de> wrote:
> The only way I can think of to do that really reliably right now, without
> full failover slots, is to use the newly committed pluggable WAL mechanism
> and add a hook to SaveSlotToPath() so slot info can be captured, injected
> in WAL, and replayed on the replica.
I personally think the primary answer is to use separate slots on
different machines. Failover slots can be an extension to that at some
point, but I think they're a secondary goal.
Assuming that here you mean separate slots on different machines replicating via physical rep:
We don't currently allow the creation of a logical slot on a standby. Nor replay from it, even to advance it without receiving the decoded changes. Both would be required for that to work, as well as extensions to the hot standby feedback mechanism to allow a standby to ask the master to pin its catalog_xmin if slots on the standby were further behind than that of the master.
I was chatting about that with Petr earlier. What we came up with was to require the standby to connect to the master using a replication slot that, while remaining a physical replication slot, has a catalog_xmin set and updated by the replica using extended standby progress messages. The slot's catalog_xmin the replica pushed up to the master would simply be the min(catalog_xmin) of all slots on the replica, i.e. procArray->replication_slot_catalog_xmin . Same with the slot xmin, if defined for any slot on the replica.
That makes sure that the catalog_xmin required for the standby's slots is preserved even if the standby isn't currently replaying from the master.
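To sketch that very roughly (illustration only: the mechanism doesn't exist; ProcArrayGetReplicationSlotXmin() is existing core code, but using it to feed an extended standby feedback message is the hypothetical part being discussed):

    /*
     * Hypothetical extension to hot standby feedback: besides the usual
     * xmin, report the oldest xmin/catalog_xmin needed by any replication
     * slot on this standby, so the master's physical slot can hold back
     * catalog vacuum on the standby's behalf.
     */
    TransactionId slot_xmin;
    TransactionId slot_catalog_xmin;

    ProcArrayGetReplicationSlotXmin(&slot_xmin, &slot_catalog_xmin);
    /* ... include slot_catalog_xmin in the standby status message ... */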
Handily this approach would give us cascading, support for intermediate servers, and the option of only having failover slots on some replicas not others. All things that were raised as concerns with failover slots.
However, clients would then have to know about the replica(s) of the master that were failover candidates and would have to send feedback to advance the client's slots on those nodes, not just the master. They'd have to be able to connect to the replicas too. Unless we added some mechanism for the master to lazily relay those feedback messages to replicas, anyway. Not a major roadblock, just a bit fiddlier for clients.
Consistency shouldn't be a problem so long as the slot created on the replica reaches SNAPBUILD_CONSISTENT (or there's enough pending WAL for it to do so) before failover is required.
I think it'd be a somewhat reasonable alternative to failover slots and it'd make it much more practical to decode from a replica. Which would be great. It'd be fiddlier for clients, but probably worth it to get rid of the limitations failover slots impose.
> It'd also be necessary to move
> CheckPointReplicationSlots() out of CheckPointGuts() to the start of a
> checkpoint/restartpoint when WAL writing is still permitted, like the
> failover slots patch does.
Ugh. That makes me rather wary.
Your comments say it's called in CheckPointGuts for convenience... and really there doesn't seem to be anything that makes a slot checkpoint especially tied to a "real" checkpoint.
> Basically, failover slots as a plugin using a hook, without the
> additions to base backup commands and the backup label.
I'm going to be *VERY* hard to convince that adding a hook inside
checkpointing code is acceptable.
Yeah... it's in ReplicationSlotSave, but it's still a slot checkpoint even if (per above) it ceases to also be in the middle of a full system checkpoint.
> I'd really hate 9.6 to go out with - still - no way to use logical decoding
> in a basic, bog-standard HA/failover environment. It overwhelmingly limits
> their utility and it's becoming a major drag on practical use of the
> feature. That's a difficulty given that the failover slots patch isn't
> especially trivial and you've shown that lazy sync of slot state is not
> sufficient.
I think the right way to do this is to focus on failover for logical
rep, with separate slots. The whole idea of integrating this with physical
rep imo makes this a *lot* more complex than necessary. Not all that
many people are going to want to do physical rep and logical rep.
If you're saying we should focus on failover between nodes that're themselves connected using logical replication rather than physical replication, I really have to strongly disagree.
TL;DR for book-length below: We're a long, long way from being able to deliver even vaguely decent logical rep based failover. Without that or support for logical decoding to survive physical failover we've got a great tool in logical decoding that can't be used effectively with most real-world systems.
I originally thought logical rep based failover was the way forward too and that mixing physical and logical rep didn't make sense.
The problem is that we're a very, very long way from there, wheras we can deliver failover of logical decoding clients to physical standbys with _relative_ ease and simplicity. Not actually easy or simple, but a lot closer.
To allow logical rep and failover to be a reasonable substitute for physical rep and failover we IMO *need*:
* Robust sequence decoding and replication. If you were following the later parts of that discussion you'll have seen how fun that's going to be, but it's the simplest of all of the problems.
* Logical decoding and sending of in-progress xacts, so the logical client can already be most of the way through receiving a big xact when it commits. Without this we have a huge lag spike whenever a big xact happens, since we must first finish decoding it in to a reorder buffer and can only then *begin* to send it to the client. During which time no later xacts may be decoded or replayed to the client. If you're running that rare thing, the perfect pure OLTP system, you won't care... but good luck finding one in the real world.
* Either parallel apply on the client side or at least buffering of in-progress xacts on the client side so they can be safely flushed to disk and confirmed, allowing receive to continue while replay is done on the client. Otherwise sync rep is completely impractical... and there's no shortage of systems out there that can't afford to lose any transactions. Or at least have some crucial transactions they can't lose.
* Robust, seamless DDL replication, so things don't just break randomly. This makes the other points above look nice and simple by comparison. Logical decoding of 2PC xacts with DDL would help here, as would the ability to transparently convert an xact into a prepare-xact on client commit and hold the client waiting while we replicate it, confirm the successful prepare on the replica, then commit prepared on the upstream.
* oh, and some way to handle replication of shared catalog changes like pg_authid, so the above DDL replication doesn't just randomly break if it happens to refer to a global object that doesn't exist on the downstream.
Physical rep *works*. Robustly. Reliably. With decent performance. It's proven. It supports sync rep. I'm confident telling people to use it.
I don't think there's any realistic way we're going to get there for logical rep in 9.6+n for n<2 unless a whole lot more people get on board and work on it. Even then.
Right now we can deliver logical failover for DBs that:
(a) only use OWNED BY sequences like SERIAL, and even then only with some hacks;
(b) don't do DDL, ever, or only do some limited DDL via direct admin commands where they can call some kind of helper function to queue and apply the DDL;
(c) don't do big transactions or don't care about unbounded lag;
(d) don't need synchronous replication or don't care about possibly large delays before commit is confirmed;
(e) only manage role creation (among other things) via very strict processes that can be absolutely guaranteed to run on all nodes
... which in my view isn't a great many databases.
Physical rep has *none* of those problems. (Sure, it has others, but we're used to them). It does lack any way for a logical client to follow failover though, meaning that right now it's really hard to use logical rep in conjunction with physical rep. Anything you use logical decoding for has to be able to cope with completely breaking and being resynced from a new snapshot after failover, which removes a lot of the advantages the reliable, ordered replay from logical decoding gives us in the first place.
Just one example: right now BDR doesn't like losing nodes without warning, as you know. We want to add support for doing recovery replay from the most-ahead peer of a lost node to help with that, though the conflict handling implications of that could be interesting. But even then replacing the lost node still hurts when you're working over a WAN. In a real world case I've dealt with, it took over 8 hours to bring up a replacement for a lost node over the WAN after the original node's host suffered an abrupt hardware failure. If we could've just had a physical sync standby for each BDR node running locally on the same LAN as the main node (and dealt with the fun issues around the use of the physical timeline in BDR's node identity keys) we could've just promoted it to replace the lost node with minimal disruption.
You could run two nodes on each site, but then you either double your WAN bandwidth use or have to support complex non-mesh topologies with logical failover-candidate standbys hanging off each node in the mesh.
That's just BDR though. You can't really use logical replication for things like collecting an audit change stream, feeding business message buses and integration systems, replicating to non-PostgreSQL databases, etc, if you can't point it at a HA upstream and expect it to still work after failover. Since we can't IMO deliver logical replication based HA in a reasonable timeframe, that means we should really have a way for logical slots to follow physical failover.
On 2016-04-04 22:59:41 +0800, Craig Ringer wrote:
> Assuming that here you mean separate slots on different machines
> replicating via physical rep:

No, I don't.

> We don't currently allow the creation of a logical slot on a standby. Nor
> replay from it, even to advance it without receiving the decoded
> changes.

Yes. I know.

> > I think the right way to do this is to focus on failover for logical
> > rep, with separate slots. The whole idea of integrating this with physical
> > rep imo makes this a *lot* more complex than necessary. Not all that
> > many people are going to want to do physical rep and logical rep.
>
> If you're saying we should focus on failover between nodes that're
> themselves connected using logical replication rather than physical
> replication, I really have to strongly disagree.
>
> TL;DR for book-length below: We're a long, long way from being able to
> deliver even vaguely decent logical rep based failover.

I don't buy that.

> * Robust sequence decoding and replication. If you were following the later
> parts of that discussion you'll have seen how fun that's going to be, but
> it's the simplest of all of the problems.

Unconvinced. People used londiste and slony for years without that, and
it's not even remotely at the top of the list of problems with either.

> * Logical decoding and sending of in-progress xacts, so the logical client
> can already be most of the way through receiving a big xact when it
> commits. Without this we have a huge lag spike whenever a big xact happens,
> since we must first finish decoding it in to a reorder buffer and can only
> then *begin* to send it to the client. During which time no later xacts may
> be decoded or replayed to the client. If you're running that rare thing,
> the perfect pure OLTP system, you won't care... but good luck finding one
> in the real world.

So? If you're using logical rep, you'll always have that.

> * Either parallel apply on the client side or at least buffering of
> in-progress xacts on the client side so they can be safely flushed to disk
> and confirmed, allowing receive to continue while replay is done on the
> client. Otherwise sync rep is completely impractical... and there's no
> shortage of systems out there that can't afford to lose any transactions.
> Or at least have some crucial transactions they can't lose.

That's more or less trivial to implement. In an extension.

> * Robust, seamless DDL replication, so things don't just break randomly.
> This makes the other points above look nice and simple by comparison.

We're talking about a system which involves logical decoding. Whether you
have failover via physical rep or not, doesn't have anything to do with
this point.

> Physical rep *works*. Robustly. Reliably. With decent performance. It's
> proven. It supports sync rep. I'm confident telling people to use it.

Sure? And nothing prevents you from using it.

> I don't think there's any realistic way we're going to get there for
> logical rep in 9.6+n for n<2 unless a whole lot more people get on board
> and work on it. Even then.

I think the primary problem here is that you're focusing on things that
just are not very interesting for the majority of users, and which thus
won't get very enthusiastic help. The way to make progress is to get
something basic in, and then iterate from there. Instead you're focussing
on the fringes; which nobody cares about, because the basics aren't there.

FWIW, I plan to aggressively work on in-core (9.7) logical rep starting in
a few weeks.
If we can coordinate on that end, I'm very happy, if not then not.

> Right now we can deliver logical failover for DBs that:

> (b) don't do DDL, ever, or only do some limited DDL via direct admin
> commands where they can call some kind of helper function to queue and
> apply the DDL;

> (c) don't do big transactions or don't care about unbounded lag;

> (d) don't need synchronous replication or don't care about possibly large
> delays before commit is confirmed;

> (e) only manage role creation (among other things) via very strict
> processes that can be absolutely guaranteed to run on all nodes

And just about nothing of that has to do with any of the recent patches
sent towards logical rep. I agree about the problems, but you should
attack them, instead of building way too complicated workarounds which
barely anybody will find interesting.

> ... which in my view isn't a great many databases.

Meh^2. Slony and londiste have these problems. And the reason they're not
used isn't primarily those, but that they're external and waaayyy too
complicated to use.

> Physical rep has *none* of those problems. (Sure, it has others, but we're
> used to them).

So?

-Andres
On Mon, Apr 4, 2016 at 10:59 AM, Craig Ringer <craig@2ndquadrant.com> wrote: > To allow logical rep and failover to be a reasonable substitute for physical > rep and failover IMO *need*: > > * Robust sequence decoding and replication. If you were following the later > parts of that discussion you will've seen how fun that's going to be, but > it's the simplest of all of the problems. > > * Logical decoding and sending of in-progress xacts, so the logical client > can already be most of the way through receiving a big xact when it commits. > Without this we have a huge lag spike whenever a big xact happens, since we > must first finish decoding it in to a reorder buffer and can only then > *begin* to send it to the client. During which time no later xacts may be > decoded or replayed to the client. If you're running that rare thing, the > perfect pure OLTP system, you won't care... but good luck finding one in the > real world. > > * Either parallel apply on the client side or at least buffering of > in-progress xacts on the client side so they can be safely flushed to disk > and confirmed, allowing receive to continue while replay is done on the > client. Otherwise sync rep is completely impractical... and there's no > shortage of systems out there that can't afford to lose any transactions. Or > at least have some crucial transactions they can't lose. > > * Robust, seamless DDL replication, so things don't just break randomly. > This makes the other points above look nice and simple by comparison. > Logical decoding of 2PC xacts with DDL would help here, as would the ability > to transparently convert an xact into a prepare-xact on client commit and > hold the client waiting while we replicate it, confirm the successful > prepare on the replica, then commit prepared on the upstream. > > * oh, and some way to handle replication of shared catalog changes like > pg_authid, so the above DDL replication doesn't just randomly break if it > happens to refer to a global object that doesn't exist on the downstream. In general, I think we'd be a lot better off if we got some kind of logical replication into core first and then worked on lifting these types of limitations afterwards. If I had to pick an order in which to do the things you list, I'd focus first on the one you list second: being able to stream and begin applying transactions before they've committed is a really big deal for large transactions, and lots of people have some large transactions. DDL replication is nice, but realistically, there are a lot of people who simply don't change their schema all that often, and who could (and might even prefer to) manage that process in other ways - e.g. change nodes one by one while they are off-line, then bring them on-line. I don't necessarily disagree with your statement that we'd need all of this stuff to make logical replication a substitute for physical replication as far as failover is concerned. But I don't think that's the most important goal, and even to the extent that it is the goal, I don't think we need to meet every need before we can consider ourselves to have met some needs. I don't think that we need every physical replication feature plus some before logical replication can start to be useful to PostgreSQL users generally. We do, however, need the functionality to be accessible to people who are using only the PostgreSQL core distribution. 
The thing that is going to get people excited about making logical replication better is getting to a point where they can use it at all - and that is not going to be true as long as you can't use it without having to download something from an external website.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04/04/16 22:00, Robert Haas wrote: > On Mon, Apr 4, 2016 at 10:59 AM, Craig Ringer <craig@2ndquadrant.com> wrote: >> To allow logical rep and failover to be a reasonable substitute for physical >> rep and failover IMO *need*: >> >> * Robust sequence decoding and replication. If you were following the later >> parts of that discussion you will've seen how fun that's going to be, but >> it's the simplest of all of the problems. >> >> * Logical decoding and sending of in-progress xacts, so the logical client >> can already be most of the way through receiving a big xact when it commits. >> Without this we have a huge lag spike whenever a big xact happens, since we >> must first finish decoding it in to a reorder buffer and can only then >> *begin* to send it to the client. During which time no later xacts may be >> decoded or replayed to the client. If you're running that rare thing, the >> perfect pure OLTP system, you won't care... but good luck finding one in the >> real world. >> >> * Either parallel apply on the client side or at least buffering of >> in-progress xacts on the client side so they can be safely flushed to disk >> and confirmed, allowing receive to continue while replay is done on the >> client. Otherwise sync rep is completely impractical... and there's no >> shortage of systems out there that can't afford to lose any transactions. Or >> at least have some crucial transactions they can't lose. >> >> * Robust, seamless DDL replication, so things don't just break randomly. >> This makes the other points above look nice and simple by comparison. >> Logical decoding of 2PC xacts with DDL would help here, as would the ability >> to transparently convert an xact into a prepare-xact on client commit and >> hold the client waiting while we replicate it, confirm the successful >> prepare on the replica, then commit prepared on the upstream. >> >> * oh, and some way to handle replication of shared catalog changes like >> pg_authid, so the above DDL replication doesn't just randomly break if it >> happens to refer to a global object that doesn't exist on the downstream. > > In general, I think we'd be a lot better off if we got some kind of > logical replication into core first and then worked on lifting these > types of limitations afterwards. If I had to pick an order in which > to do the things you list, I'd focus first on the one you list second: > being able to stream and begin applying transactions before they've > committed is a really big deal for large transactions, and lots of > people have some large transactions. DDL replication is nice, but > realistically, there are a lot of people who simply don't change their > schema all that often, and who could (and might even prefer to) manage > that process in other ways - e.g. change nodes one by one while they > are off-line, then bring them on-line. > > I don't necessarily disagree with your statement that we'd need all of > this stuff to make logical replication a substitute for physical > replication as far as failover is concerned. But I don't think that's > the most important goal, and even to the extent that it is the goal, I > don't think we need to meet every need before we can consider > ourselves to have met some needs. I don't think that we need every > physical replication feature plus some before logical replication can > start to be useful to PostgreSQL users generally. We do, however, > need the functionality to be accessible to people who are using only > the PostgreSQL core distribution. 
The thing that is going to get > people excited about making logical replication better is getting to a > point where they can use it at all - and that is not going to be true > as long as you can't use it without having to download something from > an external website. > I agree with you and I think Craig does as well. IMHO he merely pointed this out to show that we at the moment need physical replication as HA solution for logical decoding because currently there is no way to have reasonable continuation of service when the original master dies. And waiting until we have logical that's on par with physical for failover will take long time as it needs the above. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 04/04/16 17:15, Andres Freund wrote:
>
>> * Robust sequence decoding and replication. If you were following the later
>> parts of that discussion you will've seen how fun that's going to be, but
>> it's the simplest of all of the problems.
>
> Unconvinced. People used londiste and slony for years without that, and
> it's not even remotely at the top of the list of problems with either.
>
Londiste and Slony also support physical failover unlike logical decoding, which is the main point of this discussion, let's not forget that.
>
>> * Robust, seamless DDL replication, so things don't just break randomly.
>> This makes the other points above look nice and simple by comparison.
>
> We're talking about a system which involves logical decoding. Whether
> you have failover via physical rep or not, doesn't have anything to do
> with this point.
>
It is if you are insisting on using logical rep as a solution for failover.
>
>> I don't think there's any realistic way we're going to get there for
>> logical rep in 9.6+n for n<2 unless a whole lot more people get on board
>> and work on it. Even then.
>
> I think the primary problem here is that you're focusing on things that
> just are not very interesting for the majority of users, and which thus
> won't get very enthusiastic help. The way to make progress is to get
> something basic in, and then iterate from there. Instead you're
> focussing on the fringes; which nobody cares about, because the basics
> aren't there.
Craig is focusing on solving failover for logical slots, which is a very damn basic issue with logical decoding right now, no matter whether we have logical replication in core or not. I personally don't think it's ok by any means to not be able to continue logical decoding on failover even 2 versions after logical decoding was introduced.
I also don't buy your argument that it's unsafe to use timeline following on logical decoding on a replica. You can always keep the master from moving too far ahead by other means (even if you just use a dummy slot which is only used for this purpose, yes ugly I know).
If we got failover slots into 9.6 it would be better but that does not look realistic at this point. I don't think that the current design for failover slots is the best possible - I think failover slots should be created on the replica and send their status up to the master, which would then take them into account when calculating the oldest needed catalog xmin and lsn (a simple way of doing that would be to add this to the feedback protocol and let the physical slot keep the xmin/lsn as well), but that does not mean timeline following isn't a good thing on its own (not to mention that iterative development is a thing).
--
Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2016-04-05 05:53:53 +0200, Petr Jelinek wrote: > On 04/04/16 17:15, Andres Freund wrote: > > > >>* Robust sequence decoding and replication. If you were following the later > >>parts of that discussion you will've seen how fun that's going to be, but > >>it's the simplest of all of the problems. > > > >Unconvinced. People used londiste and slony for years without that, and > >it's not even remotely at the top of the list of problems with either. > > > > Londiste and Slony also support physical failover unlike logical decoding > which is the main point of this discussion, lets not forget that. Sure. But that level of failover isn't all that hard to implement. I just want to reiterate: I'm not against failover slots per-se, or against timeline following[1], or something like that. I think it's just getting the priorities backwards. Until there's basic features available, you're going to have a hard time fighting for more advanced features. [1]: That said, I don't agree with the approach chosen so far, but that's just a matter of discussion. > > > >>* Robust, seamless DDL replication, so things don't just break randomly. > >>This makes the other points above look nice and simple by comparison. > > > >We're talking about a system which involves logical decoding. Whether > >you have failover via physical rep or not, doesn't have anything to do > >with this point. > It is if you are insisting on using logical rep as solution for failover. How on earth does this follow? If your replicas don't do DDL propagation, doing so on the primary -> new primary, it surely isn't a prerequisite. I agree DDL rep is pretty crucial, but this argument here isn't that. > I also don't buy your argument that it's unsafe to use timeline following on > logical decoding on replica. I'm not saying that generally. I'm saying that you can't realistically do it safely with just the timeline following patch, as committed. > You can always keep master from moving too far > ahead by other means (even if you just use dummy slot which is only used for > this purpose, yes ugly I know). Yes, that somewhat works. And I think this is actually kinda the design "failover" slots should take instead. > If we got failover slots into 9.6 it would > be better but that does not look realistic at this point. I don't think that > current design for failover slots is best possible - I think failover slots > should be created on replica and send their status up to the master which > would then take them into account when calculating oldest needed catalog > xmin and lsn (simple way of doing that would be to add this to feedback > protocol and let physical slot to keep the xmin/lsn as well) Yes, that's not too far away from what I'm thinking of. If we do it right that also solves the important problems for decoding on a standby. > but that does not mean timeline following isn't good thing on it's own > (not to mention that iterative development is a thing). Yes, I'm not saying that. The reason it scares me is that it's not a particularly carefully reviewed patch, and the intended usage as described by Craig. We most definitely need timeline following, independent of failover slots. Most importantly for decoding on a standby. Greetings, Andres Freund
On Mon, Apr 4, 2016 at 1:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't necessarily disagree with your statement that we'd need all of
> this stuff to make logical replication a substitute for physical
> replication as far as failover is concerned. But I don't think that's
> the most important goal, and even to the extent that it is the goal, I
> don't think we need to meet every need before we can consider
> ourselves to have met some needs. I don't think that we need every
> physical replication feature plus some before logical replication can
> start to be useful to PostgreSQL users generally. We do, however,
> need the functionality to be accessible to people who are using only
> the PostgreSQL core distribution. The thing that is going to get
> people excited about making logical replication better is getting to a
> point where they can use it at all - and that is not going to be true
> as long as you can't use it without having to download something from
> an external website.
I rather agree that an in-core system that solved some of the basic problems would be a huge step forward, and would motivate people to work on the harder problems. It's surprising to me that we don't seem to be much closer to that than we were when 9.4 was released.
--
Peter Geoghegan
On 5 April 2016 at 04:00, Robert Haas <robertmhaas@gmail.com> wrote:
In general, I think we'd be a lot better off if we got some kind of
logical replication into core first and then worked on lifting these
types of limitations afterwards.
First, I'd like to remind everyone that logical decoding is useful for more than replication. You can consume the change stream for audit logging/archival, to feed into a different DBMS, etc etc. This is not just about replicating from one PostgreSQL to another, though to be sure that'll be the main use in the near term.
The Zalando guys at least are already using it for other things, and interest in the json support suggests they're not alone.
Right now if you're doing any kind of logical decoding from a master server that fails over to a standby the client just dies. The slot vanishes. You're stuffed. Gee, I hope you didn't need all that nice consistent ordering, because you're going to have to start from scratch and somehow reconcile the data in the new master with what you've already received ... and haven't.
We could certainly require clients to jump through all sorts of extra hoops to make sure they can follow replay over physical failover. Or we could say that you shouldn't actually expect to use logical decoding in real world environments where HA is a necessity until we get around to supporting realistic, usable logical-rep based failover in a few years. Or we could make it "just work" for the physical failover systems everyone already uses and relies on, just like sequences, indexes, and everything else in PostgreSQL that's expected to survive failover.
Would you tell someone to use unlogged tables and then do failover with Londiste? 'cos that's not too far from what's being talked about here. Though frankly, Londiste is more capable than the replication currently delivered with pglogical anyway, and it can follow physical failover too.
I don't understand why it seems to be considered OK for logical slots to just vanish on failover. The only other things I can think of where that's considered OK are unlogged tables (because that's the point and we have failover-safe ones too) and the old hash indexes nobody's quite willing to remove yet.
If I had to pick an order in which
to do the things you list, I'd focus first on the one you list second:
being able to stream and begin applying transactions before they've
committed is a really big deal for large transactions, and lots of
people have some large transactions.
I guess. We can manage DDL externally, however ugly that may be, but we can't do much about big xacts, and all but the very purest OLTP systems do some batch work sometimes.
It's still 9.7 material at very, very best, and things like parallel apply then stack on top of it.
DDL replication is nice, but
realistically, there are a lot of people who simply don't change their
schema all that often, and who could (and might even prefer to) manage
that process in other ways - e.g. change nodes one by one while they
are off-line, then bring them on-line.
Yeah, it's a bit more complex than that. Schema changes *must* be made at a specific point in replay. You can't generally just ALTER TABLE ... DROP COLUMN on the master then do the same thing on the replica. The replica probably still has un-replayed changes from the master that have the now-dropped column in their change stream, but now it can't apply them to the new table structure on the downstream. This particular case can be worked around, but column type changes, addition of non-null columns etc cannot.
You can only do DDL safely by either:
* Freezing all writes to replicated tables on the master and waiting until all logical slots are replayed up to date, then applying the DDL on each node; or
* Passing the desired DDL to a function that inserts it into a special replicated table on the master then executes it on the master. The replica monitors the replicated table for inserts and executes the same DDL on the replica as it replays the change stream. The insert serves as a barrier between old and new table structures.
pglogical supports either approach, but both have major foot-guns attached.
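To make the second of those approaches concrete, here is a minimal sketch of the queue-and-apply pattern. The table, function, and object names are made up for illustration (this is not pglogical's actual API), and it assumes the queue table itself is part of the replicated set:

    -- Hypothetical queue-and-apply helper for the provider side.
    -- The INSERT travels through the logical change stream, so a subscriber
    -- that watches this table sees the DDL at exactly the right point
    -- relative to the surrounding row changes.
    CREATE TABLE replicated_ddl (
        id     bigserial PRIMARY KEY,
        ddl    text NOT NULL,
        queued timestamptz NOT NULL DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION queue_and_execute_ddl(cmd text) RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO replicated_ddl (ddl) VALUES (cmd);  -- the barrier row
        EXECUTE cmd;                                    -- apply locally
    END;
    $$;

    -- Usage on the provider; the subscriber's apply process executes the
    -- same statement when it replays the barrier row:
    SELECT queue_and_execute_ddl('ALTER TABLE orders ADD COLUMN region text');

The point of routing everything through the helper is exactly the "barrier" behaviour described above: the downstream only sees the DDL once it has replayed every row change that predates it.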
Also, a growing portion of the world uses generated schemas and migrations and can't easily pipe that through some arbitrary function we provide. They may not be too keen on stopping writes to their entire database as an alternative either (and we don't provide a true read-only mode for them to use if they did want to). I know it's fashionable around here to just write them off as idiots for using ORMs and so on, but they're rather widespread idiots who're just as likely to be interested in geographically distributed selective replication, change stream extraction, etc. This has been a persistent problem with people who want to use BDR, too.
There are workarounds available, though, and we can't fix everything at once. So despite the problems I agree that DDL replication is less crucial.
I don't think that we need every
physical replication feature plus some before logical replication can
start to be useful to PostgreSQL users generally.
That much I agree with, though I think our views on the set of features we do need will differ.
Sequences, for example. We don't support those at all right now. We can kind-of support OWNED BY sequences client-side, or use a "snapshot and apply with a fudge factor" approach like Londiste does, so it's not the end of the world. But that's just one of many.
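For anyone unfamiliar with the Londiste-style workaround being referred to, the idea is roughly the following; the sequence name and the 1000-value margin are made up, and this is only a sketch of the pattern, not any existing tool's commands:

    -- On the provider, capture where each replicated sequence currently is:
    SELECT last_value FROM orders_id_seq;   -- suppose this returns 4712

    -- On the subscriber, apply that value plus an arbitrary safety margin
    -- (the "fudge factor"), so that after a switchover the subscriber can
    -- never hand out a value the old provider had already given away:
    SELECT setval('orders_id_seq', 4712 + 1000);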
OTOH, I'm sure we'll both agree that not replicating pg_largeobject isn't exactly something to shed tears over.
We do, however,
need the functionality to be accessible to people who are using only
the PostgreSQL core distribution. The thing that is going to get
people excited about making logical replication better is getting to a
point where they can use it at all - and that is not going to be true
as long as you can't use it without having to download something from
an external website.
I guess it's harder for me to see that because I've been working on it for so long.
Review and test responses have been pretty underwhelming for pglogical, and quite a bit seem to have boiled down to "this should live as an extension, we don't need it in core". It often feels like we can't win: if we seek to get it into core we're told it's not wanted/needed, but if we try to focus on solving issues in core to make it work better and let it live as an extension we're told we shouldn't bother until it's in core.
(I do specifically want to thank Andres Freund and Tomasz Rybak for detailed, constructive and helpful review on pglogical though).
Do you want to get a logical replication system into core that doesn't work properly with lots of the other features in PostgreSQL? That's historically not how we've done things here, and sometimes massive amounts of work have been required to make new feature X work with obscure/awkward existing feature Y. Inheritance comes strongly to mind as an exciting challenge for many new features to support properly. Arguments are usually raised that reference things like the mess created by MySQL's storage engines and how we're better off making everything work right once and for all.
OTOH, there's precedent there: inheritance still doesn't work with FKs or properly support UNIQUE constraints on parent relations, for example. And we're increasingly getting to the point where doing everything all at once is just unmanageable.
Still, I don't really want to block work on making logical decoding more real-world usable on inclusion of a logical replication system for PostgreSQL, especially one that'll be lucky to get in for 9.7 at the earliest.
On 5 April 2016 at 14:18, Peter Geoghegan <pg@heroku.com> wrote:
--
I rather agree that an in-core system that solved some of the basic
problems would be a huge step forward, and would motivate people to
work on the harder problems. It's surprising to me that we don't seem
to be much closer to that then we were when 9.4 was released.
I think we are, actually. For one thing a lot of groundwork has gone in, even if it doesn't seem especially highly visible.
Also, pglogical_output might've been able to get into 9.6 if we'd got it submittable a bit earlier, it'd had useful review in the first few weeks (months?), and I'd had more time to more responsively react to review comments by the time it did get review.
Even there, there are a bunch of places (like the relation cache/invalidations) that should really be moved "up" into logical decoding itself. Which, while architecturally superior, means yet more groundwork patches to write and get committed before a new revision of pglogical_output can go into 9.7.
I thought that getting pglogical_output into 9.6 might help us move closer to that goal, but folks seemed to think it was useless without the downstream client. I disagree, both on the principle of iterative development and because it's plenty useful for other things by itself, but didn't make any progress.
I don't think it helps that there's a real community split between people who want to see the actual logical rep engine live out-of-tree as an extension and those who think it's not interesting until it's in core.
On 2016-04-05 15:51:00 +0800, Craig Ringer wrote:
> Review and test responses have been pretty underwhelming for pglogical, and
> quite a bit seem to have boiled down to "this should live as an extension,
> we don't need it in core". It often feels like we can't win: if we seek to
> get it into core we're told it's not wanted/needed, but if we try to focus
> on solving issues in core to make it work better and let it live as an
> extension we're told we shouldn't bother until it's in core.
I think partially that's because it's hard to see the goal from those threads. Leading the intro email with "after applying use these three steps to replicate a database" or something might help.
I also want to add that so far, to my knowledge, the feedback hasn't fully been addressed. It's a bit hard to see progress at that pace.
> Do you want to get a logical replication system into core that doesn't work
> properly with lots of the other features in PostgreSQL? That's historically
> not how we've done things here, and sometimes massive amounts of work have
> been required to make new feature X work with obscure/awkward existing
> feature Y.
I think that's a strawman. We have done actual iterative development where the basic feature came at an early stage a lot of times. Most impressively FDWs. And even if we decide that feature X has to be supported, having an otherwise close-to-committable patch series goes a *LONG* way to motivate people.
> Still, I don't really want to block work on making logical decoding more
> real-world usable on inclusion of a logical replication system for
> PostgreSQL, especially one that'll be lucky to get in for 9.7 at the
> earliest.
My impression is that unless you *NOW* press very hard to get it into core, there's no way to get it into 9.7. Unless you start aggressively at some point, it'll never get in.
Greetings,
Andres Freund
On 05 Apr 2016, at 09:51, Craig Ringer <craig@2ndquadrant.com> wrote:
On 5 April 2016 at 04:00, Robert Haas <robertmhaas@gmail.com> wrote:
In general, I think we'd be a lot better off if we got some kind of
logical replication into core first and then worked on lifting these
types of limitations afterwards.
First, I'd like to remind everyone that logical decoding is useful for more than replication. You can consume the change stream for audit logging/archival, to feed into a different DBMS, etc etc. This is not just about replicating from one PostgreSQL to another, though to be sure that'll be the main use in the near term.
The Zalando guys at least are already using it for other things, and interest in the json support suggests they're not alone.
We are actually interested in both the streaming part and the logical replication provided at the moment by BDR. The reason we cannot use BDR efficiently is that there is no way to provide HA for one of the BDR nodes using physical replication, meaning that we have to run twice as many nodes, each of which needs to communicate with every other. Since the only use case we have is to run things in multiple data-centers with latencies of 100ms and above, running without physical HA limits us to only a couple of nodes, with a manual repair mechanism.
The other use case we have is streaming data from many, sometimes rather big (TBs) databases to consumers interested in storing subsets of that data in order to run analytical queries on them. It’s hard to imagine a robust system like this being built around a feature that doesn't support failover between the master and a physical replica, forcing us to stream a set of selected tables again at best, or the complete database at worst (did I mention the terabytes already?), and, potentially, to do some very funny tricks to merge the newly streamed data with what is already there.
Right now if you're doing any kind of logical decoding from a master server that fails over to a standby the client just dies. The slot vanishes. You're stuffed. Gee, I hope you didn't need all that nice consistent ordering, because you're going to have to start from scratch and somehow reconcile the data in the new master with what you've already received ... and haven’t.
+1
We could certainly require clients to jump through all sorts of extra hoops to make sure they can follow replay over physical failover. Or we could say that you shouldn't actually expect to use logical decoding in real world environments where HA is a necessity until we get around to supporting realistic, usable logical-rep based failover in a few years.
And before those few years are up, your typical big organization might decide to move on to some other data storage solution that promises HA right now, at the expense of SQL, strong consistency and transactions. It would be a bad choice, but one that developers (especially those looking to build “scalable micro services” with only a couple of CRUD queries) will be willing to make.
Or we could make it "just work" for the physical failover systems everyone already uses and relies on, just like sequences, indexes, and everything else in PostgreSQL that's expected to survive failover.
--
Oleksii
On 5 April 2016 at 14:09, Andres Freund <andres@anarazel.de> wrote:
On 2016-04-05 05:53:53 +0200, Petr Jelinek wrote:
> On 04/04/16 17:15, Andres Freund wrote:
> >
> >>* Robust sequence decoding and replication. If you were following the later
> >>parts of that discussion you will've seen how fun that's going to be, but
> >>it's the simplest of all of the problems.
> >
> >Unconvinced. People used londiste and slony for years without that, and
> >it's not even remotely at the top of the list of problems with either.
> >
>
> Londiste and Slony also support physical failover unlike logical decoding
> which is the main point of this discussion, lets not forget that.
Sure. But that level of failover isn't all that hard to implement.
Well, they get it mostly for free: they work at the SQL level and we have working physical failover, so they don't really notice the change.
Just like I've been trying to make possible for logical decoding and logical replication.
> If we got failover slots into 9.6 it would
> be better but that does not look realistic at this point. I don't think that
> current design for failover slots is best possible - I think failover slots
> should be created on replica and send their status up to the master which
> would then take them into account when calculating oldest needed catalog
> xmin and lsn (simple way of doing that would be to add this to feedback
> protocol and let physical slot to keep the xmin/lsn as well)
Yes, that's not too far away from what I'm thinking of. If we do it
right that also solves the important problems for decoding on a standby.
I'm pretty happy with this approach too.
It puts a bit more burden on the client, but I think that can be solved separately and later with a helper running on the replica.
I've never liked how failover slots can't work with a cascading replica when we get support for that. By contrast, this approach would actually help solve one of the problems needed to get replay from a replica working.
It's a pity it's a no-hoper for 9.6 though.
On Tue, Apr 5, 2016 at 3:51 AM, Craig Ringer <craig@2ndquadrant.com> wrote: > First, I'd like to remind everyone that logical decoding is useful for more > than replication. You can consume the change stream for audit > logging/archival, to feed into a different DBMS, etc etc. This is not just > about replicating from one PostgreSQL to another, though to be sure that'll > be the main use in the near term. > > The Zalando guys at least are already using it for other things, and > interest in the json support suggests they're not alone. I have not forgotten any of that, nor do I consider it unimportant. > Right now if you're doing any kind of logical deocoding from a master server > that fails over to a standby the client just dies. The slot vanishes. You're > stuffed. Gee, I hope you didn't need all that nice consistent ordering, > because you're going to have to start from scratch and somehow reconcile the > data in the new master with what you've already received ... and haven't. Right, but right now you probably *aren't* do doing any kind of logical decoding from a master server to a standby, because there's squat in the core distribution that could make use of that capability. So you never get as far as discovering this problem. > I don't understand why it seems to be considered OK for logical slots to > just vanish on failover. The only other things I can think of where that's > considered OK are unlogged tables (because that's the point and we have > failover-safe ones too) and the old hash indexes nobody's quite willing to > remove yet. First, it wasn't until 9.3 that physical standbys could follow timeline switches, but that doesn't mean that streaming replication was useless in 9.0 - 9.2, or that warm standby was useless in earlier versions. Logical decoding isn't useless without that capability either. Would it be nice if we did have that capability? Of course. Second, I'm not sure whether it was a good design decision to make logical slots a special kind of object that sit off to the side, neither configuration (like postgresql.conf) nor WAL-protected data (like pg_clog and the data files themselves), but it was certainly a very deliberate decision. I sort of expected them to be WAL-logged, but Andres argued (not unconvincingly) that we'd want to have slots on standbys, and making them WAL-logged would preclude that. So I don't really think that this is much like hash indexes, which just never got properly finished. It's more like unlogged tables, where a deliberate design decision to lose data was made in order to meet some other goal. >> realistically, there are a lot of people who simply don't change their >> schema all that often, and who could (and might even prefer to) manage >> that process in other ways - e.g. change nodes one by one while they >> are off-line, then bring them on-line. > > Yeah, it's a bit more complex than that. Schema changes *must* be made at a > specific point in replay. You can't generally just ALTER TABLE ... DROP > COLUMN on the master then do the same thing on the replica. The replica > probably still has un-replayed changes from the master that have the > now-dropped column in their change stream, but now it can't apply them to > the new table structure on the downstream. This particular case can be > worked around, but column type changes, addition of non-null columns etc > cannot. There are certainly problem cases, but I'm not sure they really arise that much in practice. 
If you retype a column from text to integer, you probably aren't storing anything in it other than integers, in which case it is not necessarily the case that you are locked into applying that change at a particular point in the change stream. If you are storing non-integers in a text column and relying on a USING clause to make them look like integers during the conversion, then, yes, that has to be done at a precise point in the change stream. But that's a pretty strange thing to do, and your application is most likely going to get confused anyway, so you are probably taking a maintenance window for the changeover anyway - in which case, there's not really a big problem. You can run the same change at the same time on both servers while nothing else is happening. > Also, a growing portion of the world uses generated schemas and migrations > and can't easily pipe that through some arbitrary function we provide. They > may not be too keen on stopping writes to their entire database as an > alternative either (and we don't provide a true read-only mode for them to > use if they did want to). I know it's fashionable around here to just write > them off as idiots for using ORMs and so on, but they're rather widespread > idiots who're just as likely to be interested in geographically distributed > selective replication, change stream extraction, etc. This has been a > persistent problem with people who want to use BDR, too. It's not my intent to write anyone off as an idiot. > Review and test responses have been pretty underwhelming for pglogical, and > quite a bit seem to have boiled down to "this should live as an extension, > we don't need it in core". It often feels like we can't win: if we seek to > get it into core we're told it's not wanted/needed, but if we try to focus > on solving issues in core to make it work better and let it live as an > extension we're told we shouldn't bother until it's in core. To be honest, I was shocked that pglogical and pglogical_output didn't go into this release. I assumed that you and other folks at 2ndQuadrant were going to make a big push to get that done. I did take a brief look at one of them - pglogical, I think - a week or two ago but there were unaddressed review comments that had been pending for months and there were a lot of fairly obvious things that needed to be done before it could be seriously considered as a core submission. Like, for example, rewriting the documentation heavily and making it look like the rest of our docs, and putting it in SGML format. The code seemed to need quite a bit of cleanup, too. Now, logical replication is a sufficiently important feature that if the only way it's going to get into core is if I work on it myself, or get other people at EnterpriseDB to do so, then I'll try to make that happen. But I was assuming that that was your/2ndQuadrant's patch, that you were going to get it in shape, and that me poking my nose into it wasn't going to be particularly welcome. Maybe I've misread the whole dynamic here. > Do you want to get a logical replication system into core that doesn't work > properly with lots of the other features in PostgreSQL? That's historically > not how we've done things here, and sometimes massive amounts of work have > been required to make new feature X work with obscure/awkward existing > feature Y. Inheritance comes strongly to mind as an exciting challenge for > many new features to support properly. 
Arguments are usually raised that > reference things like the mess created by MySQL's storage engines and how > we're better off making everything work right once and for all. So, let me be clear, here. You have every right to decide which feature you want to work on, and if, at the current time, failover slots is that thing and if, at the current time, in-core logical replication solution is not that thing, then that is entirely up to you. I don't have any right to tell you what to work on, and I'm not trying to tell you what to work on. I'm giving you my opinion on what I would work on if I were going to do some work related to logical replication - nothing more. That being said, if we get a logical replication system into core that doesn't do DDL, doesn't do multi-master, doesn't know squat about sequences, and rolls over and dies if a timeline switch happens, I would consider that a huge step forward and I think a lot of other people would, too. We have a long history of building features incrementally. Parallel query in 9.6 doesn't parallelize every query that can be parallelized, postgres_fdw has atrocious performance on simple queries like SELECT count(*) FROM ft, and autovacuum existed for a long time before it was turned on by default. Some people fail to get patches committed because they set their expectations too low, and we come back and say "well, that's nice, but we really need a little more here in order to consider this feature complete". But plenty of people also make the opposite mistake, of thinking that they have to fix everything at once in order to have anything worthwhile, and that's just as much of a trap as the other thing. Even with lots of limitations, built-in logical replication could still be good enough to be used long enough to manage a major version upgrade - people who have this problem today are using Slony, and I believe that even a really basic version of logical rep could have significant advantages over Slony in terms of both performance and ease of configuration. > Still, I don't really want to block work on making logical decoding more > real-world usable on inclusion of a logical replication system for > PostgreSQL, especially one that'll be lucky to get in for 9.7 at the > earliest. Well, as noted above, I think the main thing we need to figure out is who is going to do the work. The main thing blocking other people from working on it is the belief that we are just waiting for you or someone else at 2ndQuadrant to land the necessary patches, but if you aren't really working on that in a focused way, then somebody else can step up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6 April 2016 at 09:29, Robert Haas <robertmhaas@gmail.com> wrote:
> Right now if you're doing any kind of logical decoding from a master server
> that fails over to a standby the client just dies. The slot vanishes. You're
> stuffed. Gee, I hope you didn't need all that nice consistent ordering,
> because you're going to have to start from scratch and somehow reconcile the
> data in the new master with what you've already received ... and haven't.
Right, but right now you probably *aren't* doing any kind of
logical decoding from a master server to a standby, because there's
squat in the core distribution that could make use of that capability.
So you never get as far as discovering this problem.
OK, I'm starting to understand where this is going.
pglogical and pglogical_output are irrelevant so long as they haven't yet landed in core. There is no logical replication worth considering, therefore anything that improves logical replication is not worth the effort because there's really nothing to improve. Why bother with enhancements like timeline following when nothing uses it.
Right?
That's pretty much what I'm hearing from a number of others too. Obviously I don't wholly agree, but I do see where you're coming from. Especially since pglogical is in a sort of twilight zone where it's neither a public, open project with a proper website, public SCM, etc that encourages external contribution, nor is it strictly and only a submission to core.
It's not like, say, PostGIS, in that it's always been aimed at core (or stated to be) and is something that can be realistically considered for core. It doesn't have much of a life as an independent extension, and at least from the "outside" it doesn't look like there's much energy going into giving it one, right? Yet it's also had limited time and effort put into getting it in core in this release cycle.
While I don't like that, I have to say I understand it.
First, it wasn't until 9.3 that physical standbys could follow
timeline switches, but that doesn't mean that streaming replication
was useless in 9.0 - 9.2
I don't think there's much of a parallel there. It's closer to being like having _cascading_ replication without the ability to follow timeline switches via WAL archive replay or streaming.
Second, I'm not sure whether it was a good design decision to make
logical slots a special kind of object that sit off to the side,
neither configuration (like postgresql.conf) nor WAL-protected data
(like pg_clog and the data files themselves), but it was certainly a
very deliberate decision. I sort of expected them to be WAL-logged,
but Andres argued (not unconvincingly) that we'd want to have slots on
standbys, and making them WAL-logged would preclude that.
Yeah. I understand the reasons for that decision. Per an earlier reply I think we can avoid making them WAL-logged so they can be used on standbys and still achieve usable failover support on physical replicas.
It's more like unlogged tables, where a deliberate
design decision to lose data was made in order to meet some other
goal.
Yeah ... but it's like unlogged tables without having the option of logged tables.
Nice while it lasts but you're in bad trouble when it goes wrong.
(The parallel isn't perfect of course, since unlogged tables aren't crash-safe whereas slots are; they just aren't replicated.)
If you retype a column from text to integer,
you probably aren't storing anything in it other than integers, in
which case it is not necessarily the case that you are locked into
applying that change at a particular point in the change stream.
... if you're replicating in text format, not send/recv format, yeah. You get away with it in that particular case. But that *only* works where you're using text format and the types are convertible via text representation.
It's also just one example. Adding a NOT NULL column bites you, hard, for example.
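A minimal illustration of that NOT NULL case, assuming a generic row-by-row apply and made-up names:

    -- On the provider, while older row changes are still queued in the
    -- change stream:
    ALTER TABLE accounts ADD COLUMN tenant_id integer NOT NULL DEFAULT 0;

    -- A subscriber that runs the same ALTER immediately can still replay
    -- the queued INSERTs only because the DEFAULT fills in the new column.
    -- Add the column without a usable DEFAULT and replay of those older
    -- INSERTs fails with a not-null violation -- which is why the DDL has
    -- to be applied at a fixed point in the change stream, not "whenever".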
Asking users to do this themselves is incredibly fragile and requires much more knowledge of PostgreSQL innards than I think is realistic. It's part of why Slony-I is difficult too.
Sure, you can avoid having DDL replication if you're really, really careful and know exactly what you're doing. I don't think it's a pre-req for getting logical replication into core, but I do think it's a pretty strong requirement for taking the use of logical replication for HA and failover seriously for general use.
To be honest, I was shocked that pglogical and pglogical_output didn't
go into this release. I assumed that you and other folks at
2ndQuadrant were going to make a big push to get that done. I did
take a brief look at one of them - pglogical, I think - a week or two
ago but there were unaddressed review comments that had been pending
for months and there were a lot of fairly obvious things that needed
to be done before it could be seriously considered as a core
submission. Like, for example, rewriting the documentation heavily
and making it look like the rest of our docs, and putting it in SGML
format. The code seemed to need quite a bit of cleanup, too.
Yes, I agree, and I think those comments are largely fair. Both Petr and I had more limited time to work on them than either of us had expected, so timely follow-up was a challenge. We also got them ready for initial submission later in the cycle than we'd hoped. That said, there was also very little review response or interest for a long time after they were published... and by the time there was, I think both Petr and I were bogged down a bit in other project work.
In retrospect I think the separate submission of the output plugin was a mistake too, at least when called pglogical_output; in part because of the name it was completely ignored as meaningless without the downstream and written off as useful for nothing except logical replication. So no momentum was built by the time the downstream was published.
One thing that'd help a lot from my PoV would be if PGXS supported building SGML docs. It's a major pain that maintaining a version as an extension (so people can use it for older versions) requires a different documentation format. But all in all that's relatively minor.
Now,
logical replication is a sufficiently important feature that if the
only way it's going to get into core is if I work on it myself, or get
other people at EnterpriseDB to do so, then I'll try to make that
happen. But I was assuming that that was your/2ndQuadrant's patch,
that you were going to get it in shape, and that me poking my nose
into it wasn't going to be particularly welcome. Maybe I've misread
the whole dynamic here.
I think getting more people involved is useful in a number of ways. Not just time and resources, but also getting more agreement on the direction things should be going in and making sure that more people have a direct hand in the core and are happy with it. Working from a common repo where interested people can push directly would be an interesting start.
I'll need to discuss this inside 2ndQ and follow up as there are things involved that I can't make determinations for myself.
> Do you want to get a logical replication system into core that doesn't work
> properly with lots of the other features in PostgreSQL? That's historically
> not how we've done things here, and sometimes massive amounts of work have
> been required to make new feature X work with obscure/awkward existing
> feature Y. Inheritance comes strongly to mind as an exciting challenge for
> many new features to support properly. Arguments are usually raised that
> reference things like the mess created by MySQL's storage engines and how
> we're better off making everything work right once and for all.
So, let me be clear, here. You have every right to decide which
feature you want to work on, and if, at the current time, failover
slots is that thing and if, at the current time, in-core logical
replication solution is not that thing, then that is entirely up to
you.
It's not entirely up to me personally, but I take your point.
My own priorities _have_ been influenced by how many people told me not to bother with getting logical rep in core since it can just live as an extension, though. I thought it reasonable to focus on things that can't be done as an extension since pglogical looked like it had little hope of landing in 9.6. Now I'm being told by others that it may as well not exist until it's in core. That's ... a tad frustrating. But that's also part of working in a community of individuals with different views and opinions.
Of course, I'm also biased to see pglogical as a going real-world concern which is relevant whether in or out of tree, so to me it seems obvious that improvements in core to make logical rep using it work better are worthwhile. If the general viewpoint of the wider community is "not in core, doesn't matter" then others will see it very differently.
FWIW, I'm also constantly keeping in mind that while I really, really want pglogical in core, there'll certainly be a backport as an extension for 9.4/9.5 and now 9.6. Given the massive lead times involved in getting anything into user hands I've really been hoping to get some of those foundations into 9.6 so that the backport from 9.7+ works better on 9.6. I suspect this either wasn't obvious or that others here don't see that as interesting.
That being said, if we get a logical replication system into core that
doesn't do DDL, doesn't do multi-master, doesn't know squat about
sequences, and rolls over and dies if a timeline switch happens, I
would consider that a huge step forward and I think a lot of other
people would, too.
I can understand that.
I think it's pretty wrong for it to die on timeline switch. I also think we'd really need *something* in place for sequences (even Londiste-style sync-with-fudge-factor), but the rest is tolerable as far as I'm concerned. I absolutely agree that multi-master isn't a requirement, and I think a "queue some DDL for me" function is ugly but an acceptable workaround for now. That's part of why pglogical's boundaries were set where they were when separating it from BDR.
We have a long history of building features
incrementally. Parallel query in 9.6 doesn't parallelize every query
that can be parallelized
Right. But it doesn't just arbitrarily break on what it doesn't understand. That's the difference here.
But
plenty of people also make the opposite mistake, of thinking that they
have to fix everything at once in order to have anything worthwhile,
and that's just as much of a trap as the other thing.
Yeah, that's fair. Personally I think I pushed to cut pglogical to the bone in terms of the minimum viable feature set that was targeted for it, though. It goes very, very far from solving everything, and intentionally ignores hard problems like DDL.
Even with lots
of limitations, built-in logical replication could still be good
enough to be used long enough to manage a major version upgrade
... with some form of sequence handling added, yes. Not necessarily nice sequence decoding, it can be worked around.
Well, as noted above, I think the main thing we need to figure out is
who is going to do the work. The main thing blocking other people
from working on it is the belief that we are just waiting for you or
someone else at 2ndQuadrant to land the necessary patches, but if you
aren't really working on that in a focused way, then somebody else can
step up.
It's a big job, and there are lots of differing viewpoints on it.
It does seem like having 2ndQ submit a patch, wait for feedback, be told "oh, don't do it that way, come back with <something else>", then wait for us to respond and address the issues isn't working. Certainly isn't going to get us there in a timely manner.
I want to explore some options here, but I have to talk to internal team members. Sorry for the handwavyness of that.
On 06/04/16 03:29, Robert Haas wrote: > >> I don't understand why it seems to be considered OK for logical slots to >> just vanish on failover. The only other things I can think of where that's >> considered OK are unlogged tables (because that's the point and we have >> failover-safe ones too) and the old hash indexes nobody's quite willing to >> remove yet. > > First, it wasn't until 9.3 that physical standbys could follow > timeline switches, but that doesn't mean that streaming replication > was useless in 9.0 - 9.2, or that warm standby was useless in earlier > versions. Logical decoding isn't useless without that capability > either. Would it be nice if we did have that capability? Of course. > Except that in 9.0 - 9.2 there was working workaround for that, there is no such thing for logical decoding. > Second, I'm not sure whether it was a good design decision to make > logical slots a special kind of object that sit off to the side, > neither configuration (like postgresql.conf) nor WAL-protected data > (like pg_clog and the data files themselves), but it was certainly a > very deliberate decision. I sort of expected them to be WAL-logged, > but Andres argued (not unconvincingly) that we'd want to have slots on > standbys, and making them WAL-logged would preclude that. I do think it was good design decision. We just need to make them failoverable bit differently and the failover slots patch IMHO isn't the right way either as I said in another reply in this thread. > >> Review and test responses have been pretty underwhelming for pglogical, and >> quite a bit seem to have boiled down to "this should live as an extension, >> we don't need it in core". It often feels like we can't win: if we seek to >> get it into core we're told it's not wanted/needed, but if we try to focus >> on solving issues in core to make it work better and let it live as an >> extension we're told we shouldn't bother until it's in core. > > To be honest, I was shocked that pglogical and pglogical_output didn't > go into this release. I assumed that you and other folks at > 2ndQuadrant were going to make a big push to get that done. I did > take a brief look at one of them - pglogical, I think - a week or two > ago but there were unaddressed review comments that had been pending > for months and there were a lot of fairly obvious things that needed > to be done before it could be seriously considered as a core > submission. Like, for example, rewriting the documentation heavily > and making it look like the rest of our docs, and putting it in SGML > format. The code seemed to need quite a bit of cleanup, too. Now, > logical replication is a sufficiently important feature that if the > only way it's going to get into core is if I work on it myself, or get > other people at EnterpriseDB to do so, then I'll try to make that > happen. But I was assuming that that was your/2ndQuadrant's patch, > that you were going to get it in shape, and that me poking my nose > into it wasn't going to be particularly welcome. Maybe I've misread > the whole dynamic here. I guess you did, me and I think Craig as well hoped for some feedback on the general ideas in terms of protocol, node setup (I mean catalogs) and general architecture from the community. That didn't really happen. And without any of that happening I didn't feel confident trying to get it right within last month of dev cycle. 
Especially given the size of the patch and the fact that we also had other patches we worked on which had a realistically higher chance of getting in. Not sure how Craig feels about it. Converting documentation, renaming some params in function names etc (those unaddressed comments) seemed secondary to me.
(As a side note, I was also 2 weeks without a proper working laptop around FOSDEM time, which had an effect on my responses to -hackers about the topic, especially to Steve Singer who did a good job of reviewing the usability at the time, but even if I had had it, it would not have saved the patch.)
In general I think a project of this size requires more attention from a committer to help shepherd it, and neither Craig nor I am that. I am glad that Andres said he plans to give some time in the next cycle to logical replication because that should be a big help.
>
> That being said, if we get a logical replication system into core that
> doesn't do DDL, doesn't do multi-master, doesn't know squat about
> sequences, and rolls over and dies if a timeline switch happens, I
> would consider that a huge step forward and I think a lot of other
> people would, too.
I agree, with the exception of working HA. I would consider it very sad if we got logical replication in core without having any provision for continuity of service. Doing that is relatively trivial in comparison to the logical replication itself however.
--
Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
I tried to address Andres's entirely valid complaints about that comment I added in this patch.
I was going to add a description to restart_lsn in slot.h instead, but it's proving harder to accurately describe restart_lsn than I'd like.
It's a point at which no transactions that commit after confirmed_lsn are yet in progress. Easy enough.
But it's also a point at which visibility information can be reconstructed. It's updated lazily by LogicalIncreaseRestartDecodingForSlot when the client sends confirmation. Looking at SnapBuildProcessRunningXacts that can be the most recent serialized snapshot from the snapshot builder (if there's no xact running at confirmed_lsn) or the restart_decoding_lsn of the oldest xact in progress at the confirmed_lsn. If I've read it correctly the xact's restart_decoding_lsn is the ReorderBuffer's restart_decoding_lsn at the time the xact started, which is in turn the last serialized snapshot at that time.
So I'd describe restart_lsn as something like:
"
The oldest LSN that might be required by this replication slot.
Points to the LSN of the most recent xl_running_xacts record where no transaction that commits after confirmed_flush is already in progress. At this point all xacts committing after confirmed_flush can be read entirely into reorder buffers and all visibility information can be reconstructed.
"
Is that accurate?
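(As an aside for anyone following along: both values being described here are visible from SQL, which can help when checking a description like the above against observed behaviour. A rough sketch, using the 9.6-era function names; slot contents are of course installation-specific:

    -- restart_lsn: oldest WAL the slot may still need for decoding.
    -- confirmed_flush_lsn: the point the client has confirmed as flushed.
    SELECT slot_name,
           restart_lsn,
           confirmed_flush_lsn,
           pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)
               AS wal_retained_bytes
    FROM pg_replication_slots
    WHERE slot_type = 'logical';
)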
I want to get rid of the incorrect comment and fix this up.
Draft patch for comment fix.
Attachment
On Tue, Apr 5, 2016 at 10:34 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> Right, but right now you probably *aren't* doing any kind of logical decoding from a master server to a standby, because there's squat in the core distribution that could make use of that capability. So you never get as far as discovering this problem.
> OK, I'm starting to understand where this is going.
> pglogical and pglogical_output are irrelevant so long as they haven't yet landed in core. There is no logical replication worth considering, therefore anything that improves logical replication is not worth the effort because there's really nothing to improve. Why bother with enhancements like timeline following when nothing uses it.
> Right?
"Irrelevant" is a strong word, and I wouldn't personally go that far. But they would be a lot *more* relevant if they landed in core, at least IMHO.
>> Second, I'm not sure whether it was a good design decision to make logical slots a special kind of object that sit off to the side, neither configuration (like postgresql.conf) nor WAL-protected data (like pg_clog and the data files themselves), but it was certainly a very deliberate decision. I sort of expected them to be WAL-logged, but Andres argued (not unconvincingly) that we'd want to have slots on standbys, and making them WAL-logged would preclude that.
> Yeah. I understand the reasons for that decision. Per an earlier reply I think we can avoid making them WAL-logged so they can be used on standbys and still achieve usable failover support on physical replicas.
I think at one point we may have discussed doing this via additional side-channel protocol messages. Is that what you are thinking about now, or something else?
>> If you retype a column from text to integer, you probably aren't storing anything in it other than integers, in which case it is not necessarily the case that you are locked into applying that change at a particular point in the change stream.
> ... if you're replicating in text format, not send/recv format, yeah. You get away with it in that particular case. But that *only* works where you're using text format and the types are convertible via text representation.
> It's also just one example. Adding a NOT NULL column bites you, hard, for example.
> Asking users to do this themselves is incredibly fragile and requires much more knowledge of PostgreSQL innards than I think is realistic. It's part of why Slony-I is difficult too.
> Sure, you can avoid having DDL replication if you're really, really careful and know exactly what you're doing. I don't think it's a pre-req for getting logical replication into core, but I do think it's a pretty strong requirement for taking the use of logical replication for HA and failover seriously for general use.
Well, we can agree to disagree on this. I don't think that it's all that difficult to figure out how to change your schema in a step-by-step way that allows logical replication to keep working while the nodes are out of sync, but you don't have to agree and that's fine. I'm not objecting to eventually adding that feature to core. I do think it's a bad idea to be polishing that sort of thing before getting some more basic facility into core.
>> To be honest, I was shocked that pglogical and pglogical_output didn't go into this release. I assumed that you and other folks at 2ndQuadrant were going to make a big push to get that done. I did take a brief look at one of them - pglogical, I think - a week or two ago but there were unaddressed review comments that had been pending for months and there were a lot of fairly obvious things that needed to be done before it could be seriously considered as a core submission. Like, for example, rewriting the documentation heavily and making it look like the rest of our docs, and putting it in SGML format. The code seemed to need quite a bit of cleanup, too.
> Yes, I agree, and I think those comments are largely fair. Both Petr and I had more limited time to work on them than either of us had expected, so timely follow-up was a challenge. We also got them ready for initial submission later in the cycle than we'd hoped. That said, there was also very little review response or interest for a long time after they were published... and by the time there was I think both Petr and I were bogged down a bit in other project work.
> In retrospect I think the separate submission of the output plugin was a mistake too, at least when called pglogical_output; in part because of the name it was completely ignored as meaningless without the downstream and written off as useful for nothing except logical replication. So no momentum was built by the time the downstream was published.
While I acknowledge that a logical output plugin has applications beyond replication, I think replication is the preeminent one by a considerable margin. Some people want it for other purposes, and we should facilitate that. But that number of people is dwarfed by the number who would use a seamless logical replication facility if we had one available. And so that's the thing I think we should focus on making work.
If I were doing that, I think I would attack it from a considerably different direction than what has so far been proposed. I would try to put the stuff in core, not contrib, and I would arrange to control it using DDL, not function calls. For version one, I would cut all of the stuff that allows data to be sent in any format other than text, and just use in/outfuncs all the time.
> I think getting more people involved is useful in a number of ways. Not just time and resources, but also getting more agreement on the direction things should be going in and making sure that more people have a direct hand in the core and are happy with it. Working from a common repo where interested people can push directly would be an interesting start.
> I'll need to discuss this inside 2ndQ and follow up as there are things involved that I can't make determinations for myself.
Understood.
>> We have a long history of building features incrementally. Parallel query in 9.6 doesn't parallelize every query that can be parallelized
> Right. But it doesn't just arbitrarily break on what it doesn't understand. That's the difference here.
I do generally think that logical decoding relies too much on trying to set up situations where it will never fail, and I've said from the beginning that it should make more provision to cope with failure rather than just trying to avoid it. If logical decoding never breaks, great. But the approach I would favor is to set things up so that it automatically reclones if there is a replication break, and then as an optimization project, try to eliminate those cases one by one.
> It's a big job, and there are lots of differing viewpoints on it.
> It does seem like having 2ndQ submit a patch, wait for feedback, be told "oh, don't do it that way, come back with <something else>", then wait for us to respond and address the issues isn't working. Certainly isn't going to get us there in a timely manner.
That isn't really my perception of how things have gone, but I admit I may not have been following it as closely as I should have done. I'd be happy to talk with you about this in person if you're going to be at PGCONF.US or PGCon in Ottawa; or there's always Skype. I don't see any reason why we can't all put our heads together and agree on a direction that we can all live with.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 07/04/16 17:32, Robert Haas wrote:
>>> Second, I'm not sure whether it was a good design decision to make logical slots a special kind of object that sit off to the side, neither configuration (like postgresql.conf) nor WAL-protected data (like pg_clog and the data files themselves), but it was certainly a very deliberate decision. I sort of expected them to be WAL-logged, but Andres argued (not unconvincingly) that we'd want to have slots on standbys, and making them WAL-logged would preclude that.
>> Yeah. I understand the reasons for that decision. Per an earlier reply I think we can avoid making them WAL-logged so they can be used on standbys and still achieve usable failover support on physical replicas.
> I think at one point we may have discussed doing this via additional side-channel protocol messages. Is that what you are thinking about now, or something else?
I think the most promising idea was to use a pull model instead of a push model for slot updates, and to use feedback to tell the master how far behind the oldest slot on the standby is. This has the additional advantage of solving the biggest problem with decoding on a standby (keeping an old enough catalog xid and lsn). It also addresses your concern about propagating through a whole cascade, since this can be done in a more controllable way (from the bottom up in the replication chain).
> For version one, I would cut all of the stuff that allows data to be sent in any format other than text, and just use in/outfuncs all the time.
Agreed here; I think doing the binary transfer for base types etc. is just an optimization, not a necessity. This is one of the reasons why I wanted to get a bit broader feedback on the protocol - to get some kind of consensus about what we want now, what we might want in the future, and how to handle getting from now to the future without too much complexity or breakage. For example, we currently have flags on every protocol message which I am not sure are completely necessary, but OTOH we probably don't want to bump the protocol version with every new version of Postgres either.
> I do generally think that logical decoding relies too much on trying to set up situations where it will never fail, and I've said from the beginning that it should make more provision to cope with failure rather than just trying to avoid it. If logical decoding never breaks, great. But the approach I would favor is to set things up so that it automatically reclones if there is a replication break, and then as an optimization project, try to eliminate those cases one by one.
Well, that really depends. I've seen way too many cases where people use logical replication as a transport mechanism rather than as replication where the destination is the same as the source, and in those scenarios there is often no way to "reclone", either because the historical data are no longer on the source or because the data on the target were already updated after they'd been replicated. But in general the idea of recovering from errors rather than being hell-bent on preventing them is something I am pushing as well. For example, it should be easier to look at what's in the replication queue and remove things from it if needed.
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 7 April 2016 at 23:32, Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah. I understand the reasons for that decision. Per an earlier reply I
> think we can avoid making them WAL-logged so they can be used on standbys
> and still achieve usable failover support on physical replicas.
I think at one point we may have discussed doing this via additional
side-channel protocol messages. Is that what you are thinking about
now, or something else?
Essentially, yes.
The way I'd like to do it in 9.6+1 is:
- Require that the replica(s) use streaming replication with a replication slot to connect to the master
- Extend the feedback protocol to allow the replica to push its required catalog_xmin up to the master so it doesn't vacuum away catalog tuples still needed by a replica. (There's no need for it to push the restart_lsn, and it's fine for the master to throw away WAL still needed by a replica.) A rough sketch of what such an extended feedback message might carry follows after this list.
- Track the replica's catalog_xmin on the replica's slot. So it'll be a slot used for physical replication that also has a catalog_xmin set.
- Allow applications to create their own slots on read-replicas (for apps that want to decode from a standby).
- For transparent failover, sync slot state from master to replica via writes by a helper to a table on the master that get applied by a helper on the standby.
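To make the feedback idea above a bit more concrete, here's a struct-shaped sketch of what an extended hot-standby feedback message could carry. The layout and names are purely illustrative assumptions: the real 9.6 message is written field-by-field into the protocol buffer using PostgreSQL's own types, and carries only a send timestamp, xmin, and epoch; the catalog fields are the hypothetical addition, not an existing part of the protocol.
#include <stdint.h>
/*
 * Illustrative only: the hot-standby feedback message ('h'), extended with
 * the proposed catalog_xmin fields.  Only sendTime, xmin and epoch exist in
 * 9.6; catalogXmin/catalogXminEpoch are hypothetical.
 */
typedef struct HotStandbyFeedbackSketch
{
	int64_t		sendTime;			/* when the standby sent this message */
	uint32_t	xmin;				/* oldest xmin needed by standby queries */
	uint32_t	epoch;				/* xid epoch of xmin */
	uint32_t	catalogXmin;		/* proposed: oldest catalog_xmin needed by
									 * any logical slot on the standby */
	uint32_t	catalogXminEpoch;	/* proposed: xid epoch of catalogXmin */
} HotStandbyFeedbackSketch;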
Allowing apps to create their own slots on a replica lets aware apps do failover, but only if they know about and can connect to all the failover-candidate replica(s), and they have to maintain and advance a slot on each. Potentially fragile. So this is mostly good for supporting decoding from a standby, rather than failover.
I really want easy, practical failover that doesn't require every app to re-implement logic to keep track of replicas itself, etc. For that I'd have a bgworker that runs on the replica make a direct libpq connection to the master and snapshot the state of its slots plus its xlog insert position. The worker would wait until the replica's replay reaches/passes that xlog position before applying the slot state copied from the master to the replica. (Adding a "GET SLOT STATE" or whatever to the walsender interface would make this less ugly.) This basically emulates what failover slots did, but lazily: with no hook to capture slot state saves we have to poll the master, and with no ability to insert the changes into WAL and run a custom redo function on the replica we have to manually ensure they're applied at the right time. Unlike with failover slots it's possible for the slot on the replica to be behind where it was on the master at the same LSN - but that's OK because we're protecting against catalog vacuum, per above.
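As a rough illustration of the master-side snapshot step that helper would perform (a sketch only, not code from any patch: the connection string is an assumption, error handling is minimal, and the replica-side "apply slot state" step would need new backend support and is not shown):
#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
	PGconn	   *master = PQconnectdb("host=master-host dbname=postgres");
	PGresult   *res;

	if (PQstatus(master) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(master));
		return 1;
	}

	/* Snapshot the state of every logical slot on the master. */
	res = PQexec(master,
				 "SELECT slot_name, plugin, database, catalog_xmin,"
				 "       restart_lsn, confirmed_flush_lsn"
				 "  FROM pg_catalog.pg_replication_slots"
				 " WHERE slot_type = 'logical'");
	/* ... stash the rows somewhere ... */
	PQclear(res);

	/* Then note the WAL insert position reached after the snapshot. */
	res = PQexec(master, "SELECT pg_current_xlog_insert_location()");
	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("apply the snapshot once replay reaches %s\n",
			   PQgetvalue(res, 0, 0));
	PQclear(res);

	/*
	 * The worker on the replica would wait until
	 * pg_last_xlog_replay_location() >= that LSN, then create or advance the
	 * local copies of the slots (the apply step is the part needing new
	 * backend support).
	 */
	PQfinish(master);
	return 0;
}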
(I'd really like to have a slot flag that lets us disallow replay from such copied slots on the replica, marking them as usable only if not in recovery.)
The only part of this that isn't possible on 9.6 is having the replica push the catalog_xmin up to the master over feedback. But we can emulate that with a bgworker on the replica that maintains an additional dummy logical slot on the master. It replays the dummy slot to the lowest confirmed_lsn of any slot on the replica. Somewhat inefficient, but will work.
If it sounds like a bit of a pile of hacks, that's because the failover support part is. But unlike failover slots it will bring us closer to being able to do logical decoding from a replica, which is nice. It can be made a lot less ugly if the help of the walsender can be enlisted to report the master's slot state, so we don't have to use normal libpq. (The reason I wouldn't use a bgworker on the master to write it to a table then another worker to apply changes from that table on the replica is mainly that then we can't have failover support for cascading replicas, which can't write WAL.)
Well, we can agree to disagree on this. I don't think that it's all
that difficult to figure out how to change your schema in a
step-by-step way that allows logical replication to keep working while
the nodes are out of sync, but you don't have to agree and that's
fine. I'm not objecting to eventually adding that feature to core. I
do think it's a bad idea to be polishing that sort of thing before
getting some more basic facility into core.
That much I agree on - I certainly don't want to block this on DDL replication.
While I acknowledge that a logical output plugin has applications
beyond replication, I think replication is the preeminent one by a
considerable margin. Some people want it for other purposes, and we
should facilitate that. But that number of people is dwarfed by the
number who would use a seamless logical replication facility if we had
one available. And so that's the thing I think we should focus on
making work.
Agreed.
Really I think we'll want a separate json output plugin for most of those other uses anyway. Though some of the facilities in pglogical_output will need to be extracted, added into logical decoding itself, and shared.
If I were doing that, I think I would attack it from a considerably
different direction than what has so far been proposed. I would try
to put the stuff in core, not contrib, and I would arrange to control
it using DDL, not function calls. For version one, I would cut all of
the stuff that allows data to be sent in any format other than text,
and just use in/outfuncs all the time.
I'm very hesitant to try to do this with new DDL. Partly for complexity, partly because I'd really like to be able to carry a backport for 9.4 / 9.5 / 9.6 so people can use it within the next couple of years.
I do generally think that logical decoding relies too much on trying
to set up situations where it will never fail, and I've said from the
beginning that it should make more provision to cope with failure
rather than just trying to avoid it. If logical decoding never
breaks, great. But the approach I would favor is to set things up so
that it automatically reclones if there is a replication break, and
then as an optimization project, try to eliminate those cases one by
one.
I really can't agree there.
For one thing, sometimes those clones are *massive*. How do you tell someone who's replicating a 10 TiB database that they've got to let the whole thing re-sync, and by the way all replication will completely halt until it does?
It's bad enough with a physical standby, though at least there rsync helps a bit and pg_rewind has made a huge difference. Let's not create the same problem again in logical replication.
Then, as Petr points out, there are applications where you can't re-clone, at least not directly. You're using the decoding stream with a transform downstream to insert incoming data into fact tables. You're feeding it into a messaging system. You're merging data from multiple upstreams into a single downstream. Many of the interesting, exciting things we can make possible with logical replication that simply aren't possible with physical replication really need it not to randomly break.
Also, from an operational experience point of view, BDR has places where it does just break if you do something wrong. Experience with this has left me absolutely adamant that we must not have such booby-traps in core logical replication, at least in the medium term. It doesn't matter how many times you tell users "don't do that" ... they'll do it. Then get angry when it breaks. Not to mention how hard it can be to discover why something broke. You have to look at the logs. Obvious to you or me, but I spend a lot of time answering questions about BDR and pglogical to the effect of "not working? nothing happening? LOOK AT THE LOG FILES."
I think you're really overestimating the average user when it comes to analysing and understanding the consequences of specific schema changes, etc. Sure, I think it's fine not to support some things we do support for physical replication, but as much as possible we should stop the user from doing those things when logical replication is enabled, and where that's not possible it needs to be really, really obvious what broke and why.
That isn't really my perception of how things have gone, but I admit I
may not have been following it as closely as I should have done. I'd
be happy to talk with you about this in person if you're going to be
at PGCONF.US or PGCon in Ottawa; or there's always Skype. I don't see
any reason why we can't all put our heads together and agree on a
direction that we can all live with.
Yeah. I'd like to see us able to work on a shared dev tree rather than mailing patches around, at least.
I'm not going to make it to PgConf.US. I don't know about Ottawa yet, but I doubt it given that I did go to Brussels. Perth is absurdly far from almost everywhere, unfortunately.
Hi all
I made an unfortunate oversight in the logical decoding timeline following code: it doesn't work for logical decoding from the walsender, because I left the glue code required out of the final cut of the patch.
The reason the new src/test/recovery/ tests don't notice this is that using pg_recvlogical from the TAP tests is currently pretty awkward. pg_recvlogical has no way to specify a stop-point or return when there's no immediately pending data like the SQL interface does. So you have to read from it until you figure it's not going to return anything more, then kill it and look at what it did return and hope you don't lose anything in buffering. I don't much like relying on timeouts as part of normal successful results since they can upset some of the older and slower buildfarm members. I'd rather be able to pass a --stoppos= or a --n-xacts= option, but it was a bit too late to add those.
Per the separate mail I sent to hackers, xlogreader is currently pretty much timeline-agnostic. The timeline tracking code needs to know when the xlogreader does a random read and do a fresh lookup, so its state is part of the XLogReaderState struct. But the walsender's xlogreader page read callback doesn't use the xlogreader's state, and it can't, because it's also used for physical replication, which communicates the timeline to read from to the page read function via globals. So for now, I just set the globals before calling the page read function:
+ XLogReadDetermineTimeline(state);
+ sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
+ sendTimeLine = state->currTLI;
+ sendTimeLineValidUpto = state->currTLIValidUntil;
+ sendTimeLineNextTLI = state->nextTLI;
Per separate mail, I'd quite like to tackle the level of duplication in timeline following logic in 9.7 and maybe see if the _three_ separate read xlog page functions can be unified at the same time. But that sure isn't 9.6 material.
Attachment
On Sun, Apr 17, 2016 at 10:01:36PM +0800, Craig Ringer wrote:
> I made an unfortunate oversight in the logical decoding timeline following code: it doesn't work for logical decoding from the walsender, because I left the glue code required out of the final cut of the patch.
> Subject: [PATCH] Enable logical timeline following in the walsender
>
> ---
>  src/backend/access/transam/xlogutils.c | 7 +++----
>  src/backend/replication/walsender.c    | 11 +++++++----
>  src/include/access/xlogreader.h        | 3 +++
>  src/include/access/xlogutils.h         | 2 ++
>  4 files changed, 15 insertions(+), 8 deletions(-)
[This is a generic notification.]
The above-described topic is currently a PostgreSQL 9.6 open item. Alvaro, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.
Noah Misch wrote:
> The above-described topic is currently a PostgreSQL 9.6 open item. Alvaro, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.
I plan to get Craig's patch applied on Monday 25th, giving me some more time for study.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Craig Ringer wrote:
> -	/* oldest LSN that the client has acked receipt for */
> +	/*
> +	 * oldest LSN that the client has acked receipt for
> +	 *
> +	 * Decoding will start calling output plugin callbacks for commits
> +	 * after this LSN unless a different start point is specified by
> +	 * the client.
> +	 */
> 	XLogRecPtr	confirmed_flush;
How about this wording?
	/*
	 * Oldest LSN that the client has acked receipt for.  This is used as
	 * the start_lsn point in case the client doesn't specify one, and also
	 * as a safety measure to back off in case the client specifies a
	 * start_lsn that's further in the future than this value.
	 */
	XLogRecPtr	confirmed_flush;
> -	/* oldest LSN that might be required by this replication slot */
> +	/*
> +	 * oldest LSN that might be required by this replication slot.
> +	 *
> +	 * Points to the LSN of the most recent xl_running_xacts record where
> +	 * no transaction that commits after confirmed_flush is already in
> +	 * progress. At this point all xacts committing after confirmed_flush
> +	 * can be read entirely into reorder buffers and all visibility
> +	 * information can be reconstructed.
> +	 */
> 	XLogRecPtr	restart_lsn;
I'm unsure about this one. Why are we speaking of reorder buffers here, if this is generically about replication slots? Is it that slots used by physical replication do not *need* a restart_lsn? I think the comment should be worded in a way that's not specific to logical decoding; or, if the restart_lsn field *is* specific to logical decoding, then the comment should state as much.
> diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
> index 5af6751..5f74941 100644
> --- a/src/backend/replication/logical/logicalfuncs.c
> +++ b/src/backend/replication/logical/logicalfuncs.c
> @@ -236,8 +236,10 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
>  	PG_TRY();
>  	{
>  		/*
> -		 * Passing InvalidXLogRecPtr here causes decoding to start returning
> -		 * commited xacts to the client at the slot's confirmed_flush.
> +		 * Passing InvalidXLogRecPtr here causes logical decoding to
> +		 * start calling the output plugin for transactions that commit
> +		 * at or after the slot's confirmed_flush. This filters commits
> +		 * out from the client but they're still decoded.
>  		 */
>  		ctx = CreateDecodingContext(InvalidXLogRecPtr,
>  									options,
I think it's weird to have the parameter explained in the callsite rather than in the comment about CreateDecodingContext. I think this patch needs to apply to logical.c, not logicalfuncs.
> @@ -262,11 +264,9 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
>  	ctx->output_writer_private = p;
> 
>  	/*
> -	 * We start reading xlog from the restart lsn, even though we won't
> -	 * start returning decoded data to the user until the point set up in
> -	 * CreateDecodingContext. The restart_lsn is far enough back that we'll
> -	 * see the beginning of any xact we're expected to return to the client
> -	 * so we can assemble a complete reorder buffer for it.
> +	 * Reading and decoding of WAL must start at restart_lsn so that the
> +	 * entirety of each of the xacts that commit after confimed_lsn can be
> +	 * accumulated into reorder buffers.
>  	 */
>  	startptr = MyReplicationSlot->data.restart_lsn;
This one looks fine to me, except typo s/confimed/confirmed/
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-04-26 17:22:49 -0300, Alvaro Herrera wrote:
> > -	/* oldest LSN that might be required by this replication slot */
> > +	/*
> > +	 * oldest LSN that might be required by this replication slot.
> > +	 *
> > +	 * Points to the LSN of the most recent xl_running_xacts record where
> > +	 * no transaction that commits after confirmed_flush is already in
> > +	 * progress. At this point all xacts committing after confirmed_flush
> > +	 * can be read entirely into reorder buffers and all visibility
> > +	 * information can be reconstructed.
> > +	 */
> > 	XLogRecPtr	restart_lsn;
> I'm unsure about this one. Why are we speaking of reorder buffers here, if this is generically about replication slots? Is it that slots used by physical replication do not *need* a restart_lsn?
It's used in physical slots as well, so I agree. Also, I think the xl_running_xacts bit goes into way too much detail; it could just as well be a shutdown checkpoint or another similar location.
> > diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
> > index 5af6751..5f74941 100644
> > --- a/src/backend/replication/logical/logicalfuncs.c
> > +++ b/src/backend/replication/logical/logicalfuncs.c
> > @@ -236,8 +236,10 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
> >  	PG_TRY();
> >  	{
> >  		/*
> > -		 * Passing InvalidXLogRecPtr here causes decoding to start returning
> > -		 * commited xacts to the client at the slot's confirmed_flush.
> > +		 * Passing InvalidXLogRecPtr here causes logical decoding to
> > +		 * start calling the output plugin for transactions that commit
> > +		 * at or after the slot's confirmed_flush. This filters commits
> > +		 * out from the client but they're still decoded.
> >  		 */
> >  		ctx = CreateDecodingContext(InvalidXLogRecPtr,
> >  									options,
> I think it's weird to have the parameter explained in the callsite rather than in the comment about CreateDecodingContext. I think this patch needs to apply to logical.c, not logicalfuncs.
I still think the entire comment ought to be removed.
- Andres
Craig Ringer wrote:
> The reason the new src/test/recovery/ tests don't notice this is that using pg_recvlogical from the TAP tests is currently pretty awkward. pg_recvlogical has no way to specify a stop-point or return when there's no immediately pending data like the SQL interface does. So you have to read from it until you figure it's not going to return anything more, then kill it and look at what it did return and hope you don't lose anything in buffering. I don't much like relying on timeouts as part of normal successful results since they can upset some of the older and slower buildfarm members. I'd rather be able to pass a --stoppos= or a --n-xacts= option, but it was a bit too late to add those.
Considering that pg_recvlogical was introduced mostly as a way to test logical decoding features, I think this is a serious oversight and we should patch it. I suppose we could leave it for 9.7, though I admit I would prefer to introduce it in 9.6. Now everyone throwing stones at me in 3 ... 2 ...
> Per the separate mail I sent to hackers, xlogreader is currently pretty much timeline-agnostic. The timeline tracking code needs to know when the xlogreader does a random read and do a fresh lookup, so its state is part of the XLogReaderState struct. But the walsender's xlogreader page read callback doesn't use the xlogreader's state, and it can't, because it's also used for physical replication, which communicates the timeline to read from to the page read function via globals.
The globals there are a disaster and I welcome efforts to get rid of them.
> So for now, I just set the globals before calling the page read function:
>
> +	XLogReadDetermineTimeline(state);
> +	sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
> +	sendTimeLine = state->currTLI;
> +	sendTimeLineValidUpto = state->currTLIValidUntil;
> +	sendTimeLineNextTLI = state->nextTLI;
Ok, this is pretty ugly but seems acceptable.
> Per separate mail, I'd quite like to tackle the level of duplication in timeline following logic in 9.7 and maybe see if the _three_ separate read xlog page functions can be unified at the same time. But that sure isn't 9.6 material.
Agreed.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
> Noah Misch wrote:
> > The above-described topic is currently a PostgreSQL 9.6 open item. Alvaro, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.
> I plan to get Craig's patch applied on Monday 25th, giving me some more time for study.
I failed to meet this deadline, for which I apologize. This week is going to be hectic around here, so my new deadline is to get these two patches applied on Friday 29th. Ok?
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 26, 2016 at 05:37:40PM -0300, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > Noah Misch wrote:
> > > The above-described topic is currently a PostgreSQL 9.6 open item. Alvaro, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.
> > I plan to get Craig's patch applied on Monday 25th, giving me some more time for study.
> I failed to meet this deadline, for which I apologize. This week is going to be hectic around here, so my new deadline is to get these two patches applied on Friday 29th. Ok?
Okay; thanks for this notification.
On Tue, Apr 26, 2016 at 4:35 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Craig Ringer wrote:
>> The reason the new src/test/recovery/ tests don't notice this is that using pg_recvlogical from the TAP tests is currently pretty awkward. pg_recvlogical has no way to specify a stop-point or return when there's no immediately pending data like the SQL interface does. So you have to read from it until you figure it's not going to return anything more, then kill it and look at what it did return and hope you don't lose anything in buffering. I don't much like relying on timeouts as part of normal successful results since they can upset some of the older and slower buildfarm members. I'd rather be able to pass a --stoppos= or a --n-xacts= option, but it was a bit too late to add those.
> Considering that pg_recvlogical was introduced mostly as a way to test logical decoding features, I think this is a serious oversight and we should patch it. I suppose we could leave it for 9.7, though I admit I would prefer to introduce it in 9.6. Now everyone throwing stones at me in 3 ... 2 ...
If that's a small and relatively contained change, I don't object to the idea of you making it, provided it's done pretty soon, and definitely before beta1. If it's going to require substantial refactoring or can't be done quickly, then, yes, I object.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-04-27 14:21:53 -0400, Robert Haas wrote:
> > Considering that pg_recvlogical was introduced mostly as a way to test logical decoding features, I think this is a serious oversight and we should patch it. I suppose we could leave it for 9.7, though I admit I would prefer to introduce it in 9.6. Now everyone throwing stones at me in 3 ... 2 ...
> If that's a small and relatively contained change, I don't object to the idea of you making it, provided it's done pretty soon, and definitely before beta1. If it's going to require substantial refactoring or can't be done quickly, then, yes, I object.
I don't object either, I was looking for the feature myself a number of times (for tap tests in my case). It requires some amount of thinking about what the limit applies to, though: messages sent by the server? Bytes? TCP messages? xids? Time?
- Andres
On 28 April 2016 at 02:29, Andres Freund <andres@anarazel.de> wrote:
I don't object either, I was looking for the feature myself a number of
times (for tap tests in my case).
Exactly what I want it for.
It requires some amount of thinking about what the limit applies to,
though: messages sent by the server? Bytes? TCP messages? xids? Time?
The main thing I think would be useful is a threshold for message LSNs after which it disconnects and exits. It doesn't have to be a commit message; if we receive any message with an upstream LSN past the LSN of interest then no commit we care about can arrive any later, and that way it'll be useful if/when streaming decoding is implemented too.
That way a test can capture the xlog insert lsn after doing the work it wants to replay, do another dummy write to make sure there's something more to replay, and decode up to that point.
I don't think pg_recvlogical can do anything about the need for that dummy write, since the client has no way to determine the exact LSN of the commit record of the xact of interest. It can't rely on pg_current_xlog_insert_location() or pg_current_xlog_location() since autovacuum or a checkpoint might've written xlog since. Logical streaming replication doesn't have a non-blocking mode where it returns immediately if it'd have to wait for more xlog so we can't just send off the most recent server LSN as the endpoint.
Ideally I think this should be done server-side with a limit on replay LSN and a non-blocking option, but that's way too invasive for this stage in the release cycle. Client-side seems fine enough.
Number of xacts received would also be handy, but I don't know if I'll get to that.
pg_receivexlog should send confirmation on exit.
(The other thing I've wanted sometimes is a --peek option, but that's again not really in scope for this.)
On 29 April 2016 at 15:40, Craig Ringer <craig@2ndquadrant.com> wrote:
With this patch pg_recvlogical takes a new --endpos LSN argument, and will exit if either:
I don't think pg_recvlogical can do anything about the need for that dummy write, since the client has no way to determine the exact LSN of the commit record of the xact of interest. It can't rely on pg_current_xlog_insert_location() or pg_current_xlog_location() since autovacuum or a checkpoint might've written xlog since. Logical streaming replication doesn't have a non-blocking mode where it returns immediately if it'd have to wait for more xlog so we can't just send off the most recent server LSN as the endpoint.
(Patch attached. Blah blah explanation blah):
With this patch pg_recvlogical takes a new --endpos LSN argument, and will exit if either:
* it receives an XLogData message with dataStart >= endpos; or
* it receives a keepalive with walEnd >= endpos
The latter allows it to work without needing a dummy transaction to make it see a data message after endpos. If there's nothing to read on the socket until a keepalive we know that the server has nothing to send us, and if walEnd has passed endpos we know nothing more can commit before endpos.
The way I've written things the endpos is the point where we stop receiving and exit, so if a record with start lsn >= endpos is received we'll exit without writing it.
I thought about writing out the record before exiting if the record start LSN is exactly endpos. That'd be handy in cases where the client knows a commit's LSN and wants everything up to that commit. But it's easy enough in this case for the client to set endpos to the commit start lsn + 1, so it's not like the current behaviour stops you doing anything, and it means the code can just test endpos and exit. pg_current_xlog_insert_location() will return at least the lsn of the last commit + 1, so you'll get the expected behaviour for free there. It does mean we might wait for the next walsender keepalive or status update before we exit, though, so if someone feels strongly that endpos should be an inclusive bound I can do that. It's just a bit uglier in the code.
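To make the stop rule described above concrete, here is a minimal standalone sketch of the decision (not the actual pg_recvlogical code; the function and type names are illustrative stand-ins):
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;		/* stand-in for PostgreSQL's XLogRecPtr */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Return true if streaming should stop before handling this message.
 * "dataStart" is the WAL position from an XLogData message; "walEnd" is the
 * server's current WAL end from a keepalive.  endpos is an exclusive bound:
 * a record starting at or after it is never written out.
 */
static bool
reached_endpos(XLogRecPtr endpos, bool is_keepalive,
			   XLogRecPtr dataStart, XLogRecPtr walEnd)
{
	if (endpos == InvalidXLogRecPtr)
		return false;			/* no --endpos given; run until interrupted */

	if (is_keepalive)
		return walEnd >= endpos;	/* server has nothing older left to send */

	return dataStart >= endpos;		/* don't write this record, just exit */
}
A caller that knows the LSN of the commit record it wants can therefore pass that LSN + 1 as endpos to get the commit included, which matches the exclusive-bound behaviour described above.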
I can't add a "number of xacts" filter like the SQL interface has because pg_recvlogical has no idea which records represent a commit, so it's not possible without changing the protocol. I'm not convinced a "number of messages" filter is particularly useful. I could add a timeout, but it's easy enough to do that in a wrapper (like IPC::Run). So I'm sticking with just the LSN filter for now.
Also because pg_recvlogical isn't aware of transaction boundaries, setting endpos might result in a partial transaction being output if endpos is after the end of the last xact wanted and some other xact containing changes made before endpos commits after endpos but before the next status update/keepalive is sent. That xact won't be consumed from the server and will just be sent when the slot is next read from. This won't result in unpredictable output for testing since there we control what other xacts run and will generally exit based on walsender status updates/keepalives.
Here's the patch. Docs included. Comments?
Attachment
On Tue, Apr 26, 2016 at 4:37 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Alvaro Herrera wrote:
>> Noah Misch wrote:
>> > The above-described topic is currently a PostgreSQL 9.6 open item. Alvaro, since you committed the patch believed to have created it, you own this open item. If that responsibility lies elsewhere, please let us know whose responsibility it is to fix this. Since new open items may be discovered at any time and I want to plan to have them all fixed well in advance of the ship date, I will appreciate your efforts toward speedy resolution. Please present, within 72 hours, a plan to fix the defect within seven days of this message. Thanks.
>> I plan to get Craig's patch applied on Monday 25th, giving me some more time for study.
> I failed to meet this deadline, for which I apologize. This week is going to be hectic around here, so my new deadline is to get these two patches applied on Friday 29th. Ok?
Not that you don't know this already, but you have also failed to meet this deadline. Also, this open item has now been promoted to the much sought-after status of "oldest as-yet-unaddressed PostgreSQL 9.6 open item".
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Tue, Apr 26, 2016 at 4:37 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > I failed to meet this deadline, for which I apologize. This week is going to be hectic around here, so my new deadline is to get these two patches applied on Friday 29th. Ok?
> Not that you don't know this already, but you have also failed to meet this deadline. Also, this open item has now been promoted to the much sought-after status of "oldest as-yet-unaddressed PostgreSQL 9.6 open item".
Yay. I thought I was going to have Friday completely available, but that ended up not happening. I'm working on this exclusively now until I get the item closed.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I pushed a fix to some comments, including the ones being discussed in this subthread, which should hopefully close things here.
I'm now going to go over Craig's pg_recvlogical changes and the fix proposed for that problem.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-05-02 16:12:53 -0300, Alvaro Herrera wrote:
> I pushed a fix to some comments, including the ones being discussed in this subthread, which should hopefully close things here.
> I'm now going to go over Craig's pg_recvlogical changes and the fix proposed for that problem.
-	/* oldest LSN that the client has acked receipt for */
+	/*
+	 * Oldest LSN that the client has acked receipt for.  This is used as the
+	 * start_lsn point in case the client doesn't specify one, and also as a
+	 * safety measure to back off in case the client specifies a start_lsn
+	 * that's further in the future than this value.
+	 */
 	XLogRecPtr	confirmed_flush;
This is the wrong way round: confirmed_flush is used if the client's start_lsn is further in the *past* than this value.
Andres Freund wrote:
> -	/* oldest LSN that the client has acked receipt for */
> +	/*
> +	 * Oldest LSN that the client has acked receipt for.  This is used as the
> +	 * start_lsn point in case the client doesn't specify one, and also as a
> +	 * safety measure to back off in case the client specifies a start_lsn
> +	 * that's further in the future than this value.
> +	 */
> 	XLogRecPtr	confirmed_flush;
>
> This is the wrong way round. confirmed_flush is used if the client's start_lsn is further in the *past* than this value.
Bah. Funnily enough I got this one right in the comment for CreateDecodingContext. Thanks for reading through the commit.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Craig Ringer wrote:
> With this patch pg_recvlogical takes a new --endpos LSN argument, and will exit if either:
>
> * it receives an XLogData message with dataStart >= endpos; or
> * it receives a keepalive with walEnd >= endpos
>
> The latter allows it to work without needing a dummy transaction to make it see a data message after endpos. If there's nothing to read on the socket until a keepalive we know that the server has nothing to send us, and if walEnd has passed endpos we know nothing more can commit before endpos.
Here's a slightly revised version of this patch, for later consideration. Changes:
- added -E as short form of --endpos (consistent with -I as --startpos)
- refactored some repetitive code into two auxiliary functions
- allow --endpos to work with --create-slot
- reverted some unrelated changes, such as message additions in verbose mode and changes to existing messages
- documentation reworked
I didn't spot any bugs in Craig's patch, but it needs more testing.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
I don't like reverting patches, but this patch is making me more and more uncomfortable. We have two open items, one of which requires writing new test code that doesn't exist yet; and we have the pg_recvlogical changes that were approved post-feature freeze, but that I now have second thoughts about pushing right away. Craig has also commented on some followup patch to change this whole area in 9.7.
Per https://www.postgresql.org/message-id/20160503165812.GA29604%40alvherre.pgsql I think the best course of action is to revert the whole feature and revisit it for 9.7.
Here's a proposed revert patch. Many minor changes (mostly comment additions) that were applied as part of this series are kept intact. The test_slot_timeline test module and the corresponding recovery test script are also reverted.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera wrote:
> Here's a proposed revert patch. Many minor changes (mostly comment additions) that were applied as part of this series are kept intact. The test_slot_timeline test module and corresponding recovery test script are also reverted.
Pushed.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services