Thread: Logical decoding on standby
Hi all,

I've prepared a working initial, somewhat raw implementation for logical decoding on physical standbys.

Since it's a series of 20 smallish patches at the moment I won't attach it. You can find the current version at time of writing here:

https://github.com/postgres/postgres/compare/c5f365f3ab...2ndQuadrant:dev/logical-decoding-on-standby-pg10-v1?expand=1

i.e. the tag dev/logical-decoding-on-standby-pg10-v1 in the github repo 2ndQuadrant/postgres. Whatever I'm working on (subject to rebase, breakage, etc) lives in the branch dev/logical-decoding-on-standby-pg10.

Quickstart
===

Compile and install as usual; make sure to install test_decoding too.

To see the functionality in action, configure with --enable-tap-tests and:

    make -C src/test/recovery check

To try it manually, initdb a master, set pg_hba.conf to 'trust' on all replication connections, and append to postgresql.conf:

    wal_level = 'logical'
    max_replication_slots = 4
    max_wal_senders = 4
    hot_standby_feedback = on

then start the master. Now:

    psql -d 'master_connstr' -c "SELECT pg_create_physical_replication_slot('standby1');"

and

    pg_basebackup -d 'master_connstr' -X stream -R --slot=standby1

and start the replica.

You can now use pg_recvlogical to create a slot on the replica and decode changes from it, e.g.

    pg_recvlogical -d 'replica_connstr' -S test -P test_decoding --create-slot
    pg_recvlogical -d 'replica_connstr' -S test -f - --start

and you'll (hopefully) see subsequent changes you make on the master. If not, tell me.

Patch series contents
===

This patch series incorporates the following changes:

* Timeline following for logical slots, so they can start decoding on the correct timeline and follow timeline switches (with tests); originally [3]

* Add --endpos to pg_recvlogical, with tests; originally [3]

* Splitting of xmin and catalog_xmin in hot standby feedback, so logical slots on a replica only hold down catalog_xmin on the master, not the xmin for user tables (with tests). Minimises upstream bloat; originally [4]

* Suppress export of a snapshot when starting logical decoding on a replica. Since we cannot allocate an xid, we cannot export snapshots on standby. Decoding clients can still do their initial setup via a slot on the master then switch over, do it via physical copy, etc.

* Require hot_standby_feedback to be enabled when starting logical decoding on a standby.

* Drop any replication slots from a database when redoing a database drop, so we don't leave dangling slots on the replica (with tests).

* Make the walsender respect SIGUSR1 and exit via RecoveryConflictInterrupt() when it gets PROCSIG_RECOVERY_CONFLICT_DATABASE (with tests); see [6]

* PostgresNode.pm enhancements for the tests

* New test coverage for logical decoding on standby

Remaining issues
===

* The method used to make the walsender respect conflict-with-recovery interrupts may not be entirely safe; see the walsender procsignal_sigusr1_handler thread [5].

* We probably terminate walsenders running inside an output plugin with a virtual xact whose xmin is below the upstream's global xmin, even though their catalog_xmin is fine, in ResolveRecoveryConflictWithSnapshot(...). I haven't been able to test this. We need to only terminate them when the conflict affects relations accessible in logical decoding, which likely needs the upstream to send more info in WAL.

* Logical decoding timeline following needs tests for cascading physical replication where an intermediate node is promoted, per the timeline following thread [3].

* The walsender may need to maintain ThisTimeLineID in more places, per the decoding timeline following v10 thread [3].

* It may be desirable to refactor the walsender to deliver cleaner logical decoding timeline following, per the decoding timeline following v10 thread [3].

Also:

* Nothing stops the user from disabling hot_standby_feedback on the standby or dropping and re-creating the physical slot on the master, causing needed catalog tuples to get vacuumed away. Since it's not going to be safe to check slot shmem state from the hot_standby_feedback verify hook, and we let hot_standby_feedback change at runtime, this is going to be hard to fix comprehensively, so we need to cope with what happens when feedback fails, but:

* We don't yet detect when the upstream's catalog_xmin increases past our needed catalog_xmin and needed catalog tuples are vacuumed away by the upstream. So we don't invalidate the slot or terminate any active decoding sessions using the slot. Active decoding sessions often won't have a vtxid to use with ResolveRecoveryConflictWithVirtualXIDs(), transaction cancel is not going to be sufficient, and anyway it'll cancel too aggressively since it doesn't know it's safe to apply changes that affect only (non-user-catalog) heap tables without conflicting with decoding sessions.

... so this is definitely NOT ready for commit. It does, however, make logical decoding work on standby.

Next steps
===

Since it doesn't look practical to ensure there's never been a gap in hot standby feedback, or to detect such a gap directly, I'm currently looking at ways to reliably detect when the upstream has removed tuples we need, and error out. That means we need a way to tell when the upstream's catalog_xmin has advanced, which we don't currently have from the xlogs. Checkpoint's oldestXID is insufficient since an advance could have happened since the last checkpoint.

Related threads
===

This series supersedes:

* Timeline following for logical slots [1]
  https://www.postgresql.org/message-id/flat/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com

* WIP: Failover Slots [2]
  https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com

and incorporates the patches in:

* Logical decoding timeline following take II [3]
  https://www.postgresql.org/message-id/flat/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com

* Send catalog_xmin separately in hot standby feedback [4]
  https://www.postgresql.org/message-id/CAMsr+YFi-LV7S8ehnwUiZnb=1h_14PwQ25d-vyUNq-f5S5r=Zg@mail.gmail.com

* Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() from walsender? [5]
  https://www.postgresql.org/message-id/CAMsr+YFb3R-t5O0jPGvz9_nsAt2GwwZiLSnYu3=X6mR9RnrbEw@mail.gmail.com

Also relevant:

* Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() in walsender [6]
  https://www.postgresql.org/message-id/CAMsr+YFb3R-t5O0jPGvz9_nsAt2GwwZiLSnYu3=X6mR9RnrbEw@mail.gmail.com

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2016-11-21 16:17:58 +0800, Craig Ringer wrote:
> I've prepared a working initial, somewhat raw implementation for
> logical decoding on physical standbys.

Please attach. Otherwise in a year or two it'll be impossible to look this up.

Andres
On 22 November 2016 at 03:14, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2016-11-21 16:17:58 +0800, Craig Ringer wrote:
>> I've prepared a working initial, somewhat raw implementation for
>> logical decoding on physical standbys.
>
> Please attach. Otherwise in a year or two it'll be impossible to look
> this up.

Fair enough. Attached. Easy to apply with "git am".

I'm currently looking at making detection of replay conflict with a slot work by separating the current catalog_xmin into two effective parts - the catalog_xmin currently needed by any known slots (ProcArray->replication_slot_catalog_xmin, as now), and the oldest actually valid catalog_xmin where we know we haven't removed anything yet. That'll be recorded in a new CheckPoint.oldestCatalogXid field and in ShmemVariableCache (i.e. VariableCacheData.oldestCatalogXid).

Vacuum will be responsible for advancing VariableCacheData.oldestCatalogXid by writing an expanded xl_heap_cleanup_info record with a new latestRemovedCatalogXid field and then advancing the value in the ShmemVariableCache. Vacuum will only remove rows of catalog or user-catalog tables that are older than VariableCacheData.oldestCatalogXid.

This allows recovery on a standby to tell, based on the last checkpoint plus any xl_heap_cleanup_info records used to maintain ShmemVariableCache, whether the upstream has removed catalog or user-catalog records it needs. We can signal walsenders with running xacts to terminate if their xmin passes the threshold, and when they start a new xact they can check to see if they're still valid and bail out.

xl_heap_cleanup_info isn't emitted much, but if adding a field there is a problem we can instead add an additional xlog buffer that's only appended when wal_level = logical.

Doing things this way avoids:

* the need for the standby to be able to tell at redo time whether a RelFileNode is for a catalog or user-catalog relation without access to relcache; or

* the need to add info on whether a catalog or user-catalog is being affected to any xlog record that can cause a snapshot conflict on standby; or

* a completely reliable way to ensure hot_standby_feedback can never cease to affect the master even if the user does something dumb

at the cost of sometimes somewhat delaying removal of catalog or user-catalog tuples when wal_level >= hot_standby, a new CheckPoint field, and a new field in xl_heap_cleanup_info.

The above is not incorporated in the attached patch series; see the prior post for the status of the attached patches.

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
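For illustration, here is a rough sketch of the expanded record described above. This is only a reading of the prose in this mail, not the attached patches: node and latestRemovedXid are the record's existing fields, while latestRemovedCatalogXid is the proposed addition named above.

    /*
     * Sketch only: the existing xl_heap_cleanup_info layout plus the
     * proposed field.  Assumes the usual PostgreSQL internal headers
     * (RelFileNode, TransactionId).
     */
    typedef struct xl_heap_cleanup_info
    {
        RelFileNode     node;
        TransactionId   latestRemovedXid;

        /*
         * Proposed: oldest xid whose catalog / user-catalog tuples may
         * already have been removed.  Redo on a standby would compare
         * this against the catalog_xmin of local logical slots.
         */
        TransactionId   latestRemovedCatalogXid;
    } xl_heap_cleanup_info;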
Attachment
- 0001-Add-an-optional-endpos-LSN-argument-to-pg_recvlogica.patch
- 0002-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch
- 0003-Follow-timeline-switches-in-logical-decoding.patch
- 0004-PostgresNode-methods-to-wait-for-node-catchup.patch
- 0005-Create-new-pg_lsn-class-to-deal-with-awkward-LSNs-in.patch
- 0006-Expand-streaming-replication-tests-to-cover-hot-stan.patch
- 0007-Send-catalog_xmin-in-hot-standby-feedback-protocol.patch
- 0008-Make-walsender-respect-catalog_xmin-in-hot-standby-f.patch
- 0009-Allow-GetOldestXmin-.-to-optionally-disregard-the-ca.patch
- 0010-Send-catalog_xmin-separately-in-hot_standby_feedback.patch
- 0011-Update-comment-on-issues-with-logical-decoding-on-st.patch
- 0012-Don-t-attempt-to-export-a-snapshot-from-CREATE_REPLI.patch
- 0013-ERROR-if-timeline-is-zero-in-walsender.patch
- 0014-Permit-logical-decoding-on-standby-with-a-warning.patch
- 0015-Tests-for-logical-decoding-on-standby.patch
- 0016-Drop-logical-replication-slots-when-redoing-database.patch
- 0017-Allow-walsender-to-exit-on-conflict-with-recovery.patch
- 0018-Tests-for-db-drop-during-decoding-on-standby.patch
On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:
> I'm currently looking at making detection of replay conflict with a
> slot work by separating the current catalog_xmin into two effective
> parts - the catalog_xmin currently needed by any known slots
> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
> actually valid catalog_xmin where we know we haven't removed anything
> yet.

OK, more detailed plan.

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid, are already held down by ProcArray's catalog_xmin. But that doesn't mean we haven't removed newer tuples from specific relations and logged that in xl_heap_clean, etc, including catalogs or user catalogs; it only means the clog still exists for those XIDs.

We don't emit a WAL record when we advance oldestXid in SetTransactionIdLimit(), and doing so is useless because vacuum will have already removed needed tuples from needed catalogs before calling SetTransactionIdLimit() from vac_truncate_clog(). We know that if oldestXid is n, the true valid catalog_xmin where no needed tuples have been removed must be >= n. But we need to know the lower bound of valid catalog_xmin, which oldestXid doesn't give us.

So right now a standby has no way to reliably know if the catalog_xmin requirement for a given replication slot can be satisfied. A standby can't tell based on an xl_heap_cleanup_info record, xl_heap_clean record, etc whether the affected table is a catalog or not, and shouldn't generate conflicts for non-catalogs, since otherwise it'll be constantly clobbering walsenders.

A 2-phase advance of the global catalog_xmin would mean that GetOldestXmin() would return a value from ShmemVariableCache, not the oldest catalog xmin from ProcArray like it does now.

(auto)vacuum would then be responsible for:

* reading the oldest catalog_xmin from procarray;

* if it has advanced vs what's present in ShmemVariableCache, writing a new xlog record type recording an advance of oldest catalog xmin;

* advancing ShmemVariableCache's oldest catalog xmin;

and would do so before it called GetOldestXmin via vacuum_set_xid_limits() to determine what it can remove. (A sketch of that ordering follows below.)

GetOldestXmin would return the ProcArray's copy of the oldest catalog_xmin when in recovery, so we report it via hot_standby_feedback to the upstream, it's recorded on our physical slot, and in turn causes vacuum to advance the master's effective oldest catalog_xmin for vacuum.

On the standby we'd replay the catalog_xmin advance record, advance the standby's ShmemVariableCache's oldest catalog xmin, and check to see if any replication slots, active or not, have a catalog_xmin < the new threshold. If none do, there's no conflict and we're fine. If any do, we wait max_standby_streaming_delay/max_standby_archive_delay as appropriate, then generate recovery conflicts against all backends that have an active replication slot, based on the replication slot state in shmem. Those backends - walsender or normal decoding backend - would promptly die.

New decoding sessions will check the ShmemVariableCache and refuse to start if their catalog_xmin is < the threshold. Since we advance it before generating recovery conflicts there's no race with clients trying to reconnect after their backend is killed with a conflict.
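A minimal sketch of the vacuum-side ordering described above (log the advance first, then publish it locally, before vacuum_set_xid_limits() runs). ProcArrayGetReplicationSlotXmin() is the existing procarray accessor; ShmemVariableCache->oldestCatalogXmin, XactLogCatalogXminAdvance() and the function itself are placeholder names for the proposed additions, not existing symbols:

    /*
     * Hypothetical sketch of the two-phase advance; not from the patch.
     */
    static void
    MaybeAdvanceOldestCatalogXmin(void)
    {
        TransactionId slot_xmin;
        TransactionId slot_catalog_xmin;

        /* Oldest xmin / catalog_xmin still required by any replication slot. */
        ProcArrayGetReplicationSlotXmin(&slot_xmin, &slot_catalog_xmin);

        LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

        if (TransactionIdIsValid(slot_catalog_xmin) &&
            TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
                                  slot_catalog_xmin))
        {
            /* 1. WAL-log the new threshold so standbys see it during redo. */
            XactLogCatalogXminAdvance(slot_catalog_xmin);

            /* 2. Only then let local vacuum rely on the advanced value. */
            ShmemVariableCache->oldestCatalogXmin = slot_catalog_xmin;
        }

        LWLockRelease(XidGenLock);
    }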
If we wanted to get fancy we could set the latches of walsender backends at risk of conflicting, and they could check ShmemVariableCache's oldest valid catalog xmin, so they could send immediate keepalives with reply_requested set and hopefully get flush confirmation from the client and advance their catalog_xmin before we terminate them as conflicting with recovery. But that can IMO be done later/separately.

Going to prototype this.

Alternate approach:
-------------------

The oldest xid in heap_xlog_cleanup_info is relation-specific, but the standby has no way to know if it's a catalog relation or not during redo, and so whether to kill slots and decoding sessions based on its latestRemovedXid. Same for xl_heap_clean and the other records that can cause snapshot conflicts (xl_heap_visible, xl_heap_freeze_page, xl_btree_delete, xl_btree_reuse_page, spgxlogVacuumRedirect).

Instead of adding a 2-phase advance of the global catalog_xmin, we can instead add a rider to each of these records that identifies whether it's a catalog table or not. This would only be emitted when wal_level >= logical, but it *would* increase WAL sizes a bit when logical decoding is enabled even if it's not going to be used on a standby. The rider would be a simple:

    typedef struct xl_rel_catalog_info
    {
        bool rel_accessible_from_logical_decoding;
    } xl_rel_catalog_info;

or similar.

During redo we call a new ResolveRecoveryConflictWithLogicalSlot function from each of those records' redo routines that does what I outlined above.

This way adds more info to more xlog records, and the upstream has to use RelationIsAccessibleInLogicalDecoding() to set up the records when writing the xlogs. In exchange, we don't have to add a new field to CheckPoint or ShmemVariableCache, or add a new xlog record type. It seems the worse option to me.

(BTW, as the comments on GetOldestSafeDecodingTransactionId() note, we can't rely on KnownAssignedXidsGetOldestXmin() since it can be incomplete at least on standby.)

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
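For concreteness, here is a hedged sketch of the standby-side check shared by both variants above. ResolveRecoveryConflictWithLogicalSlot() is just the name proposed in this mail; the slot fields read here (in_use, mutex, data.catalog_xmin) are the existing ReplicationSlot shmem fields, and the actual waiting/cancellation is elided:

    /*
     * Hypothetical sketch only: scan in-memory slots and flag any whose
     * catalog_xmin would no longer be satisfied by the upstream.
     */
    static void
    ResolveRecoveryConflictWithLogicalSlot(TransactionId new_catalog_xmin)
    {
        int     i;

        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
        for (i = 0; i < max_replication_slots; i++)
        {
            ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
            TransactionId    slot_catalog_xmin;

            if (!s->in_use)
                continue;

            SpinLockAcquire(&s->mutex);
            slot_catalog_xmin = s->data.catalog_xmin;
            SpinLockRelease(&s->mutex);

            /* Does this slot still need catalogs the upstream may now remove? */
            if (TransactionIdIsValid(slot_catalog_xmin) &&
                TransactionIdPrecedes(slot_catalog_xmin, new_catalog_xmin))
            {
                /*
                 * Conflict: wait out max_standby_streaming_delay /
                 * max_standby_archive_delay as for other recovery conflicts,
                 * then cancel whichever backend is using the slot (elided).
                 */
            }
        }
        LWLockRelease(ReplicationSlotControlLock);
    }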
On Tue, Nov 22, 2016 at 1:49 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:
>> I'm currently looking at making detection of replay conflict with a
>> slot work by separating the current catalog_xmin into two effective
>> parts - the catalog_xmin currently needed by any known slots
>> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
>> actually valid catalog_xmin where we know we haven't removed anything
>> yet.
>
> OK, more detailed plan.
>
> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
> are already held down by ProcArray's catalog_xmin. But that doesn't
> mean we haven't removed newer tuples from specific relations and
> logged that in xl_heap_clean, etc, including catalogs or user
> catalogs, it only means the clog still exists for those XIDs.

Really?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 23 November 2016 at 03:55, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 22, 2016 at 1:49 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> I'm currently looking at making detection of replay conflict with a
>>> slot work by separating the current catalog_xmin into two effective
>>> parts - the catalog_xmin currently needed by any known slots
>>> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
>>> actually valid catalog_xmin where we know we haven't removed anything
>>> yet.
>>
>> OK, more detailed plan.
>>
>> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
>> are already held down by ProcArray's catalog_xmin. But that doesn't
>> mean we haven't removed newer tuples from specific relations and
>> logged that in xl_heap_clean, etc, including catalogs or user
>> catalogs, it only means the clog still exists for those XIDs.
>
> Really?

(Note the double negative above.)

Yes, necessarily so. You can't look up xids older than the clog truncation threshold at oldestXid, per our discussion on txid_status() and traceable commit. But the tuples from that xact aren't guaranteed to exist in any given relation; vacuum uses vacuum_set_xid_limits(...), which calls GetOldestXmin(...); that in turn scans ProcArray to find the oldest xid any running xact cares about. It might bump it down further if there's a replication slot requirement or based on vacuum_defer_cleanup_age, but it doesn't care in the slightest about oldestXmin.

oldestXmin cannot advance until vacuum has removed all tuples for that xid and advanced the database's datfrozenxid. But a given oldestXmin says nothing about which tuples, catalog or otherwise, still exist and are accessible.

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Nov 23, 2016 at 8:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
>>> are already held down by ProcArray's catalog_xmin. But that doesn't
>>> mean we haven't removed newer tuples from specific relations and
>>> logged that in xl_heap_clean, etc, including catalogs or user
>>> catalogs, it only means the clog still exists for those XIDs.
>>
>> Really?
>
> (Note the double negative above).
>
> Yes, necessarily so. You can't look up xids older than the clog
> truncation threshold at oldestXid, per our discussion on txid_status()
> and traceable commit. But the tuples from that xact aren't guaranteed
> to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)
> which calls GetOldestXmin(...); that in turn scans ProcArray to find
> the oldest xid any running xact cares about. It might bump it down
> further if there's a replication slot requirement or based on
> vacuum_defer_cleanup_age, but it doesn't care in the slightest about
> oldestXmin.
>
> oldestXmin cannot advance until vacuum has removed all tuples for that
> xid and advanced the database's datfrozenxid. But a given oldestXmin
> says nothing about which tuples, catalog or otherwise, still exist and
> are accessible.

Right. Sorry, my mistake.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 26 Nov. 2016 23:40, "Robert Haas" <robertmhaas@gmail.com> wrote:
>
> On Wed, Nov 23, 2016 at 8:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> >>> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
> >>> are already held down by ProcArray's catalog_xmin. But that doesn't
> >>> mean we haven't removed newer tuples from specific relations and
> >>> logged that in xl_heap_clean, etc, including catalogs or user
> >>> catalogs, it only means the clog still exists for those XIDs.
> >>
> >> Really?
> >
> > (Note the double negative above).
> >
> > Yes, necessarily so. You can't look up xids older than the clog
> > truncation threshold at oldestXid, per our discussion on txid_status()
> > and traceable commit. But the tuples from that xact aren't guaranteed
> > to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)
> > which calls GetOldestXmin(...); that in turn scans ProcArray to find
> > the oldest xid any running xact cares about. It might bump it down
> > further if there's a replication slot requirement or based on
> > vacuum_defer_cleanup_age, but it doesn't care in the slightest about
> > oldestXmin.
> >
> > oldestXmin cannot advance until vacuum has removed all tuples for that
> > xid and advanced the database's datfrozenxid. But a given oldestXmin
> > says nothing about which tuples, catalog or otherwise, still exist and
> > are accessible.
>
> Right. Sorry, my mistake.

Phew. Had me worried there.

Thanks for looking over it. Prototype looks promising so far.
Hi,

I did look at the code a bit. The first 6 patches seem reasonable. I don't understand why some patches are separate tbh (like 7-10, or 11).

About the 0009:

> diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
> index 9985e3e..4fa3ad4 100644
> --- a/contrib/pg_visibility/pg_visibility.c
> +++ b/contrib/pg_visibility/pg_visibility.c
> @@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
>  	if (all_visible)
>  	{
>  		/* Don't pass rel; that will fail in recovery. */
> -		OldestXmin = GetOldestXmin(NULL, true);
> +		OldestXmin = GetOldestXmin(NULL, true, false);
>  	}
>
>  	rel = relation_open(relid, AccessShareLock);
> @@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
>  		 * a buffer lock. And this shouldn't happen often, so it's
>  		 * worth being careful so as to avoid false positives.
>  		 */
> -		RecomputedOldestXmin = GetOldestXmin(NULL, true);
> +		RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
>
>  		if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
>  			record_corrupt_item(items, &tuple.t_self);
> diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
> index f524fc4..5b33c97 100644
> --- a/contrib/pgstattuple/pgstatapprox.c
> +++ b/contrib/pgstattuple/pgstatapprox.c
> @@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
>  	TransactionId OldestXmin;
>  	uint64		misc_count = 0;
>
> -	OldestXmin = GetOldestXmin(rel, true);
> +	OldestXmin = GetOldestXmin(rel, true, false);
>  	bstrategy = GetAccessStrategy(BAS_BULKREAD);
>
>  	nblocks = RelationGetNumberOfBlocks(rel);

This does not seem correct, you are sending false as a pointer parameter.

0012:

I think there should be a parameter saying if the snapshot should be exported or not, and if the user asks for it on standby it should fail.

0014 makes 0011 even more pointless.

Not going into deeper detail as this is still very WIP. I do agree with the general design though.

This also replaces the previous timeline following and decoding threads/CF entries, so maybe those should be closed in CF?

--
Petr Jelinek                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>> --- a/contrib/pgstattuple/pgstatapprox.c
>> +++ b/contrib/pgstattuple/pgstatapprox.c
>> @@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
>>  	TransactionId OldestXmin;
>>  	uint64		misc_count = 0;
>>
>> -	OldestXmin = GetOldestXmin(rel, true);
>> +	OldestXmin = GetOldestXmin(rel, true, false);
>>  	bstrategy = GetAccessStrategy(BAS_BULKREAD);
>>
>>  	nblocks = RelationGetNumberOfBlocks(rel);
>
> This does not seem correct, you are sending false as pointer parameter.

Thanks. That's an oversight from the GetOldestXmin interface change per your prior feedback. C doesn't care, since null is 0 and false is 0, and I missed it when transforming the patch.

> 0012:
>
> I think there should be parameter saying if snapshot should be exported
> or not and if user asks for it on standby it should fail.

Sounds reasonable. That also means clients can suppress snapshot export on the master, which as we recently learned can be desirable sometimes.

> 0014 makes 0011 even more pointless.

Yeah, as I said, it's a bit WIP still and needs some rebasing and rearrangement.

> This also replaces the previous timeline following and decoding
> threads/CF entries so maybe those should be closed in CF?

I wasn't sure what to do about that, since it's all a set of related functionality. I think it's going to get more traction as a single "logical decoding on standby" feature though, since the other parts are hard to test and use in isolation. So yeah, probably, I'll do so unless someone objects.

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 21 November 2016 at 16:17, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> I've prepared a working initial, somewhat raw implementation for
> logical decoding on physical standbys.

Hi all

I've attached a significantly revised patch, which now incorporates safeguards to ensure that we prevent decoding if the master has not retained needed catalogs, and cancel decoding sessions that are holding up apply because they need too-old catalogs.

The biggest change in this patch, and the main intrusive part, is that procArray->replication_slot_catalog_xmin is no longer directly used by vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is added, with a corresponding CheckPoint field. Vacuum notices if procArray->replication_slot_catalog_xmin has advanced past ShmemVariableCache->oldestCatalogXmin and writes a new xact rmgr record with the new value before it copies it to oldestCatalogXmin. This means that a standby can now reliably tell when catalogs are about to be removed or become candidates for removal, so it can pause redo until logical decoding sessions on the standby advance far enough that their catalog_xmin passes that point. It also means that if our hot_standby_feedback somehow fails to lock in the catalogs our slots need on a standby, we can cancel sessions with a conflict with recovery.

If wal_level is < logical this won't do anything, since replication_slot_catalog_xmin and oldestCatalogXmin will both always be 0.

Because oldestCatalogXmin advances eagerly as soon as vacuum sees the new replication_slot_catalog_xmin, this won't impact catalog bloat.

Ideally this mechanism won't generally actually be needed, since hot_standby_feedback stops the master from removing needed catalogs, and we make an effort to ensure that the standby has hot_standby_feedback enabled and is using a replication slot. We cannot prevent the user from dropping and re-creating the physical slot on the upstream, though, and it doesn't look simple to stop them turning off hot_standby_feedback or turning off use of a physical slot after creating logical slots, either. So we try to stop users shooting themselves in the foot, but if they do it anyway we notice and cope gracefully. Logging catalog_xmin also helps slots created on standbys know where to start, and makes sure we can deal gracefully with a race between hs_feedback and slot creation on a standby.

There can be a significant delay for slot creation on standby since we have to wait until there's a new xl_running_xacts record logged. I'd like to extend the hot_standby_feedback protocol a little to address that and some other issues, but that's a separate step.

I haven't addressed Petr's point yet, that "there should be parameter saying if snapshot should be exported or not and if user asks for it on standby it should fail".

Otherwise I think it's looking fairly solid.

Due to the amount of churn I landed up flattening the patchset. It probably makes sense to split it up, likely into the sequence of changes listed in the commit message. I'll wait for a general opinion on the validity of this approach first.

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
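To make the shape of that change concrete, here is a hypothetical sketch of the new xact rmgr record and its redo handling. The record layout, the locking, and the names (xl_xact_catalog_xmin_advance, ResolveRecoveryConflictWithLogicalSlot) are placeholders taken from this discussion, not the identifiers used in the attached patch:

    /* Sketch only; not from the attached patch. */
    typedef struct xl_xact_catalog_xmin_advance
    {
        TransactionId new_catalog_xmin; /* oldest catalog_xmin vacuum will honour */
    } xl_xact_catalog_xmin_advance;

    static void
    xact_redo_catalog_xmin_advance(XLogReaderState *record)
    {
        xl_xact_catalog_xmin_advance *xlrec =
            (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);

        /*
         * Catalog or user-catalog tuples older than new_catalog_xmin may
         * already be gone on the master: conflict with any local slot that
         * still needs them, then publish the new threshold so new decoding
         * sessions refuse to start below it.
         */
        ResolveRecoveryConflictWithLogicalSlot(xlrec->new_catalog_xmin);

        LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
        ShmemVariableCache->oldestCatalogXmin = xlrec->new_catalog_xmin;
        LWLockRelease(XidGenLock);
    }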
Attachment
- 0001-PostgresNode-methods-to-wait-for-node-catchup.patch
- 0002-Create-new-pg_lsn-class-to-deal-with-awkward-LSNs-in.patch
- 0003-Add-an-optional-endpos-LSN-argument-to-pg_recvlogica.patch
- 0004-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch
- 0005-Follow-timeline-switches-in-logical-decoding.patch
- 0006-Expand-streaming-replication-tests-to-cover-hot-stan.patch
- 0007-Don-t-attempt-to-export-a-snapshot-from-CREATE_REPLI.patch
- 0008-ERROR-if-timeline-is-zero-in-walsender.patch
- 0009-Logical-decoding-on-standby.patch
On 07/12/16 07:05, Craig Ringer wrote:
> On 21 November 2016 at 16:17, Craig Ringer <craig@2ndquadrant.com> wrote:
>> Hi all
>>
>> I've prepared a working initial, somewhat raw implementation for
>> logical decoding on physical standbys.
>
> Hi all
>
> I've attached a significantly revised patch, which now incorporates
> safeguards to ensure that we prevent decoding if the master has not
> retained needed catalogs and cancel decoding sessions that are holding
> up apply because they need too-old catalogs
>
> The biggest change in this patch, and the main intrusive part, is that
> procArray->replication_slot_catalog_xmin is no longer directly used by
> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
> added, with a corresponding CheckPoint field. Vacuum notices if
> procArray->replication_slot_catalog_xmin has advanced past
> ShmemVariableCache->oldestCatalogXmin and writes a new xact rmgr
> record with the new value before it copies it to oldestCatalogXmin.
> This means that a standby can now reliably tell when catalogs are
> about to be removed or become candidates for removal, so it can pause
> redo until logical decoding sessions on the standby advance far enough
> that their catalog_xmin passes that point. It also means that if our
> hot_standby_feedback somehow fails to lock in the catalogs our slots
> need on a standby, we can cancel sessions with a conflict with
> recovery.
>
> If wal_level is < logical this won't do anything, since
> replication_slot_catalog_xmin and oldestCatalogXmin will both always
> be 0.
>
> Because oldestCatalogXmin advances eagerly as soon as vacuum sees the
> new replication_slot_catalog_xmin, this won't impact catalog bloat.
>
> Ideally this mechanism won't generally actually be needed, since
> hot_standby_feedback stops the master from removing needed catalogs,
> and we make an effort to ensure that the standby has
> hot_standby_feedback enabled and is using a replication slot. We
> cannot prevent the user from dropping and re-creating the physical
> slot on the upstream, though, and it doesn't look simple to stop them
> turning off hot_standby_feedback or turning off use of a physical slot
> after creating logical slots, either. So we try to stop users shooting
> themselves in the foot, but if they do it anyway we notice and cope
> gracefully. Logging catalog_xmin also helps slots created on standbys
> know where to start, and makes sure we can deal gracefully with a race
> between hs_feedback and slot creation on a standby.
>

Hi,

If this mechanism would not be needed most of the time, wouldn't it be better to not have it and just have a way to ask the physical slot about what's the current reserved catalog_xmin (in most cases the standby should actually know what it is since it's sending the hs_feedback, but the first slot created on such standby may need to check)?

WRT preventing hs_feedback going off, we can refuse to start with hs_feedback off when there are logical slots detected. We can also refuse to connect to the master without a physical slot if there are logical slots detected. I don't see a problem with either of those.

You may ask what if the user drops the slot and recreates, or somehow otherwise messes up catalog_xmin on the master; well, considering that under my proposal we'd first (on connect) check the slot for catalog_xmin, we'd know about it, so we'd either mark the local slots broken/drop them or plainly refuse to connect to the master, the same way as if it didn't have the required WAL anymore (not sure which behavior is better).
Note that the user could mess up catalog_xmin even in your design the same way, so it's not really a regression.

In general I don't think that it's necessary to WAL log anything for this. It will not work without a slot, and therefore via archiving, anyway, so writing to WAL does not seem to buy us anything. There are some interesting side effects of cascading (ie having A->B->C replication and creating a logical slot on C) but those should not be insurmountable. Plus it might even be okay to only allow creating logical slots on standbys connected directly to the master in v1.

That's about the approach; but since there are prerequisite patches in the patchset that don't really depend on the approach, I will comment about them as well.

0001 and 0002 add testing infrastructure and look fine to me, possibly committable.

But in 0003 I don't understand the following code:

> +	if (endpos != InvalidXLogRecPtr && !do_start_slot)
> +	{
> +		fprintf(stderr,
> +				_("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
> +				progname);
> +		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
> +				progname);
> +		exit(1);
> +	}

Why is --create-slot and --endpos not allowed together?

0004 again looks good but depends on 0003.

0005 is timeline following which is IMHO ready for committer, as is 0006 and 0008, and I still maintain the opinion that these should go in soon.

0007 is unfinished as you said in your mail (missing option to specify behavior). And the last one, 0009, is the implementation discussed above, which I think needs rework. IMHO 0007 and 0009 should be ultimately merged.

I think parts of this could be committed separately and are ready for committer IMHO, but there is no way in the CF application to mark only part of a patch-set for committer to attract the attention.

--
Petr Jelinek                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>>> The biggest change in this patch, and the main intrusive part, is that
>>> procArray->replication_slot_catalog_xmin is no longer directly used by
>>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
>>> added, with a corresponding CheckPoint field.
>>> [snip]
>>
>> If this mechanism would not be needed most of the time, wouldn't it be
>> better to not have it and just have a way to ask physical slot about
>> what's the current reserved catalog_xmin (in most cases the standby
>> should actually know what it is since it's sending the hs_feedback, but
>> first slot created on such standby may need to check).

Yes, and that was actually my originally preferred approach, though the one above does offer the advantage that if something goes wrong we can detect it and fail gracefully. Possibly not worth the complexity though.

Your approach requires us to make very sure that hot_standby_feedback does not get turned off by the user or become ineffective once we're replicating, though, since we won't have any way to detect when needed tuples are removed. We'd probably just bail out with relcache/syscache lookup errors, but I can't guarantee we wouldn't crash if we tried logical decoding on WAL where needed catalogs have been removed.

I initially ran into trouble doing that, but now think I have a way forward.

>> WRT preventing
>> hs_feedback going off, we can refuse to start with hs_feedback off when
>> there are logical slots detected.

Yes. There are some ordering issues there though. We load slots quite late in startup and they don't get tracked in checkpoints, so we might launch the walreceiver before we load slots and notice their needed xmin/catalog_xmin. So we need to prevent sending of hot_standby_feedback until slots are loaded, or load slots earlier in startup. The former sounds less intrusive and safer - probably just add an "initialized" flag to ReplicationSlotCtlData and suppress hs_feedback until it becomes true.

We'd also have to suppress the validation callback action on the hot_standby_feedback GUC until we know replication slot state is initialised, and perform the check during slot startup instead. The hot_standby_feedback GUC validation check might get called before shmem is even set up, so we have to guard against attempts to access a shmem segment that may not even exist yet.

The general idea is workable though. Refuse to start if logical slots exist and hot_standby_feedback is off or the walreceiver isn't using a physical slot. Refuse to allow hot_standby_feedback to change.

>> We can also refuse to connect to the
>> master without physical slot if there are logical slots detected. I
>> don't see problem with either of those.

Agreed.

We must also be able to reliably enforce that the walreceiver is using a replication slot to connect to the master and refuse to start if it is not. The user could change recovery.conf and restart the walreceiver while we're running, after we perform that check, so the walreceiver must also refuse to start if logical replication slots exist but it has no primary slot name configured.
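A rough sketch of what such a startup-time check might look like, under the assumption of a helper that can tell whether any logical slots exist before the slot shmem state is fully initialised. LogicalSlotsExistOnDisk() and the function itself are hypothetical names; hot_standby_feedback and PrimarySlotName are the existing GUC/recovery.conf variables:

    /*
     * Hypothetical sketch only: refuse to continue recovery if logical
     * slots exist but the configuration can't protect their catalogs.
     * LogicalSlotsExistOnDisk() is an assumed helper that scans pg_replslot.
     */
    static void
    CheckStandbyCanProtectLogicalSlots(void)
    {
        if (!RecoveryInProgress())
            return;

        if (!LogicalSlotsExistOnDisk())
            return;

        if (!hot_standby_feedback)
            ereport(FATAL,
                    (errmsg("logical replication slots exist on this standby, "
                            "but hot_standby_feedback is off")));

        if (PrimarySlotName == NULL || PrimarySlotName[0] == '\0')
            ereport(FATAL,
                    (errmsg("logical replication slots exist on this standby, "
                            "but primary_slot_name is not set")));
    }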
> You may ask what if user drops the slot and recreates or somehow
> otherwise messes up catalog_xmin on master, well, considering that under
> my proposal we'd first (on connect) check the slot for catalog_xmin we'd
> know about it so we'd either mark the local slots broken/drop them or
> plainly refuse to connect to the master same way as if it didn't have
> required WAL anymore (not sure which behavior is better). Note that user
> could mess up catalog_xmin even in your design the same way, so it's not
> really a regression.

Agreed. Checking the catalog_xmin of the slot when we connect is sufficient to guard against that, assuming we can trust that the catalog_xmin is actually in effect on the master. Consider cascading setups, where we set our catalog_xmin but it might not be "locked in" until the middle cascaded server relays it to the master.

I have a proposed solution to that which I'll outline in a separate patch+post; it ties in to some work on addressing the race between hot standby feedback taking effect and queries starting on the hot standby. It boils down to "add a hot_standby_feedback reply protocol message".

> Plus
> it might even be okay to only allow creating logical slots on standbys
> connected directly to master in v1.

True. I didn't consider that.

We haven't had much luck in the past with such limitations, but personally I'd consider it a perfectly reasonable one.

> But in 0003 I don't understand following code:
>> +	if (endpos != InvalidXLogRecPtr && !do_start_slot)
>> +	{
>> +		fprintf(stderr,
>> +				_("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>> +				progname);
>> +		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> +				progname);
>> +		exit(1);
>> +	}
>
> Why is --create-slot and --endpos not allowed together?

What would --create-slot with --endpos do?

Ah. I had not realised that it is legal to do

    pg_recvlogical -S test --create-slot --start -f - -d 'test'

i.e. "create a slot then in the same invocation begin decoding from it". I misread and thought that --create-slot and --start were mutually exclusive. I will address that.

> 0005 is timeline following which is IMHO ready for committer, as is 0006
> and 0008 and I still maintain the opinion that these should go in soon.

I wonder if I should re-order 0005 and 0006 so we can commit the hot_standby test improvements before logical decoding timeline following.

> I think parts of this could be committed separately and are ready for
> committer IMHO, but there is no way in CF application to mark only part
> of patch-set for committer to attract the attention.

Yeah. I raised that before and nobody was really sure what to do about it. It's confusing to post patches on the same thread under separate CF entries. It's also confusing to post patches on a nest of inter-related threads to allow each thread to be tracked by a separate CF entry.

At the moment I'm aiming to progressively get the underlying infrastructure/test stuff in so we can focus on the core feature.

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> But in 0003 I don't understand following code:
>> +	if (endpos != InvalidXLogRecPtr && !do_start_slot)
>> +	{
>> +		fprintf(stderr,
>> +				_("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>> +				progname);
>> +		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> +				progname);
>> +		exit(1);
>> +	}
>
> Why is --create-slot and --endpos not allowed together?

Actually, the test is fine; the error is just misleading due to my misunderstanding.

The fix is simply to change the error message to

    _("%s: --endpos may only be specified with --start\n"),

so I won't post a separate followup patch.

Okano Naoki tried to bring this to my attention earlier, but I didn't understand, as I hadn't yet realised you could use --create-slot --start; they weren't mutually exclusive.

(We test to ensure --start --drop-slot isn't specified earlier, so no test for do_drop_slot is required.)

--
Craig Ringer                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 21/12/16 04:31, Craig Ringer wrote:
> On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>
>> But in 0003 I don't understand following code:
>>> +	if (endpos != InvalidXLogRecPtr && !do_start_slot)
>>> +	{
>>> +		fprintf(stderr,
>>> +				_("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>>> +				progname);
>>> +		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>>> +				progname);
>>> +		exit(1);
>>> +	}
>>
>> Why is --create-slot and --endpos not allowed together?
>
> Actually, the test is fine, the error is just misleading due to my
> misunderstanding.
>
> The fix is simply to change the error message to
>
>     _("%s: --endpos may only be specified with --start\n"),
>
> so I won't post a separate followup patch.

Ah okay, makes sense. The --create-slot + --endpos combination should definitely be allowed, especially now that we can extend this to optionally use a temporary slot.

--
Petr Jelinek                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 21/12/16 04:06, Craig Ringer wrote:
> On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>
>>> The biggest change in this patch, and the main intrusive part, is that
>>> procArray->replication_slot_catalog_xmin is no longer directly used by
>>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
>>> added, with a corresponding CheckPoint field.
>>> [snip]
>>
>> If this mechanism would not be needed most of the time, wouldn't it be
>> better to not have it and just have a way to ask physical slot about
>> what's the current reserved catalog_xmin (in most cases the standby
>> should actually know what it is since it's sending the hs_feedback, but
>> first slot created on such standby may need to check).
>
> Yes, and that was actually my originally preferred approach, though
> the one above does offer the advantage that if something goes wrong we
> can detect it and fail gracefully. Possibly not worth the complexity
> though.
>
> Your approach requires us to make very sure that hot_standby_feedback
> does not get turned off by user or become ineffective once we're
> replicating, though, since we won't have any way to detect when needed
> tuples are removed. We'd probably just bail out with relcache/syscache
> lookup errors, but I can't guarantee we wouldn't crash if we tried
> logical decoding on WAL where needed catalogs have been removed.
>
> I initially ran into trouble doing that, but now think I have a way forward.
>
>> WRT preventing
>> hs_feedback going off, we can refuse to start with hs_feedback off when
>> there are logical slots detected.
>
> Yes. There are some ordering issues there though. We load slots quite
> late in startup and they don't get tracked in checkpoints. So we might
> launch the walreceiver before we load slots and notice their needed
> xmin/catalog_xmin. So we need to prevent sending of
> hot_standby_feedback until slots are loaded, or load slots earlier in
> startup. The former sounds less intrusive and safer - probably just
> add an "initialized" flag to ReplicationSlotCtlData and suppress
> hs_feedback until it becomes true.
>
> We'd also have to suppress the validation callback action on the
> hot_standby_feedback GUC until we know replication slot state is
> initialised, and perform the check during slot startup instead. The
> hot_standby_feedback GUC validation check might get called before
> shmem is even set up so we have to guard against attempts to access a
> shmem segment that may not even exist yet.
>
> The general idea is workable though. Refuse to start if logical slots
> exist and hot_standby_feedback is off or walreceiver isn't using a
> physical slot. Refuse to allow hot_standby_feedback to change
>

These are all problems associated with replication slots being in shmem, if I understand correctly. I wonder, could we put just a bool someplace which is available early that says if there are any logical slots defined? We don't actually need all the slot info, just to know if there are some.

>>> You may ask what if user drops the slot and recreates or somehow
>>> otherwise messes up catalog_xmin on master, well, considering that under
>>> my proposal we'd first (on connect) check the slot for catalog_xmin we'd
>>> know about it so we'd either mark the local slots broken/drop them or
>>> plainly refuse to connect to the master same way as if it didn't have
>>> required WAL anymore (not sure which behavior is better).
>>> Note that user
>>> could mess up catalog_xmin even in your design the same way, so it's not
>>> really a regression.
>
> Agreed. Checking catalog_xmin of the slot when we connect is
> sufficient to guard against that, assuming we can trust that the
> catalog_xmin is actually in effect on the master. Consider cascading
> setups, where we set our catalog_xmin but it might not be "locked in"
> until the middle cascaded server relays it to the master.
>
> I have a proposed solution to that which I'll outline in a separate
> patch+post; it ties in to some work on addressing the race between hot
> standby feedback taking effect and queries starting on the hot
> standby. It boils down to "add a hot_standby_feedback reply protocol
> message".
>
>>> Plus
>>> it might even be okay to only allow creating logical slots on standbys
>>> connected directly to master in v1.
>
> True. I didn't consider that.
>
> We haven't had much luck in the past with such limitations, but
> personally I'd consider it a perfectly reasonable one.

I think it's infinitely better with that limitation than the status quo. Especially for the failover scenario (you usually won't fail over to a replica down the cascade as it's always more behind). Not to mention that with every level of cascading you get automatically more lag, which means more bloat, so it might not even be all that desirable to go that route immediately in v1 when we don't have a way to control that bloat/maximum slot lag.

--
Petr Jelinek                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 20, 2016 at 10:06 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>>> The biggest change in this patch, and the main intrusive part, is that
>>> procArray->replication_slot_catalog_xmin is no longer directly used by
>>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
>>> added, with a corresponding CheckPoint field.
>>> [snip]
>>
>> If this mechanism would not be needed most of the time, wouldn't it be
>> better to not have it and just have a way to ask physical slot about
>> what's the current reserved catalog_xmin (in most cases the standby
>> should actually know what it is since it's sending the hs_feedback, but
>> first slot created on such standby may need to check).
>
> Yes, and that was actually my originally preferred approach, though
> the one above does offer the advantage that if something goes wrong we
> can detect it and fail gracefully. Possibly not worth the complexity
> though.
>
> Your approach requires us to make very sure that hot_standby_feedback
> does not get turned off by user or become ineffective once we're
> replicating, though, since we won't have any way to detect when needed
> tuples are removed. We'd probably just bail out with relcache/syscache
> lookup errors, but I can't guarantee we wouldn't crash if we tried
> logical decoding on WAL where needed catalogs have been removed.

I dunno, Craig, I think your approach sounds more robust. It's not very nice to introduce a bunch of random prohibitions on what works with what, and it doesn't sound like it's altogether watertight anyway. Incorporating an occasional, small record into the WAL stream to mark the advancement of the reserved catalog_xmin seems like a cleaner and safer solution. We certainly do NOT want to find out about corruption only because of random relcache/syscache lookup failures, let alone crashes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 20, 2016 at 4:03 PM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> That's about approach, but since there are prerequisite patches in the
> patchset that don't really depend on the approach I will comment about
> them as well.
>
> 0001 and 0002 add testing infrastructure and look fine to me, possibly
> committable.
>
> But in 0003 I don't understand following code:
>> +	if (endpos != InvalidXLogRecPtr && !do_start_slot)
>> +	{
>> +		fprintf(stderr,
>> +				_("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>> +				progname);
>> +		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> +				progname);
>> +		exit(1);
>> +	}
>
> Why is --create-slot and --endpos not allowed together?
>
> 0004 again looks good but depends on 0003.
>
> 0005 is timeline following which is IMHO ready for committer, as is 0006
> and 0008 and I still maintain the opinion that these should go in soon.
>
> 0007 is unfinished as you said in your mail (missing option to specify
> behavior). And the last one 0009 is the implementation discussed above,
> which I think needs rework. IMHO 0007 and 0009 should be ultimately merged.
>
> I think parts of this could be committed separately and are ready for
> committer IMHO, but there is no way in CF application to mark only part
> of patch-set for committer to attract the attention.

Craig has pinged me about looking at 0001, 0002, 0004 and 0006 as those involve the TAP infrastructure.

So, for 0001:

--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -93,6 +93,7 @@ use RecursiveCopy;
 use Socket;
 use Test::More;
 use TestLib ();
+use pg_lsn qw(parse_lsn);
 use Scalar::Util qw(blessed);

This depends on 0002, so the order should be reversed.

+sub lsn
+{
+	my $self = shift;
+	return $self->safe_psql('postgres', 'select case when pg_is_in_recovery() then pg_last_xlog_replay_location() else pg_current_xlog_insert_location() end as lsn;');
+}

The design of the test should be in charge of choosing which value it wants to get, and the routine should just blindly do the work. More flexibility is more useful to design tests. So it would be nice to have one routine able to switch at will between 'flush', 'insert', 'write', 'receive' and 'replay' modes to get values from the corresponding xlog functions.

-	die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
+	die "error running SQL: '$$stderr'\nwhile running '@psql_params' with sql '$sql'"
 	  if $ret == 3;

That's separate from this patch, and definitely useful.

+	if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
+		die "valid modes are restart, confirmed_flush";
+	}
+	if (!defined($target_lsn)) {
+		$target_lsn = $self->lsn;
+	}

That's not really the style followed by the perl scripts present in the code regarding the use of the brackets. Do we really need to care about the object type checks by the way?

Regarding wait_for_catchup, there are two ways to do things. Either query the standby like in the way 004_timeline_switch.pl does it, or the way this routine does. The way of this routine looks more straight-forward IMO, and other tests should be refactored to use it. In short I would make the target LSN a mandatory argument, and have the caller send a standby's application_name instead of a PostgresNode object, the current way to enforce the value of $standby_name being really confusing.

+	my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1, 'replay' => 1 );

What's actually the point of 'sent'?
+	my @fields = ('plugin', 'slot_type', 'datoid', 'database', 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
+	my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ', @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'");
+	$result = undef if $result eq '';
+	# hash slice, see http://stackoverflow.com/a/16755894/398670 .

Couldn't this portion be made more generic? Other queries could benefit from that by having a routine that accepts as argument an array of column names for example.

Now looking at 0002....

One whitespace:
$ git diff HEAD~1 --check
src/test/perl/pg_lsn.pm:139: trailing whitespace.
+=cut

pg_lsn sounds like a fine name, now we are more used to camel case for module names. And routines are written as lower case separated by an underscore.

+++ b/src/test/perl/t/002_pg_lsn.pl
@@ -0,0 +1,68 @@
+use strict;
+use warnings;
+use Test::More tests => 42;
+use Scalar::Util qw(blessed);

Most of the tests added don't have a description. This makes things harder to debug when tracking an issue.

It may be good to begin using this module within the other tests in this patch as well. Now do we actually need it? Most of the existing tests I recall rely on the backend's operators for the pg_lsn data type, so this is actually duplicating an existing facility. And all the values are just passed as-is.

+++ b/src/test/perl/t/001_load.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+use Test::More tests => 5;

I can guess the meaning of this test, having a comment on top of it to explain the purpose of the test is good practice though.

Looking at 0004...

+Disallows pg_recvlogial from internally retrying on error by passing --no-loop.
s/pg_recvlogial/pg_recvlogical

+sub pg_recvlogical_upto
+{

This looks like a good idea for your tests.

+my $endpos = $node_master->safe_psql('postgres', "SELECT location FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY location DESC LIMIT 1;");
+diag "waiting to replay $endpos";

On the same wave as the pg_recvlogical wrapper, you may want to consider some kind of wrapper at SQL level calling the slot functions.

And finally 0006.

+$node_standby_1->append_conf('recovery.conf', "primary_slot_name = $slotname_1\n");
+$node_standby_1->append_conf('postgresql.conf', "wal_receiver_status_interval = 1\n");
+$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");

No need to call this routine multiple times.

Increasing the test coverage is definitely worth it.
--
Michael
On 22 December 2016 at 13:43, Michael Paquier <michael.paquier@gmail.com> wrote:
> So, for 0001:
> --- a/src/test/perl/PostgresNode.pm
> +++ b/src/test/perl/PostgresNode.pm
> @@ -93,6 +93,7 @@ use RecursiveCopy;
>  use Socket;
>  use Test::More;
>  use TestLib ();
> +use pg_lsn qw(parse_lsn);
>  use Scalar::Util qw(blessed);
> This depends on 0002, so the order should be reversed.

Will do. That was silly. I think I should probably also move the standby tests earlier, then add a patch to update them when the results change.

> +sub lsn
> +{
> +	my $self = shift;
> +	return $self->safe_psql('postgres', 'select case when
> pg_is_in_recovery() then pg_last_xlog_replay_location() else
> pg_current_xlog_insert_location() end as lsn;');
> +}
> The design of the test should be in charge of choosing which value it
> wants to get, and the routine should just blindly do the work. More
> flexibility is more useful to design tests. So it would be nice to
> have one routine able to switch at will between 'flush', 'insert',
> 'write', 'receive' and 'replay' modes to get values from the
> corresponding xlog functions.

Fair enough. I can amend that.

> -	die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
> +	die "error running SQL: '$$stderr'\nwhile running
> '@psql_params' with sql '$sql'"
> 	  if $ret == 3;
> That's separate from this patch, and definitely useful.

Yeah. Slipped through. I don't think it really merits a separate patch though tbh.

> +	if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
> +		die "valid modes are restart, confirmed_flush";
> +	}
> +	if (!defined($target_lsn)) {
> +		$target_lsn = $self->lsn;
> +	}
> That's not really the style followed by the perl scripts present in
> the code regarding the use of the brackets. Do we really need to care
> about the object type checks by the way?

Brackets: will look / fix.

Type checks (not in quoted snippet above): that's a convenience to let you pass a PostgresNode instance or a string name. Maybe there's a more idiomatic Perl-y way to write it. My Perl is pretty dire.

> Regarding wait_for_catchup, there are two ways to do things. Either
> query the standby like in the way 004_timeline_switch.pl does it or
> the way this routine does. The way of this routine looks more
> straight-forward IMO, and other tests should be refactored to use it.
> In short I would make the target LSN a mandatory argument, and have
> the caller send a standby's application_name instead of a PostgresNode
> object, the current way to enforce the value of $standby_name being
> really confusing.

Hm, ok. I'll take a look. Making LSN mandatory so you have to pass $self->lsn is ok with me.

> +	my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1,
> 'replay' => 1 );
> What's actually the point of 'sent'?

Pretty useless, but we expose it in Pg, so we might as well in the tests.

> +	my @fields = ('plugin', 'slot_type', 'datoid', 'database',
> 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
> +	my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
> @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
> '$slot_name'");
> +	$result = undef if $result eq '';
> +	# hash slice, see http://stackoverflow.com/a/16755894/398670 .
> Couldn't this portion be made more generic? Other queries could
> benefit from that by having a routine that accepts as argument an
> array of column names for example.

Yeah, probably. I'm not sure where it should live though - TestLib.pm?
Not sure if there's an idiomatic way to pass a string (in this case a query) in Perl with a placeholder for interpolation of values (in this case columns). In Python you'd pass it with pre-defined %(placeholders)s markers for the % operator. > Now looking at 0002.... > One whitespace: > $ git diff HEAD~1 --check > src/test/perl/pg_lsn.pm:139: trailing whitespace. > +=cut Will fix. > pg_lsn sounds like a fine name, though we are more used to camel case for > module names. And routines are written as lower case separated by > underscores. Unsure what the intent of this is. > +++ b/src/test/perl/t/002_pg_lsn.pl > @@ -0,0 +1,68 @@ > +use strict; > +use warnings; > +use Test::More tests => 42; > +use Scalar::Util qw(blessed); > Most of the tests added don't have a description. This makes things > harder to debug when tracking an issue. > > It may be good to begin using this module within the other tests in > this patch as well. Now do we actually need it? Most of the existing > tests I recall rely on the backend's operators for the pg_lsn data > type, so this is actually duplicating an existing facility. And all the > values are just passed as-is. I added it mainly for ordered tests of whether some expected lsn had passed/increased. But maybe it makes sense to just call into the server and let it evaluate such tests. > +++ b/src/test/perl/t/001_load.pl > @@ -0,0 +1,9 @@ > +use strict; > +use warnings; > +use Test::More tests => 5; > I can guess the meaning of this test, but having a comment on top of it to > explain its purpose would be good practice. Will. > Looking at 0004... > +Disallows pg_recvlogial from internally retrying on error by passing --no-loop. > s/pg_recvlogial/pg_recvlogical Thanks. > +sub pg_recvlogical_upto > +{ > This looks like a good idea for your tests. Yeah, and likely others too as we start doing more with logical replication in future. > +my $endpos = $node_master->safe_psql('postgres', "SELECT location > FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY > location DESC LIMIT 1;"); > +diag "waiting to replay $endpos"; > Along the same lines as the pg_recvlogical wrapper, you may want to > consider some kind of wrapper at SQL level calling the slot functions. I'd really rather beg off that until needed later. The SQL functions are simple to invoke from PostgresNode::psql in the meantime; not so much so with pg_recvlogical. > And finally 0006. > +$node_standby_1->append_conf('recovery.conf', "primary_slot_name = > $slotname_1\n"); > +$node_standby_1->append_conf('postgresql.conf', > "wal_receiver_status_interval = 1\n"); > +$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n"); > No need to call this routine multiple times. > > Increasing the test coverage is definitely worth it. Thanks. I'll follow up with amendments. I've also implemented Petr's suggestion to allow explicit omission of a snapshot on slot creation. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
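For reference, the mode-switching lsn() helper Michael asks for earlier in this exchange could look something like the sketch below. It uses the 9.6-era xlog location function names; the helper that eventually gets committed may differ in details.

    sub lsn
    {
        my ($self, $mode) = @_;
        my %modes = (
            'insert'  => 'pg_current_xlog_insert_location()',
            'flush'   => 'pg_current_xlog_flush_location()',
            'write'   => 'pg_current_xlog_location()',
            'receive' => 'pg_last_xlog_receive_location()',
            'replay'  => 'pg_last_xlog_replay_location()');

        die "unknown mode for 'lsn': '$mode', valid modes are " . join(', ', keys %modes)
            if !defined($modes{$mode});

        # Let the test choose which location it wants rather than
        # guessing based on the node's recovery state.
        return $self->safe_psql('postgres', "SELECT $modes{$mode}");
    }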
On 22 December 2016 at 14:21, Craig Ringer <craig@2ndquadrant.com> wrote: changes-in-0001-v2.diff shows the changes to PostgresNode.pm per Michael's comments, and applies on top of 0001. I also attach a patch to add a new CREATE_REPLICATION_SLOT option per Petr's suggestion, so you can request a slot be created WITHOUT_SNAPSHOT. This replaces the patch series's behaviour of silently suppressing snapshot export when a slot was created on a replica. It'll conflict (easily resolved) if applied on top of the current series. I have more to do before re-posting the full series, so waiting on author at this point. The PostgresNode changes likely break later tests; I'm just posting them so there's some progress here and so I don't forget over the next few days' distraction. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 12/22/2016 01:21 AM, Craig Ringer wrote: >> + my @fields = ('plugin', 'slot_type', 'datoid', 'database', >> 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn'); >> + my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ', >> @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name = >> '$slot_name'"); >> + $result = undef if $result eq ''; >> + # hash slice, see http://stackoverflow.com/a/16755894/398670 . >> Couldn't this portion be made more generic? Other queries could >> benefit from that by having a routine that accepts as argument an >> array of column names for example. > Yeah, probably. I'm not sure where it should live though - TestLib.pm ? > > Not sure if there's an idiomatic way to pass a string (in this case > a query) in Perl with a placeholder for interpolation of values (in this > case columns). In Python you'd pass it with pre-defined > %(placeholders)s markers for the % operator. > > For direct interpolation of an expression, there is this slightly baroque gadget: my $str = "here it is @{[ arbitrary expression here ]}"; For indirect interpolation using placeholders, there is my $str = sprintf("format string",...); which works much like C except that the string is returned as a result instead of being the first argument. And as we always say, TIMTOWTDI. cheers andrew (japh)
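To make the two idioms concrete, here is a tiny standalone example (not taken from any of the patches) building the slot query both ways:

    use strict;
    use warnings;

    my @fields = ('plugin', 'slot_type', 'datoid');
    my $slot_name = 'test_slot';

    # 1. Direct interpolation of an arbitrary expression via @{[ ... ]}
    my $sql_direct = "SELECT @{[ join(', ', @fields) ]} FROM pg_catalog.pg_replication_slots"
        . " WHERE slot_name = '$slot_name'";

    # 2. Indirect interpolation with sprintf placeholders
    my $sql_sprintf = sprintf(
        "SELECT %s FROM pg_catalog.pg_replication_slots WHERE slot_name = '%s'",
        join(', ', @fields), $slot_name);

    print "$sql_direct\n$sql_sprintf\n";

Both produce the same statement; the @{[ ]} form keeps the query readable in place, while sprintf separates the template from the values.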
On 23 December 2016 at 18:11, Craig Ringer <craig@2ndquadrant.com> wrote: > On 22 December 2016 at 14:21, Craig Ringer <craig@2ndquadrant.com> wrote: > > changes-in-0001-v2.diff shows the changes to PostgresNode.pm per > Michael's comments, and applies on top of 0001. > > I also attach a patch to add a new CREATE_REPLICATION_SLOT option per > Petr's suggestion, so you can request a slot be created > WITHOUT_SNAPSHOT. This replaces the patch series's behaviour of > silently suppressing snapshot export when a slot was created on a > replica. It'll conflict (easily resolved) if applied on top of the > current series. OK, patch series updated. 0001 incorporates the changes requested by Michael Paquier. Simon expressed his intention to commit this after updates, in the separate thread. The pg_lsn patch is gone; I worked around it using the server to work with LSNs. 0002 (endpos) is unchanged. 0003 is new: some minimal tests for pg_recvlogical. It can be squashed with 0002 (pg_recvlogical --endpos) if desired. 0004 (pg_recvlogical wrapper) is unchanged. 0005 (new streaming rep tests) is updated for the changes in 0001, otherwise unchanged. Simon said he wanted to commit this soon. 0006 (timeline following) is unchanged except for updates to be compatible with 0001. 0007 is the optional snapshot export requested by Petr. 0008 is unchanged. 0009 is unchanged except for updates vs 0001 and use of the WITHOUT_SNAPSHOT option added in 0007. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-PostgresNode-methods-to-wait-for-node-catchup.patch
- 0002-Add-an-optional-endpos-LSN-argument-to-pg_recvlogica.patch
- 0003-Add-some-minimal-tests-for-pg_recvlogical.patch
- 0004-Add-a-pg_recvlogical-wrapper-to-PostgresNode.patch
- 0005-Expand-streaming-replication-tests-to-cover-hot-stan.patch
- 0006-Follow-timeline-switches-in-logical-decoding.patch
- 0007-Make-snapshot-export-on-logical-slot-creation-option.patch
- 0008-ERROR-if-timeline-is-zero-in-walsender.patch
- 0009-Logical-decoding-on-standby.patch
On 4 January 2017 at 12:08, Craig Ringer <craig@2ndquadrant.com> wrote: > > 0001 incorporates the changes requested by Michael Paquier. Simon > expressed his intention to commit this after updates, in the separate > thread [...] ... > 0005 (new streaming rep tests) is updated for the changes in 0001, > otherwise unchanged. Simon said he wanted to commit this soon. > > 0006 (timeline following) is unchanged except for updates to be > compatible with 0001. > > 0007 is the optional snapshot export requested by Petr. > > 0008 is unchanged. > > 0009 is unchanged except for updates vs 0001 and use of the > WITHOUT_SNAPSHOT option added in 0007. Oh, note that it's possible to commit 0001 then 0005, skipping over 2..4. I should probably have ordered them that way. That's particularly relevant to you Simon as you expressed a wish to commit the new streaming rep tests. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 4 January 2017 at 12:15, Craig Ringer <craig@2ndquadrant.com> wrote: > That's particularly relevant to you Simon as you expressed a wish to > commit the new streaming rep tests. Patches 0001 and 0005 in this series also posted as https://www.postgresql.org/message-id/CAMsr+YHxTMrY1woH_m4bEF3f5+kxX_T=sDuyXf4d2-+e-56iFg@mail.gmail.com , since they're really pre-requisites not part of decoding on standby as such. I'll post a new series with them removed once committed. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 4 January 2017 at 16:19, Craig Ringer <craig@2ndquadrant.com> wrote: > On 4 January 2017 at 12:15, Craig Ringer <craig@2ndquadrant.com> wrote: > >> That's particularly relevant to you Simon as you expressed a wish to >> commit the new streaming rep tests. Simon committed 1, 2, 3 and 5: * Extra PostgresNode methods * pg_recvlogical --endpos * Tests for pg_recvlogical * Expand streaming replication tests to cover hot standby so here's a rebased series on top of master. No other changes. The first patch to add a pg_recvlogical wrapper to PostgresNode is really only needed to test the rest of the patches, so it can be committed together with them. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote: > so here's a rebased series on top of master. No other changes. Now with actual patches. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Thu, Jan 5, 2017 at 10:21 AM, Craig Ringer <craig@2ndquadrant.com> wrote: > On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote: > >> so here's a rebased series on top of master. No other changes. > > Now with actual patches. Looking at the PostgresNode code in 0001... +=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos, timeout_secs, ...) + This format is incorrect. I think that you also need to fix the brackets for the do{} and the eval{} blocks. + push @cmd, '--endpos', $endpos if ($endpos); endpos should be made a mandatory argument. If it is not defined, that would leave the test calling this routine stuck. -- Michael
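A sketch of the sort of guard being asked for here; the argument list and pg_recvlogical options follow the posted wrapper, while the rest (including returning the command line rather than running it) is purely illustrative:

    use Carp;

    sub pg_recvlogical_upto
    {
        my ($self, $dbname, $slot_name, $endpos, $timeout_secs, @plugin_options) = @_;

        # Without --endpos, pg_recvlogical never exits on its own and the
        # calling test would hang, so refuse to run without it.
        croak 'endpos must be specified' unless defined($endpos);

        my @cmd = ('pg_recvlogical', '-S', $slot_name, '--dbname',
            $self->connstr($dbname), '--endpos', $endpos,
            '-f', '-', '--no-loop', '--start');
        push @cmd, @plugin_options;

        # The real wrapper runs this command with a timeout and returns the
        # decoded output; returning the command line keeps the sketch short.
        return \@cmd;
    }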
On 5 January 2017 at 13:12, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Jan 5, 2017 at 10:21 AM, Craig Ringer <craig@2ndquadrant.com> wrote: >> On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote: >> >>> so here's a rebased series on top of master. No other changes. >> >> Now with actual patches. > > Looking at the PostgresNode code in 0001... > +=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos, > timeout_secs, ...) > + > This format is incorrect. I think that you also need to fix the > brackets for the do{} and the eval{} blocks. > > + push @cmd, '--endpos', $endpos if ($endpos); > endpos should be made a mandatory argument. If it is not defined, that > would leave the test calling this routine stuck. > -- > Michael Acknowledged and agreed. I'll fix both in the next revision. I'm currently working on hot standby feedback replies, but will return to this ASAP. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 6, 2017 at 1:07 PM, Craig Ringer <craig@2ndquadrant.com> wrote: > On 5 January 2017 at 13:12, Michael Paquier <michael.paquier@gmail.com> wrote: >> On Thu, Jan 5, 2017 at 10:21 AM, Craig Ringer <craig@2ndquadrant.com> wrote: >>> On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote: >>> >>>> so here's a rebased series on top of master. No other changes. >>> >>> Now with actual patches. >> >> Looking at the PostgresNode code in 0001... >> +=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos, >> timeout_secs, ...) >> + >> This format is incorrect. I think that you also need to fix the >> brackets for the do{} and the eval{} blocks. >> >> + push @cmd, '--endpos', $endpos if ($endpos); >> endpos should be made a mandatory argument. If it is not defined, that >> would leave the test calling this routine stuck. > > Acknowledged and agreed. I'll fix both in the next revision. I'm > currently working on hot standby feedback replies, but will return to this > ASAP. By the way, be sure to also fix the =pod blocks for the new routines. perldoc needs both =pod and =item. -- Michael
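For example, the header for the wrapper could follow the convention used elsewhere in PostgresNode.pm, roughly like this (the wording is illustrative, not the committed documentation):

    =pod

    =item $node->pg_recvlogical_upto(dbname, slot_name, endpos, timeout_secs, ...)

    Run pg_recvlogical against this node's database B<dbname>, reading from
    replication slot B<slot_name> until B<endpos> is reached or B<timeout_secs>
    elapses. Remaining arguments are passed through as extra pg_recvlogical
    options. Dies if endpos is not supplied.

    =cut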
On 5 January 2017 at 01:21, Craig Ringer <craig@2ndquadrant.com> wrote: > On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote: > >> so here's a rebased series on top of master. No other changes. > > Now with actual patches. Patch 5 no longer applies: patching file src/include/pgstat.h Hunk #1 FAILED at 745. 1 out of 1 hunk FAILED -- saving rejects to file src/include/pgstat.h.rej --- src/include/pgstat.h +++ src/include/pgstat.h @@ -745,7 +745,8 @@ typedef enum WAIT_EVENT_SYSLOGGER_MAIN, WAIT_EVENT_WAL_RECEIVER_MAIN, WAIT_EVENT_WAL_SENDER_MAIN, - WAIT_EVENT_WAL_WRITER_MAIN + WAIT_EVENT_WAL_WRITER_MAIN, + WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE} WaitEventActivity; /* ---------- Could you rebase? Thanks Thom
Rebased series attached, on top of current master (which includes logical replication). I'm inclined to think I should split out a few of the changes from 0005, roughly along the lines of the bullet points in its commit message. Anyone feel strongly about how granular this should be? This patch series is a prerequisite for supporting logical replication using a physical standby as a source, but does not itself enable logical replication from a physical standby. On 23 January 2017 at 23:03, Thom Brown <thom@linux.com> wrote: > On 5 January 2017 at 01:21, Craig Ringer <craig@2ndquadrant.com> wrote: >> On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote: >> >>> so here's a rebased series on top of master. No other changes. >> >> Now with actual patches. > > Patch 5 no longer applies: > > patching file src/include/pgstat.h > Hunk #1 FAILED at 745. > 1 out of 1 hunk FAILED -- saving rejects to file src/include/pgstat.h.rej > > --- src/include/pgstat.h > +++ src/include/pgstat.h > @@ -745,7 +745,8 @@ typedef enum > WAIT_EVENT_SYSLOGGER_MAIN, > WAIT_EVENT_WAL_RECEIVER_MAIN, > WAIT_EVENT_WAL_SENDER_MAIN, > - WAIT_EVENT_WAL_WRITER_MAIN + WAIT_EVENT_WAL_WRITER_MAIN, > + WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE > } WaitEventActivity; > > /* ---------- > > Could you rebase? > > Thanks > > Thom -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Tue, Jan 24, 2017 at 7:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote: > Rebased series attached, on top of current master (which includes > logical replication). > > I'm inclined to think I should split out a few of the changes from > 0005, roughly along the lines of the bullet points in its commit > message. Anyone feel strongly about how granular this should be? > > This patch series is a prerequisite for supporting logical > replication using a physical standby as a source, but does not itself > enable logical replication from a physical standby. There are patches but no reviews yet, so moved to CF 2017-03. -- Michael
On 24 January 2017 at 06:37, Craig Ringer <craig@2ndquadrant.com> wrote: > Rebased series attached, on top of current master (which includes > logical replication). > > I'm inclined to think I should split out a few of the changes from > 0005, roughly along the lines of the bullet points in its commit > message. Anyone feel strongly about how granular this should be? > > This patch series is a prerequisite for supporting logical > replication using a physical standby as a source, but does not itself > enable logical replication from a physical standby. Patch 4 committed. Few others need rebase. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > Patch 4 committed. Few others need rebase. Since this patch series and initial data copy for logical replication both add a facility for suppressing initial snapshot export on a logical slot, I've dropped patch 0003 (make snapshot export on logical slot creation) in favour of Petr's similar patch. I will duplicate it in this patch series for ease of application. (The version here is slightly extended over Petr's so I'll re-post the modified version on the logical replication initial data copy thread too). The main thing I want to direct attention to for Simon, as committer, is the xlog'ing of VACUUM's xid threshold before we advance it and start removing tuples. This is necessary for the standby to know whether a given replication slot is safe to use and fail with conflict with recovery if it is not, or if it becomes unsafe due to master vacuum activity. Note that we can _not_ use the various vacuum records for this because we don't know which are catalogs and which aren't; we'd have to add a separate "is catalog" field to each vacuum xlog record, which is undesirable. The downstream can't look up whether it's a catalog or not because it doesn't have relcache/syscache access during decoding. This change might look a bit similar to the vac_truncate_clog change in the txid_status patch, but it isn't really related. The txid_status change is about knowing when we can safely look up xids in clog and preventing a race with clog truncation. This change is about knowing when we can know all catalog tuples for a given xid will still be in the heap, not vacuumed away. Both are about making sure standbys know more about the state of the system in a low-cost way, though. WaitForMasterCatalogXminReservation(...) in logical.c is also worth looking more closely at. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 13 March 2017 at 10:56, Craig Ringer <craig@2ndquadrant.com> wrote: > On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > >> Patch 4 committed. Few others need rebase. > > Since this patch series and initial data copy for logical replication > both add a facility for suppressing initial snapshot export on a > logical slot, I've dropped patch 0003 (make snapshot export on logical > slot creation) in favour of Petr's similar patch. > > I will duplicate it in this patch series for ease of application. (The > version here is slightly extended over Petr's so I'll re-post the > modified version on the logical replication initial data copy thread > too). > > The main thing I want to direct attention to for Simon, as committer, > is the xlog'ing of VACUUM's xid threshold before we advance it and > start removing tuples. This is necessary for the standby to know > whether a given replication slot is safe to use and fail with conflict > with recovery if it is not, or if it becomes unsafe due to master > vacuum activity. Note that we can _not_ use the various vacuum records > for this because we don't know which are catalogs and which aren't; > we'd have to add a separate "is catalog" field to each vacuum xlog > record, which is undesirable. The downstream can't look up whether > it's a catalog or not because it doesn't have relcache/syscache access > during decoding. > > This change might look a bit similar to the vac_truncate_clog change > in the txid_status patch, but it isn't really related. The txid_status > change is about knowing when we can safely look up xids in clog and > preventing a race with clog truncation. This change is about knowing > when we can know all catalog tuples for a given xid will still be in > the heap, not vacuumed away. Both are about making sure standbys know > more about the state of the system in a low-cost way, though. > > WaitForMasterCatalogXminReservation(...) in logical.c is also worth > looking more closely at. I should also note that because the TAP tests currently take a long time, I recommend skipping the tests for this patch by default and running them only when actually touching logical decoding. I'm looking at ways to make them faster, but they're inevitably going to take a while until we can get hot standby feedback replies in place, including cascading support. Which I have as WIP, but won't make this release. Changing the test import to use Test::More skip_all => "disabled by default, too slow"; will be sufficient. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
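For instance, a test file could gate itself on an opt-in switch rather than being skipped unconditionally; the environment variable name and the placeholder test below are made up for illustration:

    use strict;
    use warnings;
    use Test::More;

    # Opt-in switch so the slow standby decoding tests stay out of the default run.
    if (!$ENV{PG_TEST_LOGICAL_DECODING_STANDBY})
    {
        plan skip_all => 'logical decoding on standby tests disabled by default, too slow';
    }

    plan tests => 1;
    pass('placeholder for the real standby decoding tests');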
On 13 March 2017 at 10:56, Craig Ringer <craig@2ndquadrant.com> wrote: > On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > >> Patch 4 committed. Few others need rebase. > > Since this patch series Patch 1 fails since the feature has already been applied. If there's another reason, let me know. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 19 March 2017 at 18:02, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > Patch 1 fails since the feature has already been applied. If there's another reason, > let me know. Nope, that's fine. Rebased attached. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, I don't know how well I can review the 0001 (the TAP infra patch) but it looks okay to me. I don't really have any complaints about 0002 either. I like that it's more or less one self-contained function and there are no weird ifdefs anymore like in 9.6 version (btw your commit message talks about 9.5 but it was 9.6). I also like the clever test :) I am slightly worried about impact of the readTimeLineHistory() call but I think it should be called so little that it should not matter. That brings us to the big patch 0003. I still don't like the "New in 10.0" comments in documentation, for one it's 10, not 10.0 and mainly we don't generally write stuff like this to documentation, that's what release notes are for. There is large amounts of whitespace mess (empty lines with only whitespace, spaces at the end of the lines), nothing horrible, but should be cleaned up. One thing I don't understand much is the wal_level change and turning off catalogXmin tracking. I don't really see anywhere that the catalogXmin would be reset in control file for example. There is TODO in UpdateOldestCatalogXmin() that seems related but tbh I don't follow what's happening there - comment says about not doing anything, but the code inside the if block are only Asserts. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 19 March 2017 at 21:12, Craig Ringer <craig@2ndquadrant.com> wrote: > Rebased attached. Patch1 looks good to go. I'll correct a spelling mistake in the tap test when I commit that later today. Patch2 has a couple of points: 2.1 Why does call to ReplicationSlotAcquire() move earlier in pg_logical_slot_get_changes_guts()? 2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really documented well. The setting sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID; should be sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID); but that doesn't cause failure because in read_local_xlog_page() we say that we are reading from history when state->currTLI != ThisTimeLineID explicitly rather than use sendTimeLineIsHistoric. So it looks like we could do with a few extra comments. If you correct these I'll commit it tomorrow. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 20 March 2017 at 14:57, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > 2.1 Why does call to ReplicationSlotAcquire() move earlier in > pg_logical_slot_get_changes_guts()? That appears to be an oversight from an earlier version where it looped over timelines in pg_logical_slot_get_changes_guts . Reverted. > 2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really > documented well. > The setting > sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID; > should be > sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID); Definitely wrong. Fixed. > but that doesn't cause failure because in read_local_xlog_page() we > say that we are reading from history when > state->currTLI != ThisTimeLineID explicitly rather than use > sendTimeLineIsHistoric XLogRead(...), as called by logical_read_xlog_page, does test it. It's part of the walsender-local log read callback. We don't hit read_local_xlog_page at all when we're doing walsender based logical decoding. We have two parallel code paths for reading xlogs, one for walsender, one for normal backends. The walsender one is glued together with a bunch of globals that pass state "around" the xlogreader - we set it up before calling into xlogreader, and then examine it when xlogreader calls back into walsender.c with logical_read_xlog_page. I really want to refactor that at some stage, getting rid of the use of walsender globals for timeline state tracking and sharing more of the xlog reading logic between walsender and normal backends. But -ENOTIME, especially to do it as carefully as it must be done. There are comments on read_local_xlog_page, logical_read_xlog_page that mention this. Also XLogRead in src/backend/access/transam/xlogutils.c (which has the same name as XLogRead in src/backend/replication/walsender.c). I have a draft for a timeline following readme that would address some of this but don't expect to be able to finish it off for this release cycle, and I'd really rather clean it up instead. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 20 March 2017 at 17:03, Craig Ringer <craig@2ndquadrant.com> wrote: > On 20 March 2017 at 14:57, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > >> 2.1 Why does call to ReplicationSlotAcquire() move earlier in >> pg_logical_slot_get_changes_guts()? > > That appears to be an oversight from an earlier version where it > looped over timelines in pg_logical_slot_get_changes_guts . Reverted. > >> 2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really >> documented well. >> The setting >> sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID; >> should be >> sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID); > > Definitely wrong. Fixed. Attached. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, Have you checked how high the overhead of XLogReadDetermineTimeline is? A non-local function call, especially into a different translation-unit (no partial inlining), for every single page might end up being noticeable. That's fine in the cases it actually adds functionality, but for a master streaming out data, that's not actually adding anything. Did you check whether you changes to read_local_xlog_page could cause issues with twophase.c? Because that now also uses it. Did you check whether ThisTimeLineID is actually always valid in the processes logical decoding could run in? IIRC it's not consistently update during recovery in any process but the startup process. On 2017-03-19 21:12:23 +0800, Craig Ringer wrote: > From 2fa891a555ea4fb200d75b8c906c6b932699b463 Mon Sep 17 00:00:00 2001 > From: Craig Ringer <craig@2ndquadrant.com> > Date: Thu, 1 Sep 2016 10:16:55 +0800 > Subject: [PATCH 2/3] Follow timeline switches in logical decoding FWIW, the title doesn't really seem accurate to me. > Logical slots cannot actually be created on a replica without use of > the low-level C slot management APIs so this is mostly foundation work > for subsequent changes to enable logical decoding on standbys. Everytime I read references to anything like this my blood starts to boil. I kind of regret not having plastered RecoveryInProgress() errors all over this code. > From 8854d44e2227b9d076b0a25a9c8b9df9270b2433 Mon Sep 17 00:00:00 2001 > From: Craig Ringer <craig@2ndquadrant.com> > Date: Mon, 5 Sep 2016 15:30:53 +0800 > Subject: [PATCH 3/3] Logical decoding on standby > > * Make walsender aware of ProcSignal and recovery conflicts, make walsender > exit with recovery conflict on upstream drop database when it has an active > logical slot on that database. > * Allow GetOldestXmin to omit catalog_xmin, be called already locked. "be called already locked"? > * Send catalog_xmin separately in hot_standby_feedback messages. > * Store catalog_xmin separately on a physical slot if received in hot_standby_feedback What does separate mean? > * Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin, > requiring that xlog be emitted before vacuum can remove no longer needed catalogs, store > it in checkpoints, make vacuum and bgwriter advance it. I can't parse that sentence. > * Add a new recovery conflict type for conflict with catalog_xmin. Abort > in-progress logical decoding sessions with conflict with recovery where needed > catalog_xmin is too old Are we retaining WAL for slots broken in that way? > * Make extra efforts to reserve master's catalog_xmin during decoding startup > on standby. What does that mean? > * Remove checks preventing starting logical decoding on standby To me that's too many different things in one commit. A bunch of them seem like it'd be good if they'd get independent buildfarm cycles too. > diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c > index d7f65a5..36bbb98 100644 > --- a/src/backend/access/heap/rewriteheap.c > +++ b/src/backend/access/heap/rewriteheap.c > @@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state) > if (!state->rs_logical_rewrite) > return; > > - ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin); > + /* Use the catalog_xmin being retained by vacuum */ > + ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL); What does that comment mean? Vacuum isn't the only thing that prunes old records. 
> +/* > + * Set the global oldest catalog_xmin used to determine when tuples > + * may be removed from catalogs and user-catalogs accessible from logical > + * decoding. > + * > + * Only to be called from the startup process or by UpdateOldestCatalogXmin(), > + * which ensures the update is properly written to xlog first. > + */ > +void > +SetOldestCatalogXmin(TransactionId oldestCatalogXmin) > +{ > + Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock)); Uh, that's long-ish. And doesn't agree with the comment above (s/startup process/process performing recovery/?). This is a long enough list that I'd consider just dropping the assert. > + else if (info == XLOG_XACT_CATALOG_XMIN_ADV) > + { > + xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record); > + > + /* > + * Unless logical decoding is possible on this node, we don't care about > + * this record. > + */ > + if (!XLogLogicalInfoActive() || max_replication_slots == 0) > + return; Too many negatives for my taste, but whatever. > + /* > + * Apply the new catalog_xmin limit immediately. New decoding sessions > + * will refuse to start if their slot is past it, and old ones will > + * notice when we signal them with a recovery conflict. There's no > + * effect on the catalogs themselves yet, so it's safe for backends > + * with older catalog_xmins to still exist. > + * > + * We don't have to take ProcArrayLock since only the startup process > + * is allowed to change oldestCatalogXmin when we're in recovery. > + */ > + SetOldestCatalogXmin(xlrec->new_catalog_xmin); Which seems to rely on ResolveRecoveryConflictWithLogicalDecoding's lwlock acquisition for barriers? > +/* > + * Record when we advance the catalog_xmin used for tuple removal > + * so standbys find out before we remove catalog tuples they might > + * need for logical decoding. > + */ > +XLogRecPtr > +XactLogCatalogXminUpdate(TransactionId new_catalog_xmin) > +{ > + XLogRecPtr ptr = InvalidXLogRecPtr; > + > + if (XLogInsertAllowed()) > + { > + xl_xact_catalog_xmin_advance xlrec; > + > + xlrec.new_catalog_xmin = new_catalog_xmin; > + > + XLogBeginInsert(); > + XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance); > + > + ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV); > + } Huh, why is this test needed and ok? > @@ -9449,6 +9456,16 @@ XLogReportParameters(void) > XLogFlush(recptr); > } > > + /* > + * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer > + * require oldestCatalogXmin in checkpoints and it no longer > + * makes sense, so update shmem and xlog the change. This will > + * get written out in the next checkpoint. > + */ > + if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL && > + wal_level < WAL_LEVEL_LOGICAL) > + UpdateOldestCatalogXmin(true); What if we crash before this happens? > diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c > index ff633fa..2d16bf0 100644 > --- a/src/backend/commands/vacuum.c > +++ b/src/backend/commands/vacuum.c > @@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel, > MultiXactId safeMxactLimit; > > /* > + * When logical decoding is enabled, we must write any advance of > + * catalog_xmin to xlog before we allow VACUUM to remove those tuples. > + * This ensures that any standbys doing logical decoding can cancel > + * decoding sessions and invalidate slots if we remove tuples they > + * still need. 
> + */ > + UpdateOldestCatalogXmin(false); I'm on a first read-through through this, but it appears you don't do anything similar in heap_page_prune()? And we can't just start emitting loads of additional records there, because it's called much more often... > /* > * Make sure the current settings & environment are capable of doing logical > * decoding. > @@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void) > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("logical decoding requires a database connection"))); > > - /* ---- > - * TODO: We got to change that someday soon... > - * > - * There's basically three things missing to allow this: > - * 1) We need to be able to correctly and quickly identify the timeline a > - * LSN belongs to > - * 2) We need to force hot_standby_feedback to be enabled at all times so > - * the primary cannot remove rows we need. > - * 3) support dropping replication slots referring to a database, in > - * dbase_redo. There can't be any active ones due to HS recovery > - * conflicts, so that should be relatively easy. > - * ---- > - */ > if (RecoveryInProgress()) > - ereport(ERROR, > - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > - errmsg("logical decoding cannot be used while in recovery"))); > + { > + bool walrcv_running, walrcv_has_slot; > + > + SpinLockAcquire(&WalRcv->mutex); > + walrcv_running = WalRcv->pid != 0; > + walrcv_has_slot = WalRcv->slotname[0] != '\0'; > + SpinLockRelease(&WalRcv->mutex); > + > + /* > + * The walreceiver should be running when we try to create a slot. If > + * we're unlucky enough to catch the walreceiver just as it's > + * restarting after an error, well, the client can just retry. We don't > + * bother to sleep and re-check. > + */ > + if (!walrcv_running) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("streaming replication is not active"), > + errhint("Logical decoding on standby requires that streaming replication be configured and active.Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the logs."))); That seems quite problematic. What if there's a momentaneous connection failure? This also has the issue that just because you checked that walrcv_running at some point, doesn't guarantee anything by the time you actually check. Seems like life were easier if recovery.conf were guc-ified already - checking for primary_conninfo/primary_slot_name etc wouldn't have that issue (and can't be changed while running). Usage of a slot doesn't actually guarantee much in cascased setups, does it? > @@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin, > * xmin horizons by other backends, get the safe decoding xid, and inform > * the slot machinery about the new limit. Once that's done the > * ProcArrayLock can be released as the slot machinery now is > - * protecting against vacuum. > + * protecting against vacuum - if we're on the master. If we're running on > + * a replica, we have to wait until hot_standby_feedback locks in our > + * needed catalogs, per details on WaitForMasterCatalogXminReservation(). 
> * ---- > */ > LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); > @@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin, > > ReplicationSlotsComputeRequiredXmin(true); > > + if (RecoveryInProgress()) > + WaitForMasterCatalogXminReservation(slot); > + > + Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin, > + slot->data.catalog_xmin)); > + > LWLockRelease(ProcArrayLock); I think it's quite a bad idea to do a blocking operation like WaitForMasterCatalogXminReservation while holding ProcArrayLock. > +/* > + * Wait until the master's catalog_xmin is set, advancing our catalog_xmin > + * if needed. Caller must hold exclusive ProcArrayLock, which this function will > + * temporarily release while sleeping but will re-acquire. Ah. I see. Hm :(. > + * We're pretty much just hoping that, if someone else already has a > + * catalog_xmin reservation affecting the master, it stays where we want it > + * until our own hot_standby_feedback can pin it down. Hm. > + * When we're creating a slot on a standby we can't directly set the > + * master's catalog_xmin; the catalog_xmin is set locally, then relayed > + * over hot_standby_feedback. The master may remove the catalogs we > + * asked to reserve between when we set a local catalog_xmin and when > + * hs feedback makes that take effect on the master. We need a feedback > + * reply mechanism here, where: > + * > + * - we tentatively reserve catalog_xmin locally Will that already trigger recovery conflicts? > + * - we wake the walreceiver by setting its latch > + * - walreceiver sends hs_feedback > + * - upstream walsender sends a new 'hs_feedback reply' message with > + * actual (xmin, catalog_xmin) reserved. > + * - walreceiver sees reply and updates ShmemVariableCache or some other > + * handy bit of shmem with hs feedback reservations from reply "or some other handy bit"? > + * - we poll the reservations while we wait > + * - we set our catalog_xmin to that value, which might be later if > + * we missed our requested reservation, or might be earlier if > + * someone else is holding down catalog_xmin on master. We got a hs > + * feedback reply so we know it's reserved. > + * > + * For cascading, the actual reservation will need to cascade up the > + * chain by walsender setting its own walreceiver's latch in turn, etc. > + * > + * For now, we just set the local slot catalog_xmin and sleep until > + * oldestCatalogXmin equals or passes our reservation. This is fine if we're > + * the only decoding session, but it is vulnerable to races if slots on the > + * master or other decoding sessions on other standbys connected to the same > + * master exist. They might advance their reservation before our hs_feedback > + * locks it down, allowing vacuum to remove tuples we need. So we might start > + * decoding on our slot then error with a conflict with recovery when we see > + * catalog_xmin advance. > + */ I was about to list some of these issues. That's a bit unsatisfying. Pondering this for a bit, but I'm ~9h into a flight, so maybe not tonight^Wthis morning^Wwhaaaa. > +static void > +WaitForMasterCatalogXminReservation(ReplicationSlot *slot) > +{ This comment seems to duplicate some of the function header comment. Such duplication usually leads to either or both getting out of date rather quickly. Not commenting line-by-line on the code here, but I'm extremely doubtful that this approach is stable enough, and that the effect of holding ProcArrayLock exclusively over prolonged amounts of time is acceptable. 
> + ReplicationSlotsComputeRequiredXmin(true); > Why do we need this? The caller does it too, no? > + /* Tell the master what catalog_xmin we settled on */ > + WalRcvForceReply(); > + > + /* Reset ps display if we changed it */ > + if (new_status) > + { > + set_ps_display(new_status, false); > + pfree(new_status); > + } We really shouldn't do stuff like this while holding ProcArrayLock. > +/* > + * Test to see if the active logical slot is usable. > + */ > +static void > +EnsureActiveLogicalSlotValid() > +{ Missing (void). > +/* > + * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the > + * passed database oid. The caller should hold an exclusive lock on the database > + * to ensure no replication slots on the database are in use. Stuff like this really should be it's own commit. It can trivially be tested on its own, is useful on its own (just have DROP DATABASE do it), ... > + * If we fail here we'll leave the in-memory state of replication slots > + * inconsistent with its on-disk state, so we need to PANIC. We worked quite hard to make it extremely unlikely for that to happen in practice. I also don't see why there should be any new PANICs in this code. > + * This routine isn't as efficient as it could be - but we don't drop databases > + * often, especially databases with lots of slots. That seems fine. > +void > +ReplicationSlotsDropDBSlots(Oid dboid) > +{ > + int i; > + > + if (max_replication_slots <= 0) > + return; > + > + /* > + * We only need a shared lock here even though we activate slots, > + * because we have an exclusive lock on the database we're dropping > + * slots on and don't touch other databases' slots. > + */ > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); Hm? Acquiring a slot always only takes a shared lock, no? I don't really see how "database is locked" guarantees enough for your logic - it's already possible to drop slots from other databases, and dropping a slot acquires it temporarily? > + for (i = 0; i < max_replication_slots; i++) > + { > + ReplicationSlot *s; > + NameData slotname; > + int active_pid; > + > + s = &ReplicationSlotCtl->replication_slots[i]; > + > + /* cannot change while ReplicationSlotCtlLock is held */ > + if (!s->in_use) > + continue; > + > + /* only logical slots are database specific, skip */ > + if (!SlotIsLogical(s)) > + continue; > + > + /* not our database, skip */ > + if (s->data.database != dboid) > + continue; > + > + /* Claim the slot, as if ReplicationSlotAcquire()ing */ > + SpinLockAcquire(&s->mutex); > + strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN); > + NameStr(slotname)[NAMEDATALEN-1] = '\0'; > + active_pid = s->active_pid; > + if (active_pid == 0) > + { > + MyReplicationSlot = s; > + s->active_pid = MyProcPid; > + } > + SpinLockRelease(&s->mutex); > + > + /* > + * The caller should have an exclusive lock on the database so > + * we'll never have any in-use slots, but just in case... > + */ > + if (active_pid) > + elog(PANIC, "replication slot %s is in use by pid %d", > + NameStr(slotname), active_pid); So, yea, this doesn't seem ok. Why don't we just ERROR out, instead of PANICing? There seems to be absolutely no correctness reason for a PANIC here? > + /* > + * To avoid largely duplicating ReplicationSlotDropAcquired() or > + * complicating it with already_locked flags for ProcArrayLock, > + * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we > + * just release our ReplicationSlotControlLock to drop the slot. 
> + * > + * There's no race here: we acquired this slot, and no slot "behind" > + * our scan can be created or become active with our target dboid due > + * to our exclusive lock on the DB. > + */ > + LWLockRelease(ReplicationSlotControlLock); > + ReplicationSlotDropAcquired(); > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); I don't see much problem with this, but I'd change the code so you simply do a goto restart; if you released the slot. Then there's a lot less chance / complications around temporarily releasing ReplicationSlotControlLock. > + * > + * If logical decoding information is enabled, we also > + * send immediate hot standby feedback so as to reduce > + * the delay before our needed catalogs are locked in. "logical decoding information ... enabled" and "catalogs are locked in" are a bit too imprecise descriptions for my taste. > @@ -1175,8 +1181,8 @@ XLogWalRcvSendHSFeedback(bool immed) > { > TimestampTz now; > TransactionId nextXid; > - uint32 nextEpoch; > - TransactionId xmin; > + uint32 xmin_epoch, catalog_xmin_epoch; > + TransactionId xmin, catalog_xmin; > static TimestampTz sendTime = 0; > /* initially true so we always send at least one feedback message */ > static bool master_has_standby_xmin = true; > @@ -1221,29 +1227,57 @@ XLogWalRcvSendHSFeedback(bool immed) > * everything else has been checked. > */ > if (hot_standby_feedback) > - xmin = GetOldestXmin(NULL, false); > + { > + /* > + * Usually GetOldestXmin() would include the catalog_xmin in its > + * calculations, but we don't want to hold upstream back from vacuuming > + * normal user table tuples just because they're within the > + * catalog_xmin horizon of logical replication slots on this standby. > + * Instead we report the catalog_xmin to the upstream separately. > + */ I again don't think it's good to refer to vacuum as it's not the only thing that can remove tuple versions. > + xmin = GetOldestXmin(NULL, > + false, /* don't ignore vacuum */ > + true /* ignore catalog xmin */); > + > + /* > + * The catalog_Xmin reported by GetOldestXmin is the effective > + * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance > + * records from the master. Sending it back to the master would be > + * circular and prevent its catalog_xmin ever advancing once set. > + * We should only send the catalog_xmin we actually need for slots. > + */ > + ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin); Given that you don't have catalog_xmin set by GetOldestXmin that comment is a bit misleading. > @@ -1427,19 +1436,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum) > NormalTransactionIdPrecedes(replication_slot_xmin, result)) > result = replication_slot_xmin; > > + if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel))) > + { > + /* > + * After locks have been released and defer_cleanup_age has been applied, > + * check whether we need to back up further to make logical decoding > + * safe. We need to do so if we're computing the global limit (rel = > + * NULL) or if the passed relation is a catalog relation of some kind. > + */ > + if (TransactionIdIsValid(replication_slot_catalog_xmin) && > + NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result)) > + result = replication_slot_catalog_xmin; > + } The nesting of these checks, and the comments about them, is a bit weird. 
> +/* > + * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated > + * to reflect an advance in procArray->replication_slot_catalog_xmin or > + * it becoming newly set or unset. > + * > + */ > +static bool > +CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin) > +{ > + return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin) > + || (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin))); > +} Your lines are really long - pgindent (which you really should run) will much this. I think it'd be better to rephrase this. > +/* > + * If necessary, copy the current catalog_xmin needed by repliation slots to Typo: repliation > + * the effective catalog_xmin used for dead tuple removal. > + * > + * When logical decoding is enabled we write a WAL record before advancing the > + * effective value so that standbys find out if catalog tuples they still need > + * get removed, and can properly cancel decoding sessions and invalidate slots. > + * > + * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off > + * and need to clear the shmem state, since we want to bypass the wal_level > + * check and force xlog writing. > + */ > +void > +UpdateOldestCatalogXmin(bool force) I'm a bit confused by this function and variable name. What does + TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state + * is guaranteed to still exist */ mean? I complained about the overall justification in the commit already, but looking at this commit alone, the justification for this part of the change is quite hard to understand. > +{ > + TransactionId vacuum_catalog_xmin; > + TransactionId slots_catalog_xmin; > + > + /* > + * If we're not recording logical decoding information, catalog_xmin > + * must be unset and we don't need to do any work here. If we don't need to do any work, shouldn't we return early? > + if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force) > + { > + XactLogCatalogXminUpdate(slots_catalog_xmin); > + > + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); > + /* > + * A concurrent updater could've changed these values so we need to re-check > + * under ProcArrayLock before updating. > + */ > + vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin); > + slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin); why are there volatile reads here? > + if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin)) > + SetOldestCatalogXmin(slots_catalog_xmin); Why don't we check force here, but above? > @@ -2167,14 +2250,20 @@ GetOldestSafeDecodingTransactionId(void) > oldestSafeXid = ShmemVariableCache->nextXid; > > /* > - * If there's already a slot pegging the xmin horizon, we can start with > - * that value, it's guaranteed to be safe since it's computed by this > - * routine initially and has been enforced since. > + * If there's already an effectiveCatalogXmin held down by vacuum > + * it's definitely safe to start there, and it can't advance > + * while we hold ProcArrayLock. What does "held down by vacuum" mean? > /* > + * Notify a logical decoding session that it conflicts with a > + * newly set catalog_xmin from the master. 
> + */ > +void > +CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid) > +{ > + ProcArrayStruct *arrayP = procArray; > + int index; > + > + /* > + * We have to scan ProcArray to find the process and set a pending recovery > + * conflict even though we know the pid. At least we can get the BackendId > + * and void a ProcSignal scan later. > + * > + * The pid might've gone away, in which case we got the desired > + * outcome anyway. > + */ > + LWLockAcquire(ProcArrayLock, LW_SHARED); > + > + for (index = 0; index < arrayP->numProcs; index++) > + { > + int pgprocno = arrayP->pgprocnos[index]; > + volatile PGPROC *proc = &allProcs[pgprocno]; > + > + if (proc->pid == session_pid) > + { > + VirtualTransactionId procvxid; > + > + GET_VXID_FROM_PGPROC(procvxid, *proc); > + > + proc->recoveryConflictPending = true; > + > + /* > + * Kill the pid if it's still here. If not, that's what we > + * wanted so ignore any errors. > + */ > + (void) SendProcSignal(session_pid, > + PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId); > + > + break; > + } > + } > + > + LWLockRelease(ProcArrayLock); Doesn't seem ok to do this while holding ProcArrayLock. > +/* > + * Scan to see if any clients are using replication slots that are below the > + * new catalog_xmin theshold and sigal them to terminate with a recovery > + * conflict. > + * > + * We already applied the new catalog_xmin record and updated the shmem > + * catalog_xmin state, so new clients that try to use a replication slot > + * whose on-disk catalog_xmin is below the new threshold will ERROR, and we > + * don't have to guard against them here. > + * > + * Replay can only continue safely once every slot that needs the catalogs > + * we're going to free for removal is gone. So if any conflicting sessions > + * exist, wait for any standby conflict grace period then signal them to exit. > + * > + * The master might clear its reserved catalog_xmin if all upstream slots are > + * removed or clear their feedback reservations, sending us > + * InvalidTransactionId. If we're concurrently trying to create a new slot and > + * reserve catalogs the InvalidXid reservation report might come in while we > + * have a slot waiting for hs_feedback confirmation of its reservation. That > + * would cause the waiting process to get canceled with a conflict with > + * recovery here since its tentative reservation conflicts with the master's > + * report of 'nothing reserved'. To allow it to continue to seek a startpoint > + * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're > + * still looking for where to start. We'll sometimes notice a conflict but the > + * slot will advance its catalog_xmin to a more recent nextXid and cease to > + * conflict when we re-check. (The alternative is to track slots being created > + * differently to slots actively decoding in shmem, which seems unnecessary. Or > + * to separate the 'tentative catalog_xmin reservation' of a slot from its > + * actual needed catalog_xmin.) > + * > + * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because > + * walsender-based logical decoding sessions won't have any virtualxid for much > + * of their life and the end of their virtualxids doesn't mean the end of a > + * potential conflict. It would also cancel too aggressively, since it cares > + * about the backend's xmin and logical decoding only needs the catalog_xmin. > + */ The use of "we" seems confusing here, because it's not the same process. 
Generally I think your comments need to be edited a bit for brevity and preciseness. > +void > +ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin) > +{ > + int i; > + > + if (!InHotStandby) > + /* nobody can be actively using logical slots */ > + return; > + > + /* Already applied new limit, can't have replayed later one yet */ > + Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin); > + > + /* > + * Find the first conflicting active slot and wait for it to be free, > + * signalling it if necessary, then repeat until there are no more > + * conflicts. > + */ > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > + for (i = 0; i < max_replication_slots; i++) > + { I'm pretty strongly against any code outside of slot.c doing this. > @@ -2789,12 +2797,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason) > Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending)); > > /* > - * All conflicts apart from database cause dynamic errors where the > - * command or transaction can be retried at a later point with some > - * potential for success. No need to reset this, since non-retryable > - * conflict errors are currently FATAL. > + * All conflicts apart from database and catalog_xmin cause dynamic > + * errors where the command or transaction can be retried at a later > + * point with some potential for success. No need to reset this, since > + * non-retryable conflict errors are currently FATAL. > */ > - if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE) > + if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE || > + reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN) > RecoveryConflictRetryable = false; > } Hm. Why is this a non-retryable error? Ok, landing soon. Gotta finish here. 0002 should be doable as a whole this release, I have severe doubts that 0003 as a whole has a chance for 10 - the code is in quite a raw shape, there's a significant number of open ends. I'd suggest breaking of bits that are independently useful, and work on getting those committed. - Andres
On 19 March 2017 at 22:12, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > I am slightly worried about impact of the readTimeLineHistory() call but > I think it should be called so little that it should not matter. Pretty much my thinking too. > That brings us to the big patch 0003. > > I still don't like the "New in 10.0" comments in documentation, for one > it's 10, not 10.0 and mainly we don't generally write stuff like this to > documentation, that's what release notes are for. OK. Personally I think it's worthwhile for protocol docs, which are more dev-focused. But I agree it's not consistent with the rest of the docs, so removed. (Frankly I wish we did this consistently throughout the Pg docs, too, and it'd be much more user-friendly if we did, but that's just not going to happen.) > There is large amounts of whitespace mess (empty lines with only > whitespace, spaces at the end of the lines), nothing horrible, but > should be cleaned up. Fixed. > One thing I don't understand much is the wal_level change and turning > off catalogXmin tracking. I don't really see anywhere that the > catalogXmin would be reset in control file for example. There is TODO in > UpdateOldestCatalogXmin() that seems related but tbh I don't follow > what's happening there - comment says about not doing anything, but the > code inside the if block are only Asserts. UpdateOldestCatalogXmin(...) with force=true forces a XactLogCatalogXminUpdate(...) call to write the new procArray->replication_slot_catalog_xmin . We call it with force=true from XLogReportParameters(...) when wal_level has been lowered; see XLogReportParameters. This will write out a xl_xact_catalog_xmin_advance with procArray->replication_slot_catalog_xmin's value then update ShmemVariableCache->oldestCatalogXmin in shmem. ShmemVariableCache->oldestCatalogXmin will get written out in the next checkpoint, which gets incorporated in the control file. There is a problem though - StartupReplicationSlots() and RestoreSlotFromDisk() don't care if catalog_xmin is set on a slot but wal_level is < logical and will happily restore a logical slot, or a physical slot with a catalog_xmin. So we can't actually assume that procArray->replication_slot_catalog_xmin will be 0 if we're not writing new logical WAL. This isn't a big deal, it just means we can't short-circuit UpdateOldestCatalogXmin() calls if !XLogLogicalInfoActive(). It also means the XLogReportParameters() stuff can be removed since we don't care about wal_level for tracking oldestCatalogXmin. Fixed in updated patch. I'm now reading over Andres's review. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
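To make the UpdateOldestCatalogXmin() flow described above concrete, it boils down to roughly the following. This is a simplified sketch only, not the exact patch code; the real function re-reads both values and re-checks CatalogXminNeedsUpdate() under ProcArrayLock before updating, since a concurrent updater could have changed them.

void
UpdateOldestCatalogXmin(bool force)
{
    /* Simplified sketch; locking and re-check details omitted */
    TransactionId vacuum_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
    TransactionId slots_catalog_xmin = procArray->replication_slot_catalog_xmin;

    if (force || CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
    {
        /* xlog the new threshold first, so standbys find out about it ... */
        XactLogCatalogXminUpdate(slots_catalog_xmin);

        /* ... then publish it as the limit tuple removal actually uses */
        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
        SetOldestCatalogXmin(slots_catalog_xmin);
        LWLockRelease(ProcArrayLock);
    }
}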
On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote: > Hi, > > Have you checked how high the overhead of XLogReadDetermineTimeline is? > A non-local function call, especially into a different translation-unit > (no partial inlining), for every single page might end up being > noticeable. That's fine in the cases it actually adds functionality, > but for a master streaming out data, that's not actually adding > anything. I don't anticipate any significant effect given the large amount of indirection via decoding, reorder buffer, snapshot builder, output plugin, etc that we already do and how much memory allocation gets done ... but it's worth checking. I could always move the fast path into a macro or inline function if it does turn out to make a detectable difference (rough sketch at the end of this mail). One of the things I want to get to is refactoring all the xlog page reading stuff into a single place, shared between walsender and normal backends, to get rid of this confusing mess we currently have. The only necessary difference is how we wait for new WAL, the rest should be as common as possible allowing for xlogreader's needs. I particularly want to get rid of the two identically named static XLogRead functions. But all that probably means making timeline.c FRONTEND friendly and it's way too intrusive to contemplate at this stage. > Did you check whether you changes to read_local_xlog_page could cause > issues with twophase.c? Because that now also uses it. Thanks, that's a helpful point. The commit in question is 978b2f65. I didn't notice that it introduced XLogReader use in twophase.c, though I should've realised given the discussion about fetching recent 2pc info from xlog. I don't see any potential for issues at first glance, but I'll go over it in more detail. The main concern is validity of ThisTimeLineID, but since it doesn't run in recovery I don't see much of a problem there. That also means it can afford to use the current timeline-oblivious read_local_xlog_page safely. TAP tests for 2pc were added by 3082098. I'll check to make sure they have appropriate coverage for this. > Did you check whether ThisTimeLineID is actually always valid in the > processes logical decoding could run in? IIRC it's not consistently > update during recovery in any process but the startup process. I share your concerns that it may not be well enough maintained. Thank you for the reminder, that's been on my TODO and got lost when I had to task-hop to other priorities. I have some TAP tests to validate promotion that need finishing off. My main concern is around live promotions, both promotion of standby to master, and promotion of upstream master when streaming from a cascaded replica. [Will cover review of 0003 separately, next] > 0002 should be doable as a whole this release, I have severe doubts that > 0003 as a whole has a chance for 10 - the code is in quite a raw shape, > there's a significant number of open ends. I'd suggest breaking of bits > that are independently useful, and work on getting those committed. That would be my preference too. I do not actually feel strongly about the need for logical decoding on standby, and would in many ways prefer to defer it until we have two-way hot standby feedback and the ability to have the master confirm the actual catalog_xmin locked in to eliminate the current race and ugly workaround for it. I'd rather have solid timeline following in place now and bare-minimum failover capability.
I'm confident that the ability for xlogreader to follow timeline switches will also be independently useful. The parts I think are important for Pg10 are: * Teach xlogreader to follow timeline switches * Ability to create logical slots on replicas * Ability to advance (via feedback or via SQL function) - no need to actually decode and call output plugins at all. * Ability to drop logical slots on replicas That would be enough to provide minimal standby promotion without hideous hacks. Unfortunately, slot creation on standby is probably the ugliest part of the patch. It can be considerably simplified by imposing the rule that the application must ensure catalog_xmin on the master is already held down (with a replication slot) before creating a slot on the standby, and it's the application's job to send feedback to the master before any standbys it's keeping slots on. If the app fails to do so, the slot on the downstream will become unusable and attempts to decode changes from it will fail with conflict with recovery. That'd get rid of a lot of the code including some of the ugliest bits, since we'd no longer make any special effort with catalog_xmin during slot creation. We're already pushing complexity onto apps for this, after concluding that the transparent failover slots approach wasn't the way forward, so I'm OK with that. Let the apps that want logical decoding to support physical replica promotion pay most of the cost. I'd then like to revisit full decoding on standby later, once we have 2-way hot standby feedback, where the upstream can reply with confirmation xmin is locked in, including cascading handling. Getting there would mostly involve trimming this patch down, which is nice. It would be necessary to add a SQL function and/or walsender command to send feedback on a slot we're not currently replaying changes from, but I see that as independently valuable and have wanted it for a number of things already. We'd still have to decode (so we found the right restart_lsn), but we'd suppress output plugin calls entirely. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
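Re the XLogReadDetermineTimeline() overhead question near the top of this mail, the fast path I have in mind would be something like the following. This is a hypothetical, untested wrapper, not part of the patch; it assumes the currTLI field the timeline-following patch adds to XLogReaderState and ignores promotion edge cases.

static inline void
XLogReadCheckTimeline(XLogReaderState *state, XLogRecPtr wantPage,
                      uint32 wantLength)
{
    if (!RecoveryInProgress())
    {
        /* A master streaming out its own WAL is always on the current TLI */
        state->currTLI = ThisTimeLineID;
        return;
    }

    /* Standby and post-promotion cases take the out-of-line path */
    XLogReadDetermineTimeline(state, wantPage, wantLength);
}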
On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote: >> Subject: [PATCH 2/3] Follow timeline switches in logical decoding > > FWIW, the title doesn't really seem accurate to me. Yeah, it's not really at the logical decoding layer at all. "Teach xlogreader to follow timeline switches" ? >> Logical slots cannot actually be created on a replica without use of >> the low-level C slot management APIs so this is mostly foundation work >> for subsequent changes to enable logical decoding on standbys. > > Everytime I read references to anything like this my blood starts to > boil. I kind of regret not having plastered RecoveryInProgress() errors > all over this code. In fairness, I've been trying for multiple releases to get a "right" way in. I have no intention of using such hacks, and only ever did so for testing xlogreader timeline following without full logical decoding on standby being available. >> From 8854d44e2227b9d076b0a25a9c8b9df9270b2433 Mon Sep 17 00:00:00 2001 >> From: Craig Ringer <craig@2ndquadrant.com> >> Date: Mon, 5 Sep 2016 15:30:53 +0800 >> Subject: [PATCH 3/3] Logical decoding on standby >> >> * Make walsender aware of ProcSignal and recovery conflicts, make walsender >> exit with recovery conflict on upstream drop database when it has an active >> logical slot on that database. >> * Allow GetOldestXmin to omit catalog_xmin, be called already locked. > > "be called already locked"? To be called with ProcArrayLock already held. But that's actually outdated due to changes Petr requested earlier, thanks for noticing. >> * Send catalog_xmin separately in hot_standby_feedback messages. >> * Store catalog_xmin separately on a physical slot if received in hot_standby_feedback > > What does separate mean? Currently, hot standby feedback sends effectively the min(catalog_xmin, xmin) to the upstream, which in turn records that either in the PGPROC entry or, if there's a slot in use, in the xmin field on the slot. So catalog_xmin on the standby gets promoted to xmin on the master's physical slot. Lots of unnecessary bloat results. This splits it up, so we can send catalog_xmin separately on the wire, and store it on the physical replication slot as catalog_xmin, not xmin. >> * Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin, >> requiring that xlog be emitted before vacuum can remove no longer needed catalogs, store >> it in checkpoints, make vacuum and bgwriter advance it. > > I can't parse that sentence. We now write an xlog record before allowing the catalog_xmin in ProcArray replication_slot_catalog_xmin to advance and allow catalog tuples to be removed. This is achieved by making vacuum use a separate field in ShmemVariableCache, oldestCatalogXmin. When vacuum looks up the new xmin from GetOldestXmin, it copies ProcArray.replication_slot_catalog_xmin to ShmemVariableCache.oldestCatalogXmin, writing an xlog record to ensure we remember the new value and ensure standbys know about it. This provides a guarantee to standbys that all catalog tuples >= ShmemVariableCache.oldestCatalogXmin are protected from vacuum and lets them discover when that threshold advances. The reason we cannot use the xid field in existing vacuum xlog records is that the downstream has no way to know if the xact affected catalogs and therefore whether it should advance its idea of catalog_xmin or not. It can't get a Relation for the affected relfilenode because it can't use the relcache during redo. 
We'd have to add a flag to every vacuum record indicating whether it affected catalogs, which is not fun, and vacuum might not always even know. And the standby would still need a way to keep track of the oldest valid catalog_xmin across restart without the ability to write it to checkpoints. It's a lot simpler and cheaper to have the master do it. >> * Add a new recovery conflict type for conflict with catalog_xmin. Abort >> in-progress logical decoding sessions with conflict with recovery where needed >> catalog_xmin is too old > > Are we retaining WAL for slots broken in that way? Yes, until the slot is dropped. If I added a persistent flag on the slot to indicate that the slot is invalid, then we could ignore it for purposes of WAL retention. It seemed unnecessary at this stage. >> * Make extra efforts to reserve master's catalog_xmin during decoding startup >> on standby. > > What does that mean? WaitForMasterCatalogXminReservation(...) I don't like it. At all. I'd rather have hot standby feedback replies so we can know for sure when the master has locked in our feedback. It's my most disliked part of this patch. >> * Remove checks preventing starting logical decoding on standby > > To me that's too many different things in one commit. A bunch of them > seem like it'd be good if they'd get independent buildfarm cycles too. I agree with you. I had them all separate before and was told that there were too many patches. I also had fixes that spanned multiple patches and were difficult to split up effectively. I'd like to split it roughly along the lines of the bulletted items, but I don't want to do it only to have someone else tell me to just squash it again and waste all the work (again). I'll take the risk I guess. >> diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c >> index d7f65a5..36bbb98 100644 >> --- a/src/backend/access/heap/rewriteheap.c >> +++ b/src/backend/access/heap/rewriteheap.c >> @@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state) >> if (!state->rs_logical_rewrite) >> return; >> >> - ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin); >> + /* Use the catalog_xmin being retained by vacuum */ >> + ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL); > > What does that comment mean? Vacuum isn't the only thing that prunes old > records. I mean to refer to ShmemVariableCache.oldestCatalogXmin, the effective catalog xmin used for record removal, not ProcArray.replication_slot_catalog_xmin, the pending catalog_xmin for local slots. i.e. use the catalog_xmin that we've recorded in WAL and promised to standbys. I agree the comment is unclear. Not sure how to improve it without making it overly long though. >> +/* >> + * Set the global oldest catalog_xmin used to determine when tuples >> + * may be removed from catalogs and user-catalogs accessible from logical >> + * decoding. >> + * >> + * Only to be called from the startup process or by UpdateOldestCatalogXmin(), >> + * which ensures the update is properly written to xlog first. >> + */ >> +void >> +SetOldestCatalogXmin(TransactionId oldestCatalogXmin) >> +{ >> + Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock)); > > Uh, that's long-ish. And doesn't agree with the comment above > (s/startup process/process performing recovery/?). > > This is a long enough list that I'd consider just dropping the assert. Fair enough. 
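Going back to the hot_standby_feedback split described above, the sending side in the walreceiver's feedback routine ends up looking roughly like this. Sketch only: epoch computation is elided, and xmin_epoch / catalog_xmin_epoch are just illustrative local variables.

    TransactionId xmin;
    TransactionId catalog_xmin;

    /* Global xmin, no longer dragged down by slots' catalog_xmin */
    xmin = GetOldestXmin(NULL,
                         false,    /* don't ignore vacuum */
                         true);    /* ignore catalog_xmin */

    /* Oldest catalog_xmin needed by local slots, sent as its own field */
    ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);

    pq_sendbyte(&reply_message, 'h');
    pq_sendint64(&reply_message, GetCurrentTimestamp());
    pq_sendint(&reply_message, xmin, 4);
    pq_sendint(&reply_message, xmin_epoch, 4);
    pq_sendint(&reply_message, catalog_xmin, 4);       /* new field */
    pq_sendint(&reply_message, catalog_xmin_epoch, 4); /* new field */

On the master, the walsender then applies xmin to the physical slot's (or proc's) xmin as before, and stores catalog_xmin in the slot's catalog_xmin rather than folding it into xmin.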
>> + else if (info == XLOG_XACT_CATALOG_XMIN_ADV) >> + { >> + xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record); >> + >> + /* >> + * Unless logical decoding is possible on this node, we don't care about >> + * this record. >> + */ >> + if (!XLogLogicalInfoActive() || max_replication_slots == 0) >> + return; > > Too many negatives for my taste, but whatever. Also removed in latest version, since it turns out not be accurate. I had made the incorrect assumption that our global catalog_xmin was necessarily 0 when wal_level < logical. But this is not the case, per the new TAP tests in latest patch. We can have logical slots from when wal_level was logical still existing with a valid catalog_xmin after we restart into wal_level = replica. >> + /* >> + * Apply the new catalog_xmin limit immediately. New decoding sessions >> + * will refuse to start if their slot is past it, and old ones will >> + * notice when we signal them with a recovery conflict. There's no >> + * effect on the catalogs themselves yet, so it's safe for backends >> + * with older catalog_xmins to still exist. >> + * >> + * We don't have to take ProcArrayLock since only the startup process >> + * is allowed to change oldestCatalogXmin when we're in recovery. >> + */ >> + SetOldestCatalogXmin(xlrec->new_catalog_xmin); > > Which seems to rely on ResolveRecoveryConflictWithLogicalDecoding's > lwlock acquisition for barriers? I don't yet have a really solid grasp of memory ordering and barrier issues in multiprocessing. As I understand it, processes created after this point aren't going to see the old value, they'll fork() with a current snapshot of memory, so either they'll see the new value or they'll be captured by our ResolveRecoveryConflictWithLogicalDecoding() run (assuming they don't exit first). New decoding sessions for existing backends would be an issue. They call EnsureActiveLogicalSlotValid() which performs a volatile read on ShmemVariableCache->oldestCatalogXmin . But that isn't sufficient, is it? We need a write barrier in SetOldestCatalogXmin and a read barrier in EnsureActiveLogicalSlotValid. I'll fix that. Thanks very much. >> +/* >> + * Record when we advance the catalog_xmin used for tuple removal >> + * so standbys find out before we remove catalog tuples they might >> + * need for logical decoding. >> + */ >> +XLogRecPtr >> +XactLogCatalogXminUpdate(TransactionId new_catalog_xmin) >> +{ >> + XLogRecPtr ptr = InvalidXLogRecPtr; >> + >> + if (XLogInsertAllowed()) >> + { >> + xl_xact_catalog_xmin_advance xlrec; >> + >> + xlrec.new_catalog_xmin = new_catalog_xmin; >> + >> + XLogBeginInsert(); >> + XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance); >> + >> + ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV); >> + } > > Huh, why is this test needed and ok? Good point. It isn't anymore. I previously had catalog_xmin advances on replicas running a similar path and skipping xlog. But that was fragile. So now UpdateOldestCatalogXmin() is only called from the master, per the assertion at the start, so it's unnecessary to test for XLogInsertAllowed( ) here. Fixed. >> @@ -9449,6 +9456,16 @@ XLogReportParameters(void) >> XLogFlush(recptr); >> } >> >> + /* >> + * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer >> + * require oldestCatalogXmin in checkpoints and it no longer >> + * makes sense, so update shmem and xlog the change. This will >> + * get written out in the next checkpoint. 
>> + */ >> + if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL && >> + wal_level < WAL_LEVEL_LOGICAL) >> + UpdateOldestCatalogXmin(true); > > What if we crash before this happens? We run XLogReportParameters before we set ControlFile->state = DB_IN_PRODUCTION, so we'd re-run recovery and call it again next time through. But as it turns out the above is neither necessary nor correct anyway, it relies on the invalid assumption that catalog_xmin is 0 when wal_level is < logical. Per above, not the case, so we can't short-circuit catalog_xmin logging tests when wal_level = replica. >> diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c >> index ff633fa..2d16bf0 100644 >> --- a/src/backend/commands/vacuum.c >> +++ b/src/backend/commands/vacuum.c >> @@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel, >> MultiXactId safeMxactLimit; >> >> /* >> + * When logical decoding is enabled, we must write any advance of >> + * catalog_xmin to xlog before we allow VACUUM to remove those tuples. >> + * This ensures that any standbys doing logical decoding can cancel >> + * decoding sessions and invalidate slots if we remove tuples they >> + * still need. >> + */ >> + UpdateOldestCatalogXmin(false); > > I'm on a first read-through through this, but it appears you don't do > anything similar in heap_page_prune()? And we can't just start emitting > loads of additional records there, because it's called much more often... vacuum_set_xid_limits sets OldestXmin in lazy_vacuum_rel, which is the OldestXmin passed to heap_page_prune. I could Assert that heap_page_prune's OldestXmin PrecedesOrEquals ShmemVariableCache->oldestCatalogXmin I guess. It seemed unnecessary. >> + /* >> + * The walreceiver should be running when we try to create a slot. If >> + * we're unlucky enough to catch the walreceiver just as it's >> + * restarting after an error, well, the client can just retry. We don't >> + * bother to sleep and re-check. >> + */ >> + if (!walrcv_running) >> + ereport(ERROR, >> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), >> + errmsg("streaming replication is not active"), >> + errhint("Logical decoding on standby requires that streaming replication be configuredand active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errorsin the logs."))); > > > That seems quite problematic. What if there's a momentaneous connection > failure? This also has the issue that just because you checked that > walrcv_running at some point, doesn't guarantee anything by the time you > actually check. Seems like life were easier if recovery.conf were > guc-ified already - checking for primary_conninfo/primary_slot_name etc > wouldn't have that issue (and can't be changed while running). Yes, I very much wish walreceiver were already GUC-ified. I'd rather test primary_conninfo and primary_slot_name . > Usage of a slot doesn't actually guarantee much in cascased setups, does > it? It doesn't entirely eliminate the potential for a race with catalog removal, but neither does hot_standby_feedback on a non-cascading setup. Right now we tolerate that race and the risk that the slot may become invalid. The application can prevent that by making sure it has a slot on the master and the standby has caught up past the master's lsn at the time of that slot's creation before it creates a slot on the standby. That's part of why the hoop jumping for catalog_xmin advance. 
To make sure we know, for sure, if it's safe to decode from a slot given that we haven't been able to directly enforce our xmin on the master. To get rid of that race without application intervention, we need the ability for a feedback message to flow up the cascade, and a reply that specifically matches that feedback message (or at least that individual downstream node) to flow down the cascade. I'm working on just that, but there's no way it'll be ready for pg10 obviously, and it has some difficult issues. It's actually intended to help prevent conflict with standby cancels shortly after hot_standby starts up, but it'll help with slot creation too. Even with all that, we'll still need some kind of xlog'd catalog_xmin knowledge, because users can do silly things like drop the slot connecting standby to master and re-create it, causing the standby's needed catalog_xmin on the master to become un-pinned. We don't want to risk messily crashing if that happens. >> @@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin, >> * xmin horizons by other backends, get the safe decoding xid, and inform >> * the slot machinery about the new limit. Once that's done the >> * ProcArrayLock can be released as the slot machinery now is >> - * protecting against vacuum. >> + * protecting against vacuum - if we're on the master. If we're running on >> + * a replica, we have to wait until hot_standby_feedback locks in our >> + * needed catalogs, per details on WaitForMasterCatalogXminReservation(). >> * ---- >> */ >> LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); >> @@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin, >> >> ReplicationSlotsComputeRequiredXmin(true); >> >> + if (RecoveryInProgress()) >> + WaitForMasterCatalogXminReservation(slot); >> + >> + Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin, >> + slot->data.catalog_xmin)); >> + >> LWLockRelease(ProcArrayLock); > > I think it's quite a bad idea to do a blocking operation like > WaitForMasterCatalogXminReservation while holding ProcArrayLock. > > >> +/* >> + * Wait until the master's catalog_xmin is set, advancing our catalog_xmin >> + * if needed. Caller must hold exclusive ProcArrayLock, which this function will >> + * temporarily release while sleeping but will re-acquire. > > Ah. I see. Hm :(. Exactly. I'm increasingly inclined to rip that out and make preventing races with master catalog removal the application's problem. Create a slot on the master first, or accept that you may have to retry. > >> + * When we're creating a slot on a standby we can't directly set the >> + * master's catalog_xmin; the catalog_xmin is set locally, then relayed >> + * over hot_standby_feedback. The master may remove the catalogs we >> + * asked to reserve between when we set a local catalog_xmin and when >> + * hs feedback makes that take effect on the master. We need a feedback >> + * reply mechanism here, where: >> + * >> + * - we tentatively reserve catalog_xmin locally > > Will that already trigger recovery conflicts? If we already have local active slots, we'll be using their existing catalog_xmin and there's no issue. If we don't already have local slots the only conflict potential is this backend. It could potentially cause a conflict if we replayed a greatly advanced catalog_xmin from the master before we got the chance to advance our local one accordingly. 
>> + * - we wake the walreceiver by setting its latch >> + * - walreceiver sends hs_feedback >> + * - upstream walsender sends a new 'hs_feedback reply' message with >> + * actual (xmin, catalog_xmin) reserved. >> + * - walreceiver sees reply and updates ShmemVariableCache or some other >> + * handy bit of shmem with hs feedback reservations from reply > > "or some other handy bit"? Ha. Will fix. >> + * - we poll the reservations while we wait >> + * - we set our catalog_xmin to that value, which might be later if >> + * we missed our requested reservation, or might be earlier if >> + * someone else is holding down catalog_xmin on master. We got a hs >> + * feedback reply so we know it's reserved. >> + * >> + * For cascading, the actual reservation will need to cascade up the >> + * chain by walsender setting its own walreceiver's latch in turn, etc. >> + * >> + * For now, we just set the local slot catalog_xmin and sleep until >> + * oldestCatalogXmin equals or passes our reservation. This is fine if we're >> + * the only decoding session, but it is vulnerable to races if slots on the >> + * master or other decoding sessions on other standbys connected to the same >> + * master exist. They might advance their reservation before our hs_feedback >> + * locks it down, allowing vacuum to remove tuples we need. So we might start >> + * decoding on our slot then error with a conflict with recovery when we see >> + * catalog_xmin advance. >> + */ > > I was about to list some of these issues. That's a bit unsatisfying. I concur. I just don't have a better answer. I think I'd like to rip it out and make it the application's problem until we can do it right. > > > Pondering this for a bit, but I'm ~9h into a flight, so maybe not > tonight^Wthis morning^Wwhaaaa. > > >> +static void >> +WaitForMasterCatalogXminReservation(ReplicationSlot *slot) >> +{ > > > This comment seems to duplicate some of the function header > comment. Such duplication usually leads to either or both getting out of > date rather quickly. > > > Not commenting line-by-line on the code here, but I'm extremely doubtful > that this approach is stable enough, and that the effect of holding > ProcArrayLock exclusively over prolonged amounts of time is acceptable. > >> + ReplicationSlotsComputeRequiredXmin(true); >> > Why do we need this? The caller does it too, no? Because we force a walsender update immediately and want a current value. It's cheap enough for running in slot creation. >> + /* Tell the master what catalog_xmin we settled on */ >> + WalRcvForceReply(); >> + >> + /* Reset ps display if we changed it */ >> + if (new_status) >> + { >> + set_ps_display(new_status, false); >> + pfree(new_status); >> + } > > We really shouldn't do stuff like this while holding ProcArrayLock. Yeah, good point. >> +/* >> + * Test to see if the active logical slot is usable. >> + */ >> +static void >> +EnsureActiveLogicalSlotValid() >> +{ > > Missing (void). Augh, C++ still has its tentacles in my brain. >> +/* >> + * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the >> + * passed database oid. The caller should hold an exclusive lock on the database >> + * to ensure no replication slots on the database are in use. > > Stuff like this really should be it's own commit. It can trivially be > tested on its own, is useful on its own (just have DROP DATABASE do it), Agreed, will do. >> + * If we fail here we'll leave the in-memory state of replication slots >> + * inconsistent with its on-disk state, so we need to PANIC. 
> > We worked quite hard to make it extremely unlikely for that to happen in > practice. I also don't see why there should be any new PANICs in this > code. I didn't figure out a sensible way not to. I'll revisit that. >> +void >> +ReplicationSlotsDropDBSlots(Oid dboid) >> +{ >> + int i; >> + >> + if (max_replication_slots <= 0) >> + return; >> + >> + /* >> + * We only need a shared lock here even though we activate slots, >> + * because we have an exclusive lock on the database we're dropping >> + * slots on and don't touch other databases' slots. >> + */ >> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > > Hm? Acquiring a slot always only takes a shared lock, no? > > I don't really see how "database is locked" guarantees enough for your > logic - it's already possible to drop slots from other databases, and > dropping a slot acquires it temporarily? You can drop slots from other DBs. Ugh. Right. That's a frustrating oversight. I'll have to revisit that logic. > >> + for (i = 0; i < max_replication_slots; i++) >> + { >> + ReplicationSlot *s; >> + NameData slotname; >> + int active_pid; >> + >> + s = &ReplicationSlotCtl->replication_slots[i]; >> + >> + /* cannot change while ReplicationSlotCtlLock is held */ >> + if (!s->in_use) >> + continue; >> + >> + /* only logical slots are database specific, skip */ >> + if (!SlotIsLogical(s)) >> + continue; >> + >> + /* not our database, skip */ >> + if (s->data.database != dboid) >> + continue; >> + >> + /* Claim the slot, as if ReplicationSlotAcquire()ing */ >> + SpinLockAcquire(&s->mutex); >> + strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN); >> + NameStr(slotname)[NAMEDATALEN-1] = '\0'; >> + active_pid = s->active_pid; >> + if (active_pid == 0) >> + { >> + MyReplicationSlot = s; >> + s->active_pid = MyProcPid; >> + } >> + SpinLockRelease(&s->mutex); >> + >> + /* >> + * The caller should have an exclusive lock on the database so >> + * we'll never have any in-use slots, but just in case... >> + */ >> + if (active_pid) >> + elog(PANIC, "replication slot %s is in use by pid %d", >> + NameStr(slotname), active_pid); > > So, yea, this doesn't seem ok. Why don't we just ERROR out, instead of > PANICing? There seems to be absolutely no correctness reason for a PANIC > here? We've acquired the slot but it's active by another backend. Something broke. But you're right, PANIC is an over-reaction. >> + /* >> + * To avoid largely duplicating ReplicationSlotDropAcquired() or >> + * complicating it with already_locked flags for ProcArrayLock, >> + * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we >> + * just release our ReplicationSlotControlLock to drop the slot. >> + * >> + * There's no race here: we acquired this slot, and no slot "behind" >> + * our scan can be created or become active with our target dboid due >> + * to our exclusive lock on the DB. >> + */ >> + LWLockRelease(ReplicationSlotControlLock); >> + ReplicationSlotDropAcquired(); >> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > > I don't see much problem with this, but I'd change the code so you > simply do a goto restart; if you released the slot. Then there's a lot > less chance / complications around temporarily releasing > ReplicationSlotControlLock. Good idea. >> + * >> + * If logical decoding information is enabled, we also >> + * send immediate hot standby feedback so as to reduce >> + * the delay before our needed catalogs are locked in. > > "logical decoding information ... 
enabled" XLogLogicalInfoActive() > and "catalogs are locked in" Yeah, fair. > are a bit too imprecise descriptions for my taste. Will adjust. > > >> + xmin = GetOldestXmin(NULL, >> + false, /* don't ignore vacuum */ >> + true /* ignore catalog xmin */); >> + >> + /* >> + * The catalog_Xmin reported by GetOldestXmin is the effective >> + * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance >> + * records from the master. Sending it back to the master would be >> + * circular and prevent its catalog_xmin ever advancing once set. >> + * We should only send the catalog_xmin we actually need for slots. >> + */ >> + ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin); > > Given that you don't have catalog_xmin set by GetOldestXmin that comment > is a bit misleading. It is. Too many revisions with too much time between them. Fixing. >> @@ -1427,19 +1436,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum) >> NormalTransactionIdPrecedes(replication_slot_xmin, result)) >> result = replication_slot_xmin; >> >> + if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel))) >> + { >> + /* >> + * After locks have been released and defer_cleanup_age has been applied, >> + * check whether we need to back up further to make logical decoding >> + * safe. We need to do so if we're computing the global limit (rel = >> + * NULL) or if the passed relation is a catalog relation of some kind. >> + */ >> + if (TransactionIdIsValid(replication_slot_catalog_xmin) && >> + NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result)) >> + result = replication_slot_catalog_xmin; >> + } > > The nesting of these checks, and the comments about them, is a bit > weird. Agree. I didn't find it readable in one test, and it wasn't clear how to comment on just the inner part of the tests without splitting it up. But it's easy enough to merge, I just found it less readable and harder to understand. >> +/* >> + * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated >> + * to reflect an advance in procArray->replication_slot_catalog_xmin or >> + * it becoming newly set or unset. >> + * >> + */ >> +static bool >> +CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin) >> +{ >> + return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin) >> + || (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin))); >> +} > > Your lines are really long - pgindent (which you really should run) will > much this. I think it'd be better to rephrase this. Thanks. Will. IIRC pgindent created a LOT of unrelated noise at the time I was working on it, but I'll recheck. > > >> +/* >> + * If necessary, copy the current catalog_xmin needed by repliation slots to > > Typo: repliation Thanks. >> + * the effective catalog_xmin used for dead tuple removal. >> + * >> + * When logical decoding is enabled we write a WAL record before advancing the >> + * effective value so that standbys find out if catalog tuples they still need >> + * get removed, and can properly cancel decoding sessions and invalidate slots. >> + * >> + * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off >> + * and need to clear the shmem state, since we want to bypass the wal_level >> + * check and force xlog writing. >> + */ >> +void >> +UpdateOldestCatalogXmin(bool force) > > I'm a bit confused by this function and variable name. 
What does > > + TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state > + * is guaranteed to still exist */ > > mean? I complained about the overall justification in the commit > already, but looking at this commit alone, the justification for this > part of the change is quite hard to understand. Standbys have no way to know what catalog row versions are guaranteed to exist. They know, from vacuum xlog records, when we remove row versions, index entries, etc associated with a transaction. But the standby has no way to know if the affected relation is a catalog or not, it only knows the relfilenode. So it can't maintain a local notion of "effective global catalog_xmin on the master as of the last xlog record I replayed". I could add is_catalog flags to all the vacuum xlog records via a secondary struct that's only added when wal_level = logical, but that seems pretty awful and likely to be very noisy. It also wouldn't help the standby know, at startup, what the current catalog_xmin of the master is since it won't be in a checkpoint or the control file. > > >> +{ >> + TransactionId vacuum_catalog_xmin; >> + TransactionId slots_catalog_xmin; >> + >> + /* >> + * If we're not recording logical decoding information, catalog_xmin >> + * must be unset and we don't need to do any work here. > > If we don't need to do any work, shouldn't we return early? Yes. >> + if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force) >> + { >> + XactLogCatalogXminUpdate(slots_catalog_xmin); >> + >> + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); >> + /* >> + * A concurrent updater could've changed these values so we need to re-check >> + * under ProcArrayLock before updating. >> + */ >> + vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin); >> + slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin); > > why are there volatile reads here? Because I didn't understand volatile well enough. It's not a memory barrier and provides no guarantee that we're seeing recent values. It should probably just take ProcArrayLock. >> + if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin)) >> + SetOldestCatalogXmin(slots_catalog_xmin); > > Why don't we check force here, but above? Good point. I've removed force anyway, in the latest revision. Same reason as given above re the StartupXLOG parameters check stuff. >> @@ -2167,14 +2250,20 @@ GetOldestSafeDecodingTransactionId(void) >> oldestSafeXid = ShmemVariableCache->nextXid; >> >> /* >> - * If there's already a slot pegging the xmin horizon, we can start with >> - * that value, it's guaranteed to be safe since it's computed by this >> - * routine initially and has been enforced since. >> + * If there's already an effectiveCatalogXmin held down by vacuum >> + * it's definitely safe to start there, and it can't advance >> + * while we hold ProcArrayLock. > > What does "held down by vacuum" mean? Brain fart. Held down by an existing slot. Comment also needs rewording. >> /* >> + * Notify a logical decoding session that it conflicts with a >> + * newly set catalog_xmin from the master. >> + */ >> +void >> +CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid) >> +{ >> + ProcArrayStruct *arrayP = procArray; >> + int index; >> + >> + /* >> + * We have to scan ProcArray to find the process and set a pending recovery >> + * conflict even though we know the pid. At least we can get the BackendId >> + * and void a ProcSignal scan later. 
>> + * >> + * The pid might've gone away, in which case we got the desired >> + * outcome anyway. >> + */ >> + LWLockAcquire(ProcArrayLock, LW_SHARED); >> + >> + for (index = 0; index < arrayP->numProcs; index++) >> + { >> + int pgprocno = arrayP->pgprocnos[index]; >> + volatile PGPROC *proc = &allProcs[pgprocno]; >> + >> + if (proc->pid == session_pid) >> + { >> + VirtualTransactionId procvxid; >> + >> + GET_VXID_FROM_PGPROC(procvxid, *proc); >> + >> + proc->recoveryConflictPending = true; >> + >> + /* >> + * Kill the pid if it's still here. If not, that's what we >> + * wanted so ignore any errors. >> + */ >> + (void) SendProcSignal(session_pid, >> + PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId); >> + >> + break; >> + } >> + } >> + >> + LWLockRelease(ProcArrayLock); > > Doesn't seem ok to do this while holding ProcArrayLock. Fair enough. And I guess it's safe enough to take and release it, since new processes that start won't be at risk of cancellation so we don't care about whether or not we scan them. >> +/* >> + * Scan to see if any clients are using replication slots that are below the >> + * new catalog_xmin theshold and sigal them to terminate with a recovery >> + * conflict. >> + * >> + * We already applied the new catalog_xmin record and updated the shmem >> + * catalog_xmin state, so new clients that try to use a replication slot >> + * whose on-disk catalog_xmin is below the new threshold will ERROR, and we >> + * don't have to guard against them here. >> + * >> + * Replay can only continue safely once every slot that needs the catalogs >> + * we're going to free for removal is gone. So if any conflicting sessions >> + * exist, wait for any standby conflict grace period then signal them to exit. >> + * >> + * The master might clear its reserved catalog_xmin if all upstream slots are >> + * removed or clear their feedback reservations, sending us >> + * InvalidTransactionId. If we're concurrently trying to create a new slot and >> + * reserve catalogs the InvalidXid reservation report might come in while we >> + * have a slot waiting for hs_feedback confirmation of its reservation. That >> + * would cause the waiting process to get canceled with a conflict with >> + * recovery here since its tentative reservation conflicts with the master's >> + * report of 'nothing reserved'. To allow it to continue to seek a startpoint >> + * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're >> + * still looking for where to start. We'll sometimes notice a conflict but the >> + * slot will advance its catalog_xmin to a more recent nextXid and cease to >> + * conflict when we re-check. (The alternative is to track slots being created >> + * differently to slots actively decoding in shmem, which seems unnecessary. Or >> + * to separate the 'tentative catalog_xmin reservation' of a slot from its >> + * actual needed catalog_xmin.) >> + * >> + * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because >> + * walsender-based logical decoding sessions won't have any virtualxid for much >> + * of their life and the end of their virtualxids doesn't mean the end of a >> + * potential conflict. It would also cancel too aggressively, since it cares >> + * about the backend's xmin and logical decoding only needs the catalog_xmin. >> + */ > > The use of "we" seems confusing here, because it's not the same process. > > Generally I think your comments need to be edited a bit for brevity and > preciseness. Will work on it. Me, verbose? Really? 
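Concretely, for the signalling rework above I'm thinking of something like this (untested sketch, procarray.c context): note what we need while holding the lock, then signal after releasing it.

    BackendId   backend_id = InvalidBackendId;
    int         index;

    LWLockAcquire(ProcArrayLock, LW_SHARED);
    for (index = 0; index < arrayP->numProcs; index++)
    {
        volatile PGPROC *proc = &allProcs[arrayP->pgprocnos[index]];

        if (proc->pid == session_pid)
        {
            proc->recoveryConflictPending = true;
            backend_id = proc->backendId;
            break;
        }
    }
    LWLockRelease(ProcArrayLock);

    /* Signal outside the lock; if the pid's gone, that's the outcome we want */
    if (backend_id != InvalidBackendId)
        (void) SendProcSignal(session_pid,
                              PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
                              backend_id);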
>> +void >> +ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin) >> +{ >> + int i; >> + >> + if (!InHotStandby) >> + /* nobody can be actively using logical slots */ >> + return; >> + >> + /* Already applied new limit, can't have replayed later one yet */ >> + Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin); >> + >> + /* >> + * Find the first conflicting active slot and wait for it to be free, >> + * signalling it if necessary, then repeat until there are no more >> + * conflicts. >> + */ >> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); >> + for (i = 0; i < max_replication_slots; i++) >> + { > > I'm pretty strongly against any code outside of slot.c doing this. IIRC I originally tried to do that as part of slot.c but found that it resulted in other ugliness relating to access to other structures. But I can't remember what anymore, so I'll revisit it. > > >> @@ -2789,12 +2797,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason) >> Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending)); >> >> /* >> - * All conflicts apart from database cause dynamic errors where the >> - * command or transaction can be retried at a later point with some >> - * potential for success. No need to reset this, since non-retryable >> - * conflict errors are currently FATAL. >> + * All conflicts apart from database and catalog_xmin cause dynamic >> + * errors where the command or transaction can be retried at a later >> + * point with some potential for success. No need to reset this, since >> + * non-retryable conflict errors are currently FATAL. >> */ >> - if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE) >> + if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE || >> + reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN) >> RecoveryConflictRetryable = false; >> } > > Hm. Why is this a non-retryable error? The global catalog_xmin isn't going to go backwards, so if the slot needs a given catalog_xmin and we want to discard it.... ... then we should give it a while to catch up. Right. It should be retryable. > Ok, landing soon. Gotta finish here. Greatly appreciated, I know it's not the nicest to review. > > 0002 should be doable as a whole this release, I have severe doubts that > 0003 as a whole has a chance for 10 - the code is in quite a raw shape, > there's a significant number of open ends. I'd suggest breaking of bits > that are independently useful, and work on getting those committed. I'll be doing that, yes. I really want some way to create slots on replicas, advance them to follow the master's position, and have them able to be used after promotion to master. I don't think actually live decoding on replica is ready yet, though I'd find the ability to shift decoding workloads to replicas rather nice when it is ready. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
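For reference, the restart-scan shape suggested earlier in this mail for ReplicationSlotsDropDBSlots() would look roughly like this (untested sketch, meant to live in slot.c; ERROR rather than PANIC if a slot is unexpectedly active):

restart:
    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        int active_pid;

        if (!s->in_use || !SlotIsLogical(s) || s->data.database != dboid)
            continue;

        /* Claim the slot, as ReplicationSlotAcquire() would */
        SpinLockAcquire(&s->mutex);
        active_pid = s->active_pid;
        if (active_pid == 0)
            s->active_pid = MyProcPid;
        SpinLockRelease(&s->mutex);

        if (active_pid != 0)
        {
            LWLockRelease(ReplicationSlotControlLock);
            elog(ERROR, "replication slot \"%s\" is in use by pid %d",
                 NameStr(s->data.name), active_pid);
        }

        MyReplicationSlot = s;

        /* Drop it, then rescan from scratch under a fresh lock */
        LWLockRelease(ReplicationSlotControlLock);
        ReplicationSlotDropAcquired();
        goto restart;
    }
    LWLockRelease(ReplicationSlotControlLock);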
On 21 March 2017 at 02:21, Craig Ringer <craig@2ndquadrant.com> wrote: > On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote: > >>> Subject: [PATCH 2/3] Follow timeline switches in logical decoding >> >> FWIW, the title doesn't really seem accurate to me. > > Yeah, it's not really at the logical decoding layer at all. > > "Teach xlogreader to follow timeline switches" ? Happy with that. I think Craig has addressed Andres' issues with this patch, so I will apply later today as planned using that name. The larger Logical Decoding on Standby patch will not be applied yet, and not without further changes, per review. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 21 March 2017 at 09:05, Craig Ringer <craig@2ndquadrant.com> wrote: > Thanks, that's a helpful point. The commit in question is 978b2f65. I > didn't notice that it introduced XLogReader use in twophase.c, though > I should've realised given the discussion about fetching recent 2pc > info from xlog. I don't see any potential for issues at first glance, > but I'll go over it in more detail. The main concern is validity of > ThisTimeLineID, but since it doesn't run in recovery I don't see much > of a problem there. That also means it can afford to use the current > timeline-oblivious read_local_xlog_page safely. > > TAP tests for 2pc were added by 3082098. I'll check to make sure they > have appropriate coverage for this. The TAP tests pass fine, and I can't see any likely issues either. XLogReader for 2PC doesn't happen on standby, and RecoveryInProgress() will update ThisTimeLineID on promotion. >> Did you check whether ThisTimeLineID is actually always valid in the >> processes logical decoding could run in? IIRC it's not consistently >> update during recovery in any process but the startup process. > > I share your concerns that it may not be well enough maintained. > Thankyou for the reminder, that's been on my TODO and got lost when I > had to task-hop to other priorities. The main place we maintain ThisTimeLineID (outside StartupXLOG of course) is in walsender's GetStandbyFlushRecPtr, which calls GetWalRcvWriteRecPtr. That's not used in walsender's logical decoding or in the SQL interface. I've changed the order of operations in read_local_xlog_page to ensure that RecoveryInProgress() updates ThisTimeLineID if we're promoted, and made it update ThisTimeLineID from GetXLogReplayRecPtr otherwise. pg_logical_slot_get_changes_guts was fine already. Because xlog read callbacks must not attempt to read pages past the flush limit (master) or replay limit (standby), it doesn't matter if ThisTimeLineID is completely up to date, only that it's valid as-of that LSN. I did identify one problem. The startup process renames the last segment in a timeline to .partial when it processes a timeline switch. See xlog.c:7597. So if we have the following order of operations: * Update ThisTimeLineID to 2 at latest redo ptr * XLogReadDetermineTimeline chooses timeline 2 to read from * startup process replays timeline switch to TL 3 and renames last segment in old timeline to .partial * XLogRead() tries to open segment with TL 2 we'll fail. I don't think it matters much though. We're not actually supporting streaming decoding from standby this release by the looks, and even if we did the effect would be limited to an ERROR and a reconnect. It doesn't look like there's really any sort of lock or other synchronisation we can rely on to prevent this, and we should probably just live with it. If we have already opened the segment we'll just keep reading from it without noticing the rename; if we haven't and are switching to it just as it's renamed we'll ERROR when we try to open it. I had cascading and promotion tests in progress for decoding on standby, but doubt there's much point finishing them off now that it's not likely that decoding on standby can be added for this CF. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
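For clarity, the reworked ordering at the top of read_local_xlog_page's wait loop is roughly (simplified sketch):

    if (!RecoveryInProgress())
    {
        /*
         * We're the master, or were just promoted; RecoveryInProgress()
         * has initialised ThisTimeLineID if recovery just ended.
         */
        read_upto = GetFlushRecPtr();
    }
    else
    {
        /* Follow the replay position and keep ThisTimeLineID current */
        read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
    }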
Hi all Updated timeline following patch attached. There's a change in read_local_xlog_page to ensure we maintain ThisTimeLineID properly, otherwise it's just comment changes. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 22 March 2017 at 10:51, Craig Ringer <craig@2ndquadrant.com> wrote: > Hi all > > Updated timeline following patch attached. > > There's a change in read_local_xlog_page to ensure we maintain > ThisTimeLineID properly, otherwise it's just comment changes. OK, so we're looking OK with the TL following. I'm splitting up the rest of the decoding on standby patch set with the goal of getting minimal functionality for creating and managing slots on standbys in, so we can maintain slots on standbys and use them when the standby is promoted to master. The first, to send catalog_xmin separately from the global xmin on hot_standby_feedback and store it in the upstream physical slot's catalog_xmin, is attached. These are extracted directly from the logical decoding on standby patch, with the comments Petr and Andres made about the relevant code addressed. I will next be working on a bare-minimum facility for creating and advancing logical slots on a replica without support for buffering changes, creating historic snapshots or invoking an output plugin. The slots will become usable after the replica is promoted. They'll track their own restart_lsn, etc, and will keep track of xids so they can manage their catalog_xmin, so there'll be no need for dodgy slot syncing from the master, but they'll avoid most of the complex and messy bits. The application will be expected to make sure a slot on the master exists and is advanced before the corresponding slot on the replica to protect required catalogs. Then if there's agreement that it's the right way forward I can add the catalog_xmin xlog'ing stuff as the next patch. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2017-03-21 09:05:26 +0800, Craig Ringer wrote: > > 0002 should be doable as a whole this release, I have severe doubts that > > 0003 as a whole has a chance for 10 - the code is in quite a raw shape, > > there's a significant number of open ends. I'd suggest breaking of bits > > that are independently useful, and work on getting those committed. > > That would be my preference too. > The parts I think are important for Pg10 are: > * Ability to create logical slots on replicas Doesn't this also imply recovery conflicts on DROP DATABASE? Besides, allowing to drop all slots using a database upon DROP DATABASE, is a useful thing on its own. But I have to admit, I've *severe* doubts about getting the whole infrastructure for slot creation on replica into 10. The work is far from ready, and we're mere days away from freeze. > * Ability to advance (via feedback or via SQL function) - no need to > actually decode and call output plugins at al That pretty much requires decoding, otherwise you really don't know how much WAL you have to retain. > * Ability to drop logical slots on replicas That shouldn't actually require any changes, no? Greetings, Andres Freund
On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote: > But I have to admit, I've *severe* doubts about getting the whole > infrastructure for slot creation on replica into 10. The work is far > from ready, and we're mere days away from freeze. If Craig has to guess what would be acceptable, then it's not long enough. It would be better if you could outline a specific approach so he can code it. Coding takes about a day for most things, since Craig knows the code and what we're trying to achieve. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-22 14:58:29 +0000, Simon Riggs wrote: > On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote: > > > But I have to admit, I've *severe* doubts about getting the whole > > infrastructure for slot creation on replica into 10. The work is far > > from ready, and we're mere days away from freeze. > > If Craig has to guess what would be acceptable, then its not long enough. I don't know what you're on about with that statement. I've spent a good chunk of time looking at the 0003 patch, even though it's large and contains a lot of different things. I suggested splitting things up. I even suggested what to move earlier after Craig agreed with that sentiment, in the mail you're replying to, because it seems independently doable. > It would be better if you could outline a specific approach so he can > code it. Coding takes about a day for most things, since Craig knows > the code and what we're trying to achieve. I find that fairly unconvincing. What we have here is a patch that isn't close to being ready and contains a lot of complicated pieces, a couple of days before freeze. If we can pull useful pieces out: great. But it's too late for major new development. Greetings, Andres Freund
On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote: >> The parts I think are important for Pg10 are: > >> * Ability to create logical slots on replicas > > Doesn't this also imply recovery conflicts on DROP DATABASE? Not needed until the slot is in use, which is a later patch. > Besides, > allowing to drop all slots using a database upon DROP DATABASE, is a > useful thing on its own. Sure but that's a separate feature unrelated to this patch and we're running out of time. >> * Ability to advance (via feedback or via SQL function) - no need to >> actually decode and call output plugins at al > > That pretty much requires decoding, otherwise you really don't know how > much WAL you have to retain. Knowing how much WAL to retain is key. Why would decoding tell you how much WAL to retain? We tried to implement this automatically from the master, which was rejected. So the only other way is manually. We need one or the other. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote: > I'm splitting up the rest of the decoding on standby patch set with > the goal of getting minimal functionality for creating and managing > slots on standbys in, so we can maintain slots on standbys and use > them when the standby is promoted to master. > > The first, to send catalog_xmin separately to the global xmin on > hot_standby_feedback and store it in the upstream physical slot's > catalog_xmin, is attached. > > These are extracted directly from the logical decoding on standby > patch, with comments by Petr and Andres made re the relevant code > addressed. I've reduced your two patches back to one with a smaller blast radius. I'll commit this tomorrow morning, barring objections. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-22 15:59:42 +0000, Simon Riggs wrote: > On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote: > > >> The parts I think are important for Pg10 are: > > > >> * Ability to create logical slots on replicas > > > > Doesn't this also imply recovery conflicts on DROP DATABASE? > > Not needed until the slot is in use, which is a later patch. Hm? We need to drop slots, if they can exist / be created, on a standby, and they're on a dropped database. Otherwise we'll reserve resources, while anyone connecting to the slot will likely just receive errors because the database doesn't exist anymore. It's also one of the patches that can quite easily be developed / reviewed, because there really isn't anything complicated about it. Most of the code is already in Craig's patch, it just needs some adjustments. > > Besides, > > allowing to drop all slots using a database upon DROP DATABASE, is a > > useful thing on its own. > > Sure but that's a separate feature unrelated to this patch and we're > running out of time. Hm? The patch implemented it. > >> * Ability to advance (via feedback or via SQL function) - no need to > >> actually decode and call output plugins at al > > > > That pretty much requires decoding, otherwise you really don't know how > > much WAL you have to retain. > > Knowing how much WAL to retain is key. > > Why would decoding tell you how much WAL to retain? Because decoding already has the necessary logic? (You need to retain enough WAL to restart decoding for all in-progress transactions etc). > We tried to implement this automatically from the master, which was > rejected. So the only other way is manually. We need one or the other. I think the current approach is roughly the right way - but that doesn't make the patch ready. Greetings, Andres Freund
On 22 March 2017 at 21:06, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2017-03-21 09:05:26 +0800, Craig Ringer wrote: >> > 0002 should be doable as a whole this release, I have severe doubts that >> > 0003 as a whole has a chance for 10 - the code is in quite a raw shape, >> > there's a significant number of open ends. I'd suggest breaking of bits >> > that are independently useful, and work on getting those committed. >> >> That would be my preference too. > > >> The parts I think are important for Pg10 are: > >> * Ability to create logical slots on replicas > > Doesn't this also imply recovery conflicts on DROP DATABASE? Besides, > allowing to drop all slots using a database upon DROP DATABASE, is a > useful thing on its own. Definitely beneficial, otherwise recovery will stop until you drop slots, which isn't ideal. >> * Ability to advance (via feedback or via SQL function) - no need to >> actually decode and call output plugins at al > > That pretty much requires decoding, otherwise you really don't know how > much WAL you have to retain. Yes, and to update restart_lsn and catalog_xmin correctly. I was thinking that by disallowing snapshot use and output plugin invocation we'd avoid the need to support cancellation on recovery conflicts, etc, simplifying things considerably. >> * Ability to drop logical slots on replicas > > That shouldn't actually require any changes, no? It doesn't, it works as-is. I have NFI why I wrote that. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2017-03-23 06:55:53 +0800, Craig Ringer wrote: > On 22 March 2017 at 21:06, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > > > On 2017-03-21 09:05:26 +0800, Craig Ringer wrote: > >> > 0002 should be doable as a whole this release, I have severe doubts that > >> > 0003 as a whole has a chance for 10 - the code is in quite a raw shape, > >> > there's a significant number of open ends. I'd suggest breaking of bits > >> > that are independently useful, and work on getting those committed. > >> > >> That would be my preference too. > > > > > >> The parts I think are important for Pg10 are: > > > >> * Ability to create logical slots on replicas > > > > Doesn't this also imply recovery conflicts on DROP DATABASE? Besides, > > allowing to drop all slots using a database upon DROP DATABASE, is a > > useful thing on its own. > > Definitely beneficial, otherwise recovery will stop until you drop > slots, which isn't ideal. s/isn't ideal/not acceptable/ ;) > >> * Ability to advance (via feedback or via SQL function) - no need to > >> actually decode and call output plugins at al > > > > That pretty much requires decoding, otherwise you really don't know how > > much WAL you have to retain. > > Yes, and to update restart_lsn and catalog_xmin correctly. > I was thinking that by disallowing snapshot use and output plugin > invocation we'd avoid the need to support cancellation on recovery > conflicts, etc, simplifying things considerably. That seems like it'd end up being pretty hacky - the likelihood that we'd run into snapbuild error cross-checks seems very high. Greetings, Andres Freund
On 23 March 2017 at 00:17, Andres Freund <andres@anarazel.de> wrote: > On 2017-03-22 15:59:42 +0000, Simon Riggs wrote: >> On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote: >> >> >> The parts I think are important for Pg10 are: >> > >> >> * Ability to create logical slots on replicas >> > >> > Doesn't this also imply recovery conflicts on DROP DATABASE? >> >> Not needed until the slot is in use, which is a later patch. > > Hm? We need to drop slots, if they can exist / be created, on a standby, > and they're on a dropped database. Otherwise we'll reserve resources, > while anyone connecting to the slot will likely just receive errors > because the database doesn't exist anymore. It's also one of the > patches that can quite easily be developed / reviewed, because there > really isn't anything complicated about it. Most of the code is already > in Craig's patch, it just needs some adjustments. Right, I'm not too concerned about doing that, and it's next on my TODO as I clean up the split patch series. >> >> * Ability to advance (via feedback or via SQL function) - no need to >> >> actually decode and call output plugins at al >> > >> > That pretty much requires decoding, otherwise you really don't know how >> > much WAL you have to retain. >> >> Knowing how much WAL to retain is key. >> >> Why would decoding tell you how much WAL to retain? > > Because decoding already has the necessary logic? (You need to retain > enough WAL to restart decoding for all in-progress transactions etc). Indeed; after all, standby status updates from the decoding client only contain the *flushed* LSN. The downstream doesn't know the restartpoint LSN, it must be tracked by the upstream. It's also necessary to maintain our catalog_xmin correctly on the standby so we can send it via hot_standby_feedback to a physical replication slot used on the master, ensuring the master doesn't remove catalog tuples we may still need. > I don't know what you're on about with that statement. I've spent a > good chunk of time looking at the 0003 patch, even though it's large > and contains a lot of different things. I suggested splitting things up. > I even suggested what to move earlier after Craig agreed with that > sentiment, in the mail you're replying to, because it seems > independently doable. I really appreciate the review, as I'm all too aware of how time consuming it can be. From my PoV, the difficulty I'm in is that this patch series has languished for most of the Pg 10 release cycle with no real input from stakeholders in the logical decoding area, so while the review is important, the fact that it's now means that it pretty comprehensively blocks the patch for Pg 10. I asked on list for input on structure (if/how to split it up) literally months ago, for example. I've been struggling to get some kind of support for logical decoding on standbys for most of two release cycles, and there are people climbing the walls wanting it. I'm trying to make sure it's done right, but I can't do that alone, and it's hard to progress when I don't know what will be expected until it's too late to do anything about it. I guess all we can do at this point is get the foundations in place and carry on for Pg 11, where the presence of in-core logical replication will offer a lever to actually push this in. In the mean time I'll have to continue carrying the out-of-tree failover slots patch for people who use logical decoding and want it to be reliable. 
-- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 23 March 2017 at 07:31, Andres Freund <andres@anarazel.de> wrote: > On 2017-03-23 06:55:53 +0800, Craig Ringer wrote: >> I was thinking that by disallowing snapshot use and output plugin >> invocation we'd avoid the need to support cancellation on recovery >> conflicts, etc, simplifying things considerably. > > That seems like it'd end up being pretty hacky - the likelihood that > we'd run into snapbuild error cross-checks seems very high. TBH I'm not following this. But I haven't touched snapbuild much yet, Petr's done much more with snapbuild than I have. We're not going to have robust logical replication that's suitable for HA and failover use on high load systems until 2020 or so, with Pg 12. We'll need concurrent decoding and apply, which nobody's even started on AFAIK, we'll need sequence replication, and more. So I'd really, really like to get some kind of HA picture other than "none" in for logical decoding based systems. If it's imperfect, it's still something. I wish we'd just proceeded with failover slots. They were blocked in favour of decoding on standby, and HA is possible if we have decoding on standby with some more work by the application. But now we'll have neither. If we'd just done failover slots we could've had logical replication able to follow failover in Pg 10. What do _you_ see as the minimum acceptable way to achieve the ability for a logical decoding client to follow failover of an upstream to a physical standby? In the end, you're one of the main people whose view carries weight in this area, and I don't want to develop yet another approach only to have it knocked back once the work is done. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2017-03-23 09:14:07 +0800, Craig Ringer wrote: > On 23 March 2017 at 07:31, Andres Freund <andres@anarazel.de> wrote: > > On 2017-03-23 06:55:53 +0800, Craig Ringer wrote: > > >> I was thinking that by disallowing snapshot use and output plugin > >> invocation we'd avoid the need to support cancellation on recovery > >> conflicts, etc, simplifying things considerably. > > > > That seems like it'd end up being pretty hacky - the likelihood that > > we'd run into snapbuild error cross-checks seems very high. > > TBH I'm not following this. But I haven't touched snapbuild much yet, > Petr's done much more with snapbuild than I have. We can't just assume that snapbuild is going to work correctly when its prerequisite - a pinned xmin horizon - isn't working. > We're not going to have robust logical replication that's suitable for > HA and failover use on high load systems until 2020 or so, with Pg 12. > We'll need concurrent decoding and apply, which nobody's even started > on AFAIK, we'll need sequence replication, and more. These seem largely unrelated to the topic at hand (nor do I agree on all of them). > So I'd really, really like to get some kind of HA picture other than > "none" in for logical decoding based systems. If it's imperfect, it's > still something. I still think decoding-on-standby is simply not the right approach as the basic/first HA approach for logical rep. It's a nice later-on feature. But that's an irrelevant aside. I don't understand why you're making a "fundamental" argument here - I'm not arguing against the goals of the patch at all. I want as much stuff committed as we can in good shape. > What do _you_ see as the minimum acceptable way to achieve the ability > for a logical decoding client to follow failover of an upstream to a > physical standby? In the end, you're one of the main people whose view > carries weight in this area, and I don't want to develop yet another I think your approach here wasn't that bad? There's a lot of cleaning up/shoring up needed, and we probably need a smarter feedback system. I don't think anybody here has objected to the fundamental approach? Greetings, Andres Freund
On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote: > We can't just assume that snapbuild is going to work correctly when it's > prerequisites - pinned xmin horizon - isn't working. Makes sense. >> What do _you_ see as the minimum acceptable way to achieve the ability >> for a logical decoding client to follow failover of an upstream to a >> physical standby? In the end, you're one of the main people whose view >> carries weight in this area, and I don't want to develop yet another > > I think your approach here wasn't that bad? There's a lot of cleaning > up/shoring up needed, and we probably need a smarter feedback system. I > don't think anybody here has objected to the fundamental approach? That's useful, thanks. I'm not arguing that the patch as it stands is ready, and appreciate the input re the general design. > I still think decoding-on-standby is simply not the right approach as > the basic/first HA approach for logical rep. It's a nice later-on > feature. But that's an irrelevant aside. I don't really agree that it's irrelevant. Right now Pg has no HA capability for logical decoding clients. We've now added logical replication, but it has no way to provide for upstream node failure and ensure a consistent switch-over, whether to a logical or physical replica. Since real world servers fail or need maintenance, this is kind of a problem for practical production use. Because of transaction serialization for commit-time order replay, logical replication experiences saw-tooth replication lag, where large or long xacts such as batch jobs effectively stall all later xacts until they are fully replicated. We cannot currently start replicating a big xact until it commits on the upstream, so that lag can easily be ~2x the runtime on the upstream. So while you can do sync rep on a logical standby, it tends to result in big delays on COMMITs relative to physical rep, even if apps are careful to keep transactions small. When the app DR planning people come and ask you what the max data loss window / max sync rep lag is, you have to say ".... dunno? depends on what else was running on the server at the time." AFAICT, changing those things will require the ability to begin streaming reorder buffers for big xacts before commit, which as the logical decoding on 2PC thread shows is not exactly trivial. We'll also need to be able to apply them concurrently with other xacts on the other end. Those are both big and complex things IMO, and I'll be surprised if we can do either in Pg11 given that AFAIK nobody has even started work on either of them or has a detailed plan. Presuming we get some kind of failover to logical replica upstreams into Pg11, it'll have significant limitations relative to what we can deliver to users by using physical replication. Especially when it comes to bounded-time lag for HA, sync rep, etc. And I haven't seen a design for it, though Petr and I have discussed some with regards to pglogical. That's why I think we need to do HA on the physical side first. Because it's going to take a long time to get equivalent functionality for logical rep based upstreams, and when it does arrive we'll still have to teach management tools and other non-logical-rep logical decoding clients about the new way of doing things. Whereas getting physical HA setups to support logical downstreams requires only relatively minor changes and gets us all the physical HA features _now_.
That's why we pursued failover slots - as a simple, minimal solution to allowing logical decoding clients to interoperate with Pg in a physical HA configuration. TBH, I still think we should just add them. Sure, they don't help us achieve decoding on standby, but they're a lot simpler and they help Pg's behaviour with slots match user expectations for how the rest of Pg behaves, i.e. if it's on the master it'll be on the replica too. And as you've said, decoding on standby is a nice-to-have, whereas I think some kind of HA support is rather more important. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 23 March 2017 at 00:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote: > >> I'm splitting up the rest of the decoding on standby patch set with >> the goal of getting minimal functionality for creating and managing >> slots on standbys in, so we can maintain slots on standbys and use >> them when the standby is promoted to master. >> >> The first, to send catalog_xmin separately to the global xmin on >> hot_standby_feedback and store it in the upstream physical slot's >> catalog_xmin, is attached. >> >> These are extracted directly from the logical decoding on standby >> patch, with comments by Petr and Andres made re the relevant code >> addressed. > > I've reduced your two patches back to one with a smaller blast radius. > > I'll commit this tomorrow morning, barring objections. Thanks. I was tempted to refactor GetOldestXmin to use flags myself, but thought it might be at higher risk of objections. Since Eiji Seki has shown that there are other uses for excluding particular things from GetOldestXmin, and that's committed now, it's nice to have the impact of this patch reduced. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2017-03-23 12:14:02 +0800, Craig Ringer wrote: > On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote: > > I still think decoding-on-standby is simply not the right approach as > > the basic/first HA approach for logical rep. It's a nice later-on > > feature. But that's an irrelevant aside. > > I don't really agree that it's irrelevant. I'm not sure we have enough time for either getting some parts of your patch in, or for figuring out long term goals. But we definitely don't have time for both. - Andres
On 23 March 2017 at 12:41, Andres Freund <andres@anarazel.de> wrote: > On 2017-03-23 12:14:02 +0800, Craig Ringer wrote: >> On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote: >> > I still think decoding-on-standby is simply not the right approach as >> > the basic/first HA approach for logical rep. It's a nice later-on >> > feature. But that's an irrelevant aside. >> >> I don't really agree that it's irrelevant. > > I'm not sure we have enough time for either getting some parts of your > patch in, or for figuring out long term goals. But we definitely don't > have time for both. Fair. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 23 March 2017 at 00:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote: > >> I'm splitting up the rest of the decoding on standby patch set with >> the goal of getting minimal functionality for creating and managing >> slots on standbys in, so we can maintain slots on standbys and use >> them when the standby is promoted to master. >> >> The first, to send catalog_xmin separately to the global xmin on >> hot_standby_feedback and store it in the upstream physical slot's >> catalog_xmin, is attached. >> >> These are extracted directly from the logical decoding on standby >> patch, with comments by Petr and Andres made re the relevant code >> addressed. > > I've reduced your two patches back to one with a smaller blast radius. > > I'll commit this tomorrow morning, barring objections. This needs rebasing on top of commit af4b1a0869bd3bb52e5f662e4491554b7f611489 Author: Simon Riggs <simon@2ndQuadrant.com> Date: Wed Mar 22 16:51:01 2017 +0000 Refactor GetOldestXmin() to use flags Replace ignoreVacuum parameter with more flexible flags. Author: Eiji Seki Review: Haribabu Kommi That patch ended up using PROCARRAY flags directly as flags to GetOldestXmin, so it doesn't make much sense to add a flag like PROCARRAY_REPLICATION_SLOTS . There won't be any corresponding PROC_ flag for PGXACT->vacuumFlags, since replication slot xmin and catalog_xmin are global state not tracked in individual proc entries. Rather than add some kind of "PROC_RESERVED" flag in proc.h that would never be used and only exist to reserve a bit for use for PROCARRAY_REPLICATION_SLOTS, which we'd use as a flag to GetOldestXmin, I added a new argument to GetOldestXmin like the prior patch did. If preferred I can instead add proc.h: #define PROC_RESERVED 0x20 procarray.h: #define PROCARRAY_REPLICATION_SLOTS 0x20 and then test for (flags & PROCARRAY_REPLICATION_SLOTS) but that's kind of ugly to say the least, I'd rather just add another argument. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 23 March 2017 at 16:07, Craig Ringer <craig@2ndquadrant.com> wrote: > If preferred I can instead add > > proc.h: > > #define PROC_RESERVED 0x20 > > procarray.h: > > #define PROCARRAY_REPLICATION_SLOTS 0x20 > > and then test for (flags & PROCARRAY_REPLICATION_SLOTS) Attached done that way. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
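For what it's worth, the flag-based variant boils down to roughly the fragment below near the end of GetOldestXmin() (a sketch only: replication_slot_xmin here is assumed to be a local copy taken from procArray->replication_slot_xmin while ProcArrayLock was still held, and the final flag name is whichever one ends up committed, not something established by this mail):

    /*
     * Sketch: unless the caller asked us to ignore replication slots,
     * clamp the computed horizon by the slots' xmin so tuples the slots
     * still need are not reported as removable.
     */
    if (!(flags & PROCARRAY_REPLICATION_SLOTS) &&
        TransactionIdIsValid(replication_slot_xmin) &&
        TransactionIdPrecedes(replication_slot_xmin, result))
        result = replication_slot_xmin;

The point of the flag, as with the extra-argument version, is to let the hot_standby_feedback path compute an xmin with the slot horizons excluded, so that xmin and catalog_xmin can be reported separately rather than folded into one value.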
On 23 March 2017 at 17:44, Craig Ringer <craig@2ndquadrant.com> wrote: Minor update to catalog_xmin walsender patch to fix failure to parenthesize definition of PROCARRAY_PROC_FLAGS_MASK . This one's ready to go. Working on drop slots on DB drop now. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 24 March 2017 at 06:23, Craig Ringer <craig@2ndquadrant.com> wrote: > On 23 March 2017 at 17:44, Craig Ringer <craig@2ndquadrant.com> wrote: > > Minor update to catalog_xmin walsender patch to fix failure to > parenthesize definition of PROCARRAY_PROC_FLAGS_MASK . > > This one's ready to go. Working on drop slots on DB drop now. Committed. Next! -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote: > Have you checked how high the overhead of XLogReadDetermineTimeline is? > A non-local function call, especially into a different translation-unit > (no partial inlining), for every single page might end up being > noticeable. That's fine in the cases it actually adds functionality, > but for a master streaming out data, that's not actually adding > anything. I haven't been able to measure any difference. But, since we require the caller to ensure a reasonably up to date ThisTimeLineID, maybe it's worth adding an inlineable function for the fast-path that tests the "page cached" and "timeline is current and unchanged" conditions? //xlogutils.h: static inline void XLogReadDetermineTimeline(...) { ... first test for page already read-in and valid ... ... second test for ThisTimeLineId ... XLogReadCheckTimeLineChange(...) } XLogReadCheckTimeLineChange(...) { ... rest of function } (Yes, I know "inline" means little, but it's a hint for readers) I'd rather avoid using a macro since it'd be pretty ugly, but it's also an option if an inline func is undesirable. #define XLOG_READ_DETERMINE_TIMELINE \ do { \ ... same as above ... } while (0); Can be done after CF if needed anyway, it's just fiddling some code around. Figured I'd mention though. >> + /* >> + * To avoid largely duplicating ReplicationSlotDropAcquired() or >> + * complicating it with already_locked flags for ProcArrayLock, >> + * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we >> + * just release our ReplicationSlotControlLock to drop the slot. >> + * >> + * There's no race here: we acquired this slot, and no slot "behind" >> + * our scan can be created or become active with our target dboid due >> + * to our exclusive lock on the DB. >> + */ >> + LWLockRelease(ReplicationSlotControlLock); >> + ReplicationSlotDropAcquired(); >> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > > I don't see much problem with this, but I'd change the code so you > simply do a goto restart; if you released the slot. Then there's a lot > less chance / complications around temporarily releasing > ReplicationSlotControlLock I don't quite get this. I suspect I'm just not seeing the implications as clearly as you do. Do you mean we should restart the whole scan of the slot array if we drop any slot? That'll be O(n log m) but since we don't expect to be working on a big array or a lot of slots it's unlikely to matter. The patch coming soon will assume we'll restart the whole scan, anyway. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
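To make the xlogutils.h sketch above a little more concrete, the split could look roughly like this; XLogReadCheckTimeLineChange() and the two predicate helpers are placeholders for the existing body and its checks, not functions that exist today:

    /* xlogutils.h -- rough sketch only; helper names are placeholders */
    extern void XLogReadCheckTimeLineChange(XLogReaderState *state,
                                            XLogRecPtr wantPage,
                                            uint32 wantLength);

    static inline void
    XLogReadDetermineTimeline(XLogReaderState *state,
                              XLogRecPtr wantPage, uint32 wantLength)
    {
        /*
         * Fast path: the requested page has already been read in and
         * validated, and the caller's ThisTimeLineID hasn't changed since,
         * so the cached timeline answer still holds.
         */
        if (WantedPageAlreadyValid(state, wantPage, wantLength) &&
            CachedTimeLineStillCurrent(state))
            return;

        /* Slow path: timeline lookup and switch handling stays out of line. */
        XLogReadCheckTimeLineChange(state, wantPage, wantLength);
    }

A master streaming its own timeline would then pay only for the two inlined tests per page, with the out-of-line call reserved for the cases where timeline following actually has work to do.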
Hi, Here's the next patch in the split-up series: drop db-specific (logical) replication slots on DROP DATABASE. Current behaviour is to ERROR if logical slots exist on the DB, whether in-use or not. With this patch we can DROP a database if it has logical slots so long as they are not active. I haven't added any sort of syntax for this, it's just done unconditionally. I don't see any sensible way to stop a slot from becoming active after we check for active slots and begin the actual database DROP, since ReplicationSlotAcquire will happily acquire a db-specific slot for a different DB and the only lock it takes is a shared lock on ReplicationSlotControlLock, which we presumably don't want to hold throughout DROP DATABASE. So this patch makes ReplicationSlotAcquire check that the slot database matches the current database and refuse to acquire the slot if it does not. The only sensible reason to acquire a slot from a different DB is to drop it, and then it's only a convenience at best. Slot drop is the only user-visible behaviour change, since all other activity on logical slots happens when the backend is already connected to the slot's DB. Appropriate docs changes have been made and tests added. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
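The ReplicationSlotAcquire() change described above is essentially a guard once the named slot has been looked up, along these lines (the errcode and message wording are illustrative only, not lifted from the patch, and the follow-up below relaxes this requirement again):

    /* Sketch: refuse to acquire a logical slot that belongs to another DB. */
    if (SlotIsLogical(slot) && slot->data.database != MyDatabaseId)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("replication slot \"%s\" was created in a different database",
                        NameStr(slot->data.name))));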
On 27 March 2017 at 14:08, Craig Ringer <craig@2ndquadrant.com> wrote: > So this patch makes ReplicationSlotAcquire check that the slot > database matches the current database and refuse to acquire the slot > if it does not. New patch attached that drops the above requirement, so slots can still be dropped from any DB. This introduces a narrow race window where DROP DATABASE may ERROR if somebody connects to a different database and runs a pg_drop_replication_slot(...) for one of the slots being dropped by DROP DATABASE after we check for active slots but before we've dropped the slot. But it's hard to hit and it's pretty harmless; the worst possible result is dropping one or more of the slots before we ERROR out of the DROP. But you clearly didn't want them anyway, since you were dropping the DB and dropping some slots at the same time. I think this one's ready to go. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 27 March 2017 at 09:03, Craig Ringer <craig@2ndquadrant.com> wrote: > I think this one's ready to go. Looks like something I could commit. Full review by me while offline today, aiming to commit tomorrow barring issues raised. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 27 March 2017 at 16:20, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > On 27 March 2017 at 09:03, Craig Ringer <craig@2ndquadrant.com> wrote: > >> I think this one's ready to go. > > Looks like something I could commit. Full review by me while offline > today, aiming to commit tomorrow barring issues raised. Great. Meanwhile I'm going to be trying to work with Stas on 2PC logical decoding, while firming up the next patches in this series to see if we can progress a bit further. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2017-03-27 16:03:48 +0800, Craig Ringer wrote: > On 27 March 2017 at 14:08, Craig Ringer <craig@2ndquadrant.com> wrote: > > > So this patch makes ReplicationSlotAcquire check that the slot > > database matches the current database and refuse to acquire the slot > > if it does not. > > New patch attached that drops above requirement, so slots can still be > dropped from any DB. > > This introduces a narrow race window where DROP DATABASE may ERROR if > somebody connects to a different database and runs a > pg_drop_replication_slot(...) for one of the slots being dropped by > DROP DATABASE after we check for active slots but before we've dropped > the slot. But it's hard to hit and it's pretty harmless; the worst > possible result is dropping one or more of the slots before we ERROR > out of the DROP. But you clearly didn't want them anyway, since you > were dropping the DB and dropping some slots at the same time. > > I think this one's ready to go. > > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > From 99d5313d3a265bcc57ca6845230b9ec49d188710 Mon Sep 17 00:00:00 2001 > From: Craig Ringer <craig@2ndquadrant.com> > Date: Wed, 22 Mar 2017 13:21:09 +0800 > Subject: [PATCH] Make DROP DATABASE drop logical slots for the DB > > Automatically drop all logical replication slots associated with a > database when the database is dropped. > --- > doc/src/sgml/func.sgml | 3 +- > doc/src/sgml/protocol.sgml | 2 + > src/backend/commands/dbcommands.c | 32 +++++--- > src/backend/replication/slot.c | 88 ++++++++++++++++++++++ > src/include/replication/slot.h | 1 + > src/test/recovery/t/006_logical_decoding.pl | 40 +++++++++- > .../recovery/t/010_logical_decoding_timelines.pl | 30 +++++++- > 7 files changed, 182 insertions(+), 14 deletions(-) > > diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml > index ba6f8dd..78508d7 100644 > --- a/doc/src/sgml/func.sgml > +++ b/doc/src/sgml/func.sgml > @@ -18876,7 +18876,8 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup()); > <entry> > Drops the physical or logical replication slot > named <parameter>slot_name</parameter>. Same as replication protocol > - command <literal>DROP_REPLICATION_SLOT</>. > + command <literal>DROP_REPLICATION_SLOT</>. For logical slots, this must > + be called when connected to the same database the slot was created on. > </entry> > </row> > > diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml > index b3a5026..5f97141 100644 > --- a/doc/src/sgml/protocol.sgml > +++ b/doc/src/sgml/protocol.sgml > @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are: > <para> > Drops a replication slot, freeing any reserved server-side resources. If > the slot is currently in use by an active connection, this command fails. > + If the slot is a logical slot that was created in a database other than > + the database the walsender is connected to, this command fails. > </para> > <variablelist> > <varlistentry> Shouldn't the docs in the drop database section about this? 
> +void > +ReplicationSlotsDropDBSlots(Oid dboid) > +{ > + int i; > + > + if (max_replication_slots <= 0) > + return; > + > +restart: > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > + for (i = 0; i < max_replication_slots; i++) > + { > + ReplicationSlot *s; > + NameData slotname; > + int active_pid; > + > + s = &ReplicationSlotCtl->replication_slots[i]; > + > + /* cannot change while ReplicationSlotCtlLock is held */ > + if (!s->in_use) > + continue; > + > + /* only logical slots are database specific, skip */ > + if (!SlotIsLogical(s)) > + continue; > + > + /* not our database, skip */ > + if (s->data.database != dboid) > + continue; > + > + /* Claim the slot, as if ReplicationSlotAcquire()ing. */ > + SpinLockAcquire(&s->mutex); > + strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN); > + NameStr(slotname)[NAMEDATALEN-1] = '\0'; > + active_pid = s->active_pid; > + if (active_pid == 0) > + { > + MyReplicationSlot = s; > + s->active_pid = MyProcPid; > + } > + SpinLockRelease(&s->mutex); > + > + /* > + * We might fail here if the slot was active. Even though we hold an > + * exclusive lock on the database object a logical slot for that DB can > + * still be active if it's being dropped by a backend connected to > + * another DB or is otherwise acquired. > + * > + * It's an unlikely race that'll only arise from concurrent user action, > + * so we'll just bail out. > + */ > + if (active_pid) > + elog(ERROR, "replication slot %s is in use by pid %d", > + NameStr(slotname), active_pid); > + > + /* > + * To avoid largely duplicating ReplicationSlotDropAcquired() or > + * complicating it with already_locked flags for ProcArrayLock, > + * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we > + * just release our ReplicationSlotControlLock to drop the slot. > + * > + * For safety we'll restart our scan from the beginning each > + * time we release the lock. > + */ > + LWLockRelease(ReplicationSlotControlLock); > + ReplicationSlotDropAcquired(); > + goto restart; > + } > + LWLockRelease(ReplicationSlotControlLock); > + > + /* recompute limits once after all slots are dropped */ > + ReplicationSlotsComputeRequiredXmin(false); > + ReplicationSlotsComputeRequiredLSN(); I was concerned for a second that we'd skip doing ReplicationSlotsComputeRequired* if we ERROR out above - but ReplicationSlotDropAcquired already calls these as necessary. I.e. they should be dropped from here. - Andres
On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote: >> --- a/doc/src/sgml/protocol.sgml >> +++ b/doc/src/sgml/protocol.sgml >> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are: >> <para> >> Drops a replication slot, freeing any reserved server-side resources. If >> the slot is currently in use by an active connection, this command fails. >> + If the slot is a logical slot that was created in a database other than >> + the database the walsender is connected to, this command fails. >> </para> >> <variablelist> >> <varlistentry> > > Shouldn't the docs in the drop database section about this? DROP DATABASE doesn't really discuss all the resources it drops, but I'm happy to add mention of replication slot handling. I just noticed that I failed to remove the docs changes regarding dropping slots becoming db-specific, so I'll post a follow-up for that in a sec. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 29 March 2017 at 08:01, Craig Ringer <craig@2ndquadrant.com> wrote: > I just notice that I failed to remove the docs changes regarding > dropping slots becoming db-specific, so I'll post a follow-up for that > in a sec. Attached. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 29 March 2017 at 08:11, Craig Ringer <craig@2ndquadrant.com> wrote: > On 29 March 2017 at 08:01, Craig Ringer <craig@2ndquadrant.com> wrote: > >> I just notice that I failed to remove the docs changes regarding >> dropping slots becoming db-specific, so I'll post a follow-up for that >> in a sec. > > Attached. ... and here's the next in the patch series. Both this and the immediately prior minor patch fix-drop-slot-docs.patch are pending now. Notable changes in this patch since review: * Split oldestCatalogXmin tracking into separate patch * Critically, fix use of procArray->replication_slot_catalog_xmin in GetSnapshotData's setting of RecentGlobalXmin and RecentGlobalDataXmin so it instead uses ShmemVariableCache->oldestCatalogXmin . This could've led to tuples newer than oldestCatalogXmin being removed. * Memory barrier in UpdateOldestCatalogXmin and SetOldestCatalogXmin. It still does a pre-check before deciding if it needs to take ProcArrayLock, recheck, and advance, since we don't want to unnecessarily contest ProcArrayLock. * Remove unnecessary volatile usage (retained in UpdateOldestCatalogXmin due to barrier) * Remove unnecessary test for XLogInsertAllowed() in XactLogCatalogXminUpdate * EnsureActiveLogicalSlotValid(void) - add (void) * pgidented changes in this diff; have left unrelated changes alone Re: > what does > > + TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state > + * is guaranteed to still exist */ > > mean? I complained about the overall justification in the commit > already, but looking at this commit alone, the justification for this > part of the change is quite hard to understand. The patch now contains TransactionId oldestCatalogXmin; /* oldest xid it is guaranteed to be safe * to create a historic snapshot for; see * also * procArray->replication_slot_catalog_xmin * */ which I think is an improvement. I've also sought to explain the purpose of this change better with /* * If necessary, copy the current catalog_xmin needed by replication slots to * the effective catalog_xmin used for dead tuple removal and write a WAL * record recording the change. * * This allows standbys to know the oldest xid for which it is safe to create * a historic snapshot for logical decoding. VACUUM or other cleanup may have * removed catalog tuple versions needed to correctly decode transactions older * than this threshold. Standbys can use this information to cancel conflicting * decoding sessions and invalidate slots that need discarded information. * * (We can't use the transaction IDs in WAL records emitted by VACUUM etc for * this, since they don't identify the relation as a catalog or not. Nor can a * standby look up the relcache to get the Relation for the affected * relfilenode to check if it is a catalog. The standby would also have no way * to know the oldest safe position at startup if it wasn't in the control * file.) */ void UpdateOldestCatalogXmin(void) { ... Does that help? (Sidenote for later: ResolveRecoveryConflictWithLogicalDecoding will need a read barrier too, when the next patch adds it.) -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
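The RecentGlobalXmin fix called out in the first bullet above amounts to changing which horizon the catalog clamp in GetSnapshotData() uses; roughly (a simplified sketch of the substitution, not the full hunk):

    /*
     * Sketch: clamp by the persistent, WAL-logged oldestCatalogXmin rather
     * than by the in-memory slot catalog_xmin, so catalog tuple versions
     * aren't removed before the new threshold has been logged for standbys.
     */
    if (TransactionIdIsNormal(ShmemVariableCache->oldestCatalogXmin) &&
        NormalTransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
                                    RecentGlobalXmin))
        RecentGlobalXmin = ShmemVariableCache->oldestCatalogXmin;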
On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote: > * Split oldestCatalogXmin tracking into separate patch Regarding this, Simon raised concerns about xlog volume here. It's pretty negligible. We only write a new record when a vacuum runs after catalog_xmin advances on the slot with the currently-lowest catalog_xmin (or, if vacuum doesn't run reasonably soon, when the bgworker next looks). So at worst on a fairly slow moving system or one with a super high vacuum rate we'll write one per commit. But in most cases we'll write a lot fewer than that. When running t/006_logical_decoding.pl for example: $ ../../../src/bin/pg_waldump/pg_waldump tmp_check/data_master_daPa/pgdata/pg_wal/000000010000000000000001 | grep CATALOG rmgr: Transaction len (rec/tot): 4/ 30, tx: 0, lsn: 0/01648D50, prev 0/01648D18, desc: CATALOG_XMIN catalog_xmin 555 rmgr: Transaction len (rec/tot): 4/ 30, tx: 0, lsn: 0/0164C840, prev 0/0164C378, desc: CATALOG_XMIN catalog_xmin 0 pg_waldump: FATAL: error in WAL record at 0/16BBF10: invalid record length at 0/16BBF88: wanted 24, got 0 and of course, none at all unless you use logical decoding. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote: > On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote: > >> * Split oldestCatalogXmin tracking into separate patch > > Regarding this, Simon raised concerns about xlog volume here. > > It's pretty negligible. > > We only write a new record when a vacuum runs after catalog_xmin > advances on the slot with the currently-lowest catalog_xmin (or, if > vacuum doesn't run reasonably soon, when the bgworker next looks). I'd prefer to slow things down a little, not be so eager. If we hold back update of the catalog_xmin until when we run GetRunningTransactionData() we wouldn't need to produce any WAL records at all AND we wouldn't need to have VACUUM do UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra task. That would also make this patch about half the length it is. Let me know what you think. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 29 March 2017 at 23:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: > On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote: >> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote: >> >>> * Split oldestCatalogXmin tracking into separate patch >> >> Regarding this, Simon raised concerns about xlog volume here. >> >> It's pretty negligible. >> >> We only write a new record when a vacuum runs after catalog_xmin >> advances on the slot with the currently-lowest catalog_xmin (or, if >> vacuum doesn't run reasonably soon, when the bgworker next looks). > > I'd prefer to slow things down a little, not be so eager. > > If we hold back update of the catalog_xmin until when we run > GetRunningTransactionData() we wouldn't need to produce any WAL > records at all AND we wouldn't need to have VACUUM do > UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra > task. > > That would also make this patch about half the length it is. > > Let me know what you think. Good idea. We can always add a heuristic later to make xl_running_xacts get emitted more often at high transaction rates if it's necessary. Patch coming soon. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
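One plausible shape for that is to have the catalog_xmin simply ride along in the running-xacts snapshot, e.g. (field name and placement are illustrative, not the committed layout):

    typedef struct xl_running_xacts
    {
        int             xcnt;               /* # of xact ids in xids[] */
        int             subxcnt;            /* # of subxact ids in xids[] */
        bool            subxid_overflow;    /* snapshot overflowed, subxids missing */
        TransactionId   nextXid;            /* xid from ShmemVariableCache->nextXid */
        TransactionId   oldestRunningXid;   /* *not* oldestXmin */
        TransactionId   latestCompletedXid; /* so we can set xmax */

        /* new: slots' catalog_xmin as of when this snapshot was logged */
        TransactionId   oldestCatalogXmin;

        TransactionId   xids[FLEXIBLE_ARRAY_MEMBER];
    } xl_running_xacts;

Redo of the running-xacts record on the standby would then copy that value into ShmemVariableCache->oldestCatalogXmin, instead of relying on separate catalog_xmin advance records.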
On 30 March 2017 at 11:34, Craig Ringer <craig@2ndquadrant.com> wrote: > On 29 March 2017 at 23:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote: >> On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote: >>> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote: >>> >>>> * Split oldestCatalogXmin tracking into separate patch >>> >>> Regarding this, Simon raised concerns about xlog volume here. >>> >>> It's pretty negligible. >>> >>> We only write a new record when a vacuum runs after catalog_xmin >>> advances on the slot with the currently-lowest catalog_xmin (or, if >>> vacuum doesn't run reasonably soon, when the bgworker next looks). >> >> I'd prefer to slow things down a little, not be so eager. >> >> If we hold back update of the catalog_xmin until when we run >> GetRunningTransactionData() we wouldn't need to produce any WAL >> records at all AND we wouldn't need to have VACUUM do >> UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra >> task. >> >> That would also make this patch about half the length it is. >> >> Let me know what you think. > > Good idea. > > We can always add a heuristic later to make xl_running_xacts get > emitted more often at high transaction rates if it's necessary. > > Patch coming soon. Attached. A bit fiddlier than expected, but I like the result more. In the process I identified an issue with both the prior patch and this one where we don't check slot validity for slots that existed on standby prior to promotion of standby to master. We were just assuming that being the master was good enough, since it controls replication_slot_catalog_xmin, but that's not true for pre-existing slots. Fixed by forcing update of the persistent safe catalog xmin after the first slot is created on the master - which is now done by doing an immediate LogStandbySnapshot() after assigning the slot's catalog_xmin. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote: > Attached. * Cleaned up in 3 places * Added code for faked up RunningTransactions in xlog.c * Ensure catalog_xmin doesn't go backwards All else looks good. Comments before commit? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-30 15:26:02 +0100, Simon Riggs wrote: > On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote: > > > Attached. > > * Cleaned up in 3 places > * Added code for faked up RunningTransactions in xlog.c > * Ensure catalog_xmin doesn't go backwards > > All else looks good. Comments before commit? Can you give me till after lunch? - Andres
On 30 March 2017 at 15:27, Andres Freund <andres@anarazel.de> wrote: > On 2017-03-30 15:26:02 +0100, Simon Riggs wrote: >> On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote: >> >> > Attached. >> >> * Cleaned up in 3 places >> * Added code for faked up RunningTransactions in xlog.c >> * Ensure catalog_xmin doesn't go backwards >> >> All else looks good. Comments before commit? > > Can you give me till after lunch? Sure, np -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record) > SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB); > > /* > + * There can be no concurrent writers to oldestCatalogXmin during > + * recovery, so no need to take ProcArrayLock. > + */ > + ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin; s/writers/writes/? > @@ -9731,6 +9748,15 @@ xlog_redo(XLogReaderState *record) > checkPoint.oldestXid)) > SetTransactionIdLimit(checkPoint.oldestXid, > checkPoint.oldestXidDB); > + > + /* > + * There can be no concurrent writers to oldestCatalogXmin during > + * recovery, so no need to take ProcArrayLock. > + */ > + if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, > + checkPoint.oldestCatalogXmin) > + ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin; dito. > @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin, > > ReplicationSlotsComputeRequiredXmin(true); > > + /* > + * If this is the first slot created on the master we won't have a > + * persistent record of the oldest safe xid for historic snapshots yet. > + * Force one to be recorded so that when we go to replay from this slot we > + * know it's safe. > + */ > + force_standby_snapshot = > + !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin); s/first slot/first logical slot/. Also, the reference to master doesn't seem right. > LWLockRelease(ProcArrayLock); > > + /* Update ShmemVariableCache->oldestCatalogXmin */ > + if (force_standby_snapshot) > + LogStandbySnapshot(); The comment and code don't quite square to me - it's far from obvious that LogStandbySnapshot does something like that. I'd even say it's a bad idea to have it do that. > /* > * tell the snapshot builder to only assemble snapshot once reaching the > * running_xact's record with the respective xmin. > @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn, > start_lsn = slot->data.confirmed_flush; > } > > + EnsureActiveLogicalSlotValid(); This seems like it should be in a separate patch, and seperately reviewed. It's code that's currently unreachable (and thus untestable). > +/* > + * Test to see if the active logical slot is usable. > + */ > +static void > +EnsureActiveLogicalSlotValid(void) > +{ > + TransactionId shmem_catalog_xmin; > + > + Assert(MyReplicationSlot != NULL); > + > + /* > + * A logical slot can become unusable if we're doing logical decoding on a > + * standby or using a slot created before we were promoted from standby > + * to master. Neither of those is currently possible... > If the master advanced its global catalog_xmin past the > + * threshold we need it could've removed catalog tuple versions that > + * we'll require to start decoding at our restart_lsn. > + * > + * We need a barrier so that if we decode in recovery on a standby we > + * don't allow new decoding sessions to start after redo has advanced > + * the threshold. > + */ > + if (RecoveryInProgress()) > + pg_memory_barrier(); I don't think this is a meaningful locking protocol. It's a bad idea to use lock-free programming without need, especially when the concurrency protocol isn't well defined. With what other barrier does this pair with? What prevents the data being out of date by the time we actually do the check? 
> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c > index cfc3fba..cdc5f95 100644 > --- a/src/backend/replication/walsender.c > +++ b/src/backend/replication/walsender.c > @@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn) > * be energy wasted - the worst lost information can do here is give us > * wrong information in a statistics view - we'll just potentially be more > * conservative in removing files. > + * > + * We don't have to do any effective_xmin / effective_catalog_xmin testing > + * here either, like for LogicalConfirmReceivedLocation. If we received > + * the xmin and catalog_xmin from downstream replication slots we know they > + * were already confirmed there, > */ > } This comment reads as if LogicalConfirmReceivedLocation had justification for not touching / checking catalog_xmin. But it does. > /* > + * Update our knowledge of the oldest xid we can safely create historic > + * snapshots for. > + * > + * There can be no concurrent writers to oldestCatalogXmin during > + * recovery, so no need to take ProcArrayLock. By now I think is pretty flawed logic, because there can be concurrent readers, that need to be safe against oldestCatalogXmin advancing concurrently. > /* > - * It's important *not* to include the limits set by slots here because > + * It's important *not* to include the xmin set by slots here because > * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those > * were to be included here the initial value could never increase because > * of a circular dependency where slots only increase their limits when > * running xacts increases oldestRunningXid and running xacts only > * increases if slots do. > + * > + * We can include the catalog_xmin limit here; there's no similar > + * circularity, and we need it to log xl_running_xacts records for > + * standbys. > */ Those comments seem to need some more heavyhanded reconciliation. > * > * Return the current slot xmin limits. That's useful to be able to remove > * data that's older than those limits. > + * > + * For logical replication slots' catalog_xmin, we return both the effective This seems to need some light editing. > /* > * Record an enhanced snapshot of running transactions into WAL. > * > + * We also record the value of procArray->replication_slot_catalog_xmin > + * obtained from GetRunningTransactionData here, so standbys know we're about > + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start > + * removing dead catalog tuples below that threshold. I think needs some rephrasing. We're not necessarily about to remove catalog tuples here, nor are we necessarily advancing oldestCatalogXmin. > +static void > +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin) > +{ > + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); > + if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin) > + || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin))) > + ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin; > + LWLockRelease(ProcArrayLock); > +} Doing TransactionIdPrecedes before ensuring ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike me as a good idea. Generally, the expression as it stands is hard to understand. 
> diff --git a/src/include/access/transam.h b/src/include/access/transam.h > index d25a2dd..a4ecfb7 100644 > --- a/src/include/access/transam.h > +++ b/src/include/access/transam.h > @@ -136,6 +136,17 @@ typedef struct VariableCacheData > * aborted */ > > /* > + * This field is protected by ProcArrayLock except > + * during recovery, when it's set unlocked. > + * > + * oldestCatalogXmin is the oldest xid it is > + * guaranteed to be safe to create a historic > + * snapshot for. See also > + * procArray->replication_slot_catalog_xmin > + */ > + TransactionId oldestCatalogXmin; Maybe it'd be better to rephrase that do something like "oldestCatalogXmin guarantees that no valid catalog tuples >= than it are removed. That property is used for logical decoding.". or similar? > /* > * Each page of XLOG file has a header like this: > */ > -#define XLOG_PAGE_MAGIC 0xD097 /* can be used as WAL version indicator */ > +#define XLOG_PAGE_MAGIC 0xD100 /* can be used as WAL version indicator */ We normally only advance this by one, it's not tied to the poistgres version. I'm sorry, but to me this patch isn't ready. I'm also doubtful that it makes a whole lot of sense on its own, without having finished the design for decoding on a standby - it seems quite likely that we'll have to redesign the mechanisms in here a bit for that. For 10 this seems to do nothing but add overhead? Greetings, Andres Freund
On 2017-03-29 08:01:34 +0800, Craig Ringer wrote: > On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote: > > >> --- a/doc/src/sgml/protocol.sgml > >> +++ b/doc/src/sgml/protocol.sgml > >> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are: > >> <para> > >> Drops a replication slot, freeing any reserved server-side resources. If > >> the slot is currently in use by an active connection, this command fails. > >> + If the slot is a logical slot that was created in a database other than > >> + the database the walsender is connected to, this command fails. > >> </para> > >> <variablelist> > >> <varlistentry> > > > > Shouldn't the docs in the drop database section about this? > > DROP DATABASE doesn't really discuss all the resources it drops, but > I'm happy to add mention of replication slots handling. I don't think that's really comparable, because the other things aren't global objects, which replication slots are. - Andres
On 30 March 2017 at 18:16, Andres Freund <andres@anarazel.de> wrote: >> /* >> * Each page of XLOG file has a header like this: >> */ >> -#define XLOG_PAGE_MAGIC 0xD097 /* can be used as WAL version indicator */ >> +#define XLOG_PAGE_MAGIC 0xD100 /* can be used as WAL version indicator */ > > We normally only advance this by one, it's not tied to the poistgres version. That was my addition. I rounded it up cos this is release 10. No biggie. (Poistgres? Is that the Manhattan spelling?) > I'm sorry, but to me this patch isn't ready. I'm also doubtful that it > makes a whole lot of sense on its own, without having finished the > design for decoding on a standby - it seems quite likely that we'll have > to redesign the mechanisms in here a bit for that. For 10 this seems to > do nothing but add overhead? I'm sure we can reword the comments. We've been redesigning the mechanisms for 2 years now, so it seems unlikely that further redesign can be required. If it is required, this patch is fairly low touch and change is possible later, incremental development etc. As regards overhead, this adds a small amount of time to a background process executed every 10 secs, generates no new WAL records. So I don't see any reason not to commit this feature, after the minor corrections. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-30 19:40:08 +0100, Simon Riggs wrote: > On 30 March 2017 at 18:16, Andres Freund <andres@anarazel.de> wrote: > > >> /* > >> * Each page of XLOG file has a header like this: > >> */ > >> -#define XLOG_PAGE_MAGIC 0xD097 /* can be used as WAL version indicator */ > >> +#define XLOG_PAGE_MAGIC 0xD100 /* can be used as WAL version indicator */ > > > > We normally only advance this by one, it's not tied to the poistgres version. > > That was my addition. I rounded it up cos this is release 10. No biggie. We'll probably upgrade that more than once again this release... > (Poistgres? Is that the Manhattan spelling?) Tiredness spelling ;) > We've been redesigning the mechanisms for 2 years now, so it seems > unlikely that further redesign can be required. I don't think that's true *at all* - the mechanism was previously fundamentally different. The whole topic has largely seen activity shortly before the code freeze, both last time round and now. I don't think it's surprising that it thus doesn't end up being ready. > If it is required, > this patch is fairly low touch and change is possible later, > incremental development etc. As regards overhead, this adds a small > amount of time to a background process executed every 10 secs, > generates no new WAL records. > > So I don't see any reason not to commit this feature, after the minor > corrections. It doesn't have any benefit on its own, and the locking model doesn't seem fully there. I don't see much reason to get this in before the release. - Andres
On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-29 08:01:34 +0800, Craig Ringer wrote:
>> On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:
>>
>> >> --- a/doc/src/sgml/protocol.sgml
>> >> +++ b/doc/src/sgml/protocol.sgml
>> >> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
>> >>      <para>
>> >>       Drops a replication slot, freeing any reserved server-side resources. If
>> >>       the slot is currently in use by an active connection, this command fails.
>> >> +     If the slot is a logical slot that was created in a database other than
>> >> +     the database the walsender is connected to, this command fails.
>> >>      </para>
>> >>     <variablelist>
>> >>      <varlistentry>
>> >
>> > Shouldn't the docs in the drop database section mention this?
>>
>> DROP DATABASE doesn't really discuss all the resources it drops, but
>> I'm happy to add mention of replication slots handling.
>
> I don't think that's really comparable, because the other things aren't
> global objects, which replication slots are.

Fine by me.

Patch fix-slot-drop-docs.patch, upthread, adds the passage

+
+  <para>
+   Active <link linkend="logicaldecoding-replication-slots">logical
+   replication slots</> count as connections and will prevent a
+   database from being dropped. Inactive slots will be automatically
+   dropped when the database is dropped.
+  </para>

to the notes section of the DROP DATABASE docs; that should do the trick.

I'm not convinced it's worth going into the exceedingly unlikely race
with concurrent slot drop, and we don't seem to in other places in the
docs, like the race you mentioned with connecting to a db as it's
being dropped.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:

>> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>>      SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>>
>>      /*
>> +     * There can be no concurrent writers to oldestCatalogXmin during
>> +     * recovery, so no need to take ProcArrayLock.
>> +     */
>> +    ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
>
> s/writers/writes/?

I meant writers, i.e. nothing else can be writing to it. But writes works too.

>> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>>
>>     ReplicationSlotsComputeRequiredXmin(true);
>>
>> +   /*
>> +    * If this is the first slot created on the master we won't have a
>> +    * persistent record of the oldest safe xid for historic snapshots yet.
>> +    * Force one to be recorded so that when we go to replay from this slot we
>> +    * know it's safe.
>> +    */
>> +   force_standby_snapshot =
>> +       !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
>
> s/first slot/first logical slot/. Also, the reference to master doesn't
> seem right.

Unsure what you mean re reference to master not seeming right.

If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
from the new slot so we need to make sure it gets advanced once we've
decided on our starting catalog_xmin.

>>     LWLockRelease(ProcArrayLock);
>>
>> +   /* Update ShmemVariableCache->oldestCatalogXmin */
>> +   if (force_standby_snapshot)
>> +       LogStandbySnapshot();
>
> The comment and code don't quite square to me - it's far from obvious
> that LogStandbySnapshot does something like that. I'd even say it's a
> bad idea to have it do that.

So you prefer the prior approach with separate xl_catalog_xmin advance records?

I don't have much preference; I liked the small code reduction of
Simon's proposed approach, but it landed up being a bit awkward in
terms of ordering and locking. I don't think catalog_xmin tracking is
really closely related to the standby snapshot stuff and it feels a
bit like it's a tacked-on afterthought where it is now.

>>     /*
>>      * tell the snapshot builder to only assemble snapshot once reaching the
>>      * running_xact's record with the respective xmin.
>> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>>         start_lsn = slot->data.confirmed_flush;
>>     }
>>
>> +   EnsureActiveLogicalSlotValid();
>
> This seems like it should be in a separate patch, and separately
> reviewed. It's code that's currently unreachable (and thus untestable).

It is reached and is run; those checks run whenever creating a
non-initial decoding context on master or replica.

The failure case is reachable if a replica has hot_standby_feedback
off, or it's not using a physical slot and loses its connection. If
promoted, any slot existing on that replica (from a file system level
copy when the replica was created) will fail.

I agree it's contrived since we can't create and maintain slots on
replicas, which is the main point of it.

>> +/*
>> + * Test to see if the active logical slot is usable.
>> + */
>> +static void
>> +EnsureActiveLogicalSlotValid(void)
>> +{
>> +   TransactionId shmem_catalog_xmin;
>> +
>> +   Assert(MyReplicationSlot != NULL);
>> +
>> +   /*
>> +    * A logical slot can become unusable if we're doing logical decoding on a
>> +    * standby or using a slot created before we were promoted from standby
>> +    * to master.
>
> Neither of those is currently possible...

Right. Because it's foundations for decoding on standby.

>>  If the master advanced its global catalog_xmin past the
>> +    * threshold we need it could've removed catalog tuple versions that
>> +    * we'll require to start decoding at our restart_lsn.
>> +    *
>> +    * We need a barrier so that if we decode in recovery on a standby we
>> +    * don't allow new decoding sessions to start after redo has advanced
>> +    * the threshold.
>> +    */
>> +   if (RecoveryInProgress())
>> +       pg_memory_barrier();
>
> I don't think this is a meaningful locking protocol. It's a bad idea to
> use lock-free programming without need, especially when the concurrency
> protocol isn't well defined.

Yeah. The intended interaction is with recovery conflict on standby,
which doesn't look likely to land in this release due to patch
split/cleanup etc. (Not a complaint). Better to just take a brief
shared ProcArrayLock (sketched after this mail).

>> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
>> index cfc3fba..cdc5f95 100644
>> --- a/src/backend/replication/walsender.c
>> +++ b/src/backend/replication/walsender.c
>> @@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
>>      * be energy wasted - the worst lost information can do here is give us
>>      * wrong information in a statistics view - we'll just potentially be more
>>      * conservative in removing files.
>> +     *
>> +     * We don't have to do any effective_xmin / effective_catalog_xmin testing
>> +     * here either, like for LogicalConfirmReceivedLocation. If we received
>> +     * the xmin and catalog_xmin from downstream replication slots we know they
>> +     * were already confirmed there,
>>      */
>>  }
>
> This comment reads as if LogicalConfirmReceivedLocation had
> justification for not touching / checking catalog_xmin. But it does.

It touches it; what it doesn't do is test and only advance if the new
value is greater, like for xmin as referenced in the prior paragraph.
Will clarify.

>>  /*
>> + * Update our knowledge of the oldest xid we can safely create historic
>> + * snapshots for.
>> + *
>> + * There can be no concurrent writers to oldestCatalogXmin during
>> + * recovery, so no need to take ProcArrayLock.
>
> By now I think this is pretty flawed logic, because there can be
> concurrent readers that need to be safe against oldestCatalogXmin
> advancing concurrently.

You're right, we'll need a lock or suitable barriers here to ensure
that slot conflict with recovery and startup of new decoding sessions
doesn't see outdated values. This would be the peer of the
pg_memory_barrier() above. Or could just take a lock; there's enough
other locking activity in redo that it should be fine.

>>     /*
>> -    * It's important *not* to include the limits set by slots here because
>> +    * It's important *not* to include the xmin set by slots here because
>>      * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>>      * were to be included here the initial value could never increase because
>>      * of a circular dependency where slots only increase their limits when
>>      * running xacts increases oldestRunningXid and running xacts only
>>      * increases if slots do.
>> +    *
>> +    * We can include the catalog_xmin limit here; there's no similar
>> +    * circularity, and we need it to log xl_running_xacts records for
>> +    * standbys.
>>      */
>
> Those comments seem to need some more heavyhanded reconciliation.

OK. To me it seems clear that the first refers to xmin, the second to
catalog_xmin. But after all I wrote it, and the important thing is
what it says to people who are not me. Will adjust.

>>  *
>>  * Return the current slot xmin limits. That's useful to be able to remove
>>  * data that's older than those limits.
>> + *
>> + * For logical replication slots' catalog_xmin, we return both the effective
>
> This seems to need some light editing.

catalog_xmin => catalog_xmins I guess.

>>  /*
>>   * Record an enhanced snapshot of running transactions into WAL.
>>   *
>> + * We also record the value of procArray->replication_slot_catalog_xmin
>> + * obtained from GetRunningTransactionData here, so standbys know we're about
>> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
>> + * removing dead catalog tuples below that threshold.
>
> I think this needs some rephrasing. We're not necessarily about to
> remove catalog tuples here, nor are we necessarily advancing
> oldestCatalogXmin.

Agreed

>> +static void
>> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
>> +{
>> +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>> +   if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
>> +       || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin)))
>> +       ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
>> +   LWLockRelease(ProcArrayLock);
>> +}
>
> Doing TransactionIdPrecedes before ensuring
> ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
> me as a good idea. Generally, the expression as it stands is hard to
> understand.

OK.

I found other formulations to be long and hard to read. Expressing it
as "if validity has changed or value has increased" made more sense.
Agree order should change.

>> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
>> index d25a2dd..a4ecfb7 100644
>> --- a/src/include/access/transam.h
>> +++ b/src/include/access/transam.h
>> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>>                                              * aborted */
>>
>>     /*
>> +    * This field is protected by ProcArrayLock except
>> +    * during recovery, when it's set unlocked.
>> +    *
>> +    * oldestCatalogXmin is the oldest xid it is
>> +    * guaranteed to be safe to create a historic
>> +    * snapshot for. See also
>> +    * procArray->replication_slot_catalog_xmin
>> +    */
>> +   TransactionId oldestCatalogXmin;
>
> Maybe it'd be better to rephrase that to something like
> "oldestCatalogXmin guarantees that no valid catalog tuples >= than it
> are removed. That property is used for logical decoding." or similar?

Fine by me.

I'll adjust this per discussion and per a comment Simon made
separately. Whether we use it right away or not it's worth having it
updated while it's still freshly in mind.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
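[Editorial note: for readers following along, here is a minimal sketch of what the "brief shared ProcArrayLock" variant of EnsureActiveLogicalSlotValid() discussed above could look like. This is not the code from the posted patches: the error message, the error code, and the exact comparison against the slot's catalog_xmin are illustrative assumptions only, and the sketch presumes the usual backend headers the real check would live alongside.]

#include "postgres.h"

#include "access/transam.h"
#include "replication/slot.h"
#include "storage/lwlock.h"
#include "storage/procarray.h"

/*
 * Sketch only (not from the posted patch): check that the active logical
 * slot is still usable, taking a brief shared ProcArrayLock instead of
 * relying on a memory barrier, per the discussion above.
 */
static void
EnsureActiveLogicalSlotValid(void)
{
	TransactionId	shmem_catalog_xmin;
	TransactionId	slot_catalog_xmin;

	Assert(MyReplicationSlot != NULL);

	slot_catalog_xmin = MyReplicationSlot->data.catalog_xmin;

	/* Read the oldest-safe-catalog-xid threshold under shared lock. */
	LWLockAcquire(ProcArrayLock, LW_SHARED);
	shmem_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
	LWLockRelease(ProcArrayLock);

	/*
	 * If no threshold has been established yet, or our slot's catalog_xmin
	 * has fallen behind it, catalog tuples needed for historic snapshots
	 * may already have been removed, so refuse to start decoding.
	 */
	if (!TransactionIdIsValid(shmem_catalog_xmin) ||
		TransactionIdPrecedes(slot_catalog_xmin, shmem_catalog_xmin))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("replication slot \"%s\" requires catalog rows that may already have been removed",
						NameStr(MyReplicationSlot->data.name))));
}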
On 31 March 2017 at 12:49, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:
>> The comment and code don't quite square to me - it's far from obvious
>> that LogStandbySnapshot does something like that. I'd even say it's a
>> bad idea to have it do that.
>
> So you prefer the prior approach with separate xl_catalog_xmin advance records?

Alternately, we can record the creation timeline on slots, so we know
if there's been a promotion.

It wouldn't make sense to do this if that were the only use of
timelines on slots. But I'm aware you'd rather keep slots
timeline-agnostic and I tend to agree. Anyway, per your advice will
split out the validation step.

(I'd like replication origins to be able to track time alongside lsn,
and to pass the timeline of each LSN to output plugin callbacks so we
can detect if a physical promotion results in us backtracking down a
fork in history, but that doesn't affect slots.)

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 31 March 2017 at 12:49, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:
>>> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>>>      SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>>>
>>>      /*
>>> +     * There can be no concurrent writers to oldestCatalogXmin during
>>> +     * recovery, so no need to take ProcArrayLock.
>>> +     */
>>> +    ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
>>
>> s/writers/writes/?
>
> I meant writers, i.e. nothing else can be writing to it. But writes works too.

Fixed.

>>> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>>>
>>>     ReplicationSlotsComputeRequiredXmin(true);
>>>
>>> +   /*
>>> +    * If this is the first slot created on the master we won't have a
>>> +    * persistent record of the oldest safe xid for historic snapshots yet.
>>> +    * Force one to be recorded so that when we go to replay from this slot we
>>> +    * know it's safe.
>>> +    */
>>> +   force_standby_snapshot =
>>> +       !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
>>
>> s/first slot/first logical slot/. Also, the reference to master doesn't
>> seem right.
>
> Unsure what you mean re reference to master not seeming right.
>
> If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
> from the new slot so we need to make sure it gets advanced once we've
> decided on our starting catalog_xmin.

Moved to next patch, will address there.

>>>     LWLockRelease(ProcArrayLock);
>>>
>>> +   /* Update ShmemVariableCache->oldestCatalogXmin */
>>> +   if (force_standby_snapshot)
>>> +       LogStandbySnapshot();
>>
>> The comment and code don't quite square to me - it's far from obvious
>> that LogStandbySnapshot does something like that. I'd even say it's a
>> bad idea to have it do that.
>
> So you prefer the prior approach with separate xl_catalog_xmin advance records?
>
> I don't have much preference; I liked the small code reduction of
> Simon's proposed approach, but it landed up being a bit awkward in
> terms of ordering and locking. I don't think catalog_xmin tracking is
> really closely related to the standby snapshot stuff and it feels a
> bit like it's a tacked-on afterthought where it is now.

This code moved to next patch. But we do need to agree on the best
approach. If we're not going to force a standby snapshot here, then
it's probably better to use the separate xl_catalog_xmin design
instead.

>>>     /*
>>>      * tell the snapshot builder to only assemble snapshot once reaching the
>>>      * running_xact's record with the respective xmin.
>>> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>>>         start_lsn = slot->data.confirmed_flush;
>>>     }
>>>
>>> +   EnsureActiveLogicalSlotValid();
>>
>> This seems like it should be in a separate patch, and separately
>> reviewed. It's code that's currently unreachable (and thus untestable).
>
> It is reached and is run; those checks run whenever creating a
> non-initial decoding context on master or replica.

Again, moved to next patch.

>>>  /*
>>> + * Update our knowledge of the oldest xid we can safely create historic
>>> + * snapshots for.
>>> + *
>>> + * There can be no concurrent writers to oldestCatalogXmin during
>>> + * recovery, so no need to take ProcArrayLock.
>>
>> By now I think this is pretty flawed logic, because there can be
>> concurrent readers that need to be safe against oldestCatalogXmin
>> advancing concurrently.
>
> You're right, we'll need a lock or suitable barriers here to ensure
> that slot conflict with recovery and startup of new decoding sessions
> doesn't see outdated values. This would be the peer of the
> pg_memory_barrier() above. Or could just take a lock; there's enough
> other locking activity in redo that it should be fine.

Now takes ProcArrayLock briefly.

oldestCatalogXmin is also used in GetOldestSafeDecodingTransactionId,
and there we want to prevent it from being advanced. But on further
thought, relying on oldestCatalogXmin there is actually unsafe; on the
master, we might've already logged our intent to advance it to some
greater value of procArray->replication_slot_catalog_xmin and will do
so as ProcArrayLock is released. On standby we're also better off
relying on procArray->replication_slot_catalog_xmin since that's what
we'll be sending in feedback.

Went back to using replication_slot_catalog_xmin here, with the comment

 *
 * We don't use ShmemVariableCache->oldestCatalogXmin here because another
 * backend may have already logged its intention to advance it to a higher
 * value (still <= replication_slot_catalog_xmin) and just be waiting on
 * ProcArrayLock to actually apply the change. On a standby
 * replication_slot_catalog_xmin is what the walreceiver will be sending in
 * hot_standby_feedback, not oldestCatalogXmin.
 */

>>>     /*
>>> -    * It's important *not* to include the limits set by slots here because
>>> +    * It's important *not* to include the xmin set by slots here because
>>>      * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>>>      * were to be included here the initial value could never increase because
>>>      * of a circular dependency where slots only increase their limits when
>>>      * running xacts increases oldestRunningXid and running xacts only
>>>      * increases if slots do.
>>> +    *
>>> +    * We can include the catalog_xmin limit here; there's no similar
>>> +    * circularity, and we need it to log xl_running_xacts records for
>>> +    * standbys.
>>>      */
>>
>> Those comments seem to need some more heavyhanded reconciliation.
>
> OK. To me it seems clear that the first refers to xmin, the second to
> catalog_xmin. But after all I wrote it, and the important thing is
> what it says to people who are not me. Will adjust.

Changed to

 * We can safely report the catalog_xmin limit for replication slots here
 * because it's only used to advance oldestCatalogXmin. Slots'
 * catalog_xmin advance does not depend on it so there's no circularity.

>>>  *
>>>  * Return the current slot xmin limits. That's useful to be able to remove
>>>  * data that's older than those limits.
>>> + *
>>> + * For logical replication slots' catalog_xmin, we return both the effective
>>
>> This seems to need some light editing.
>
> catalog_xmin => catalog_xmins I guess.

Amended.

>>>  /*
>>>   * Record an enhanced snapshot of running transactions into WAL.
>>>   *
>>> + * We also record the value of procArray->replication_slot_catalog_xmin
>>> + * obtained from GetRunningTransactionData here, so standbys know we're about
>>> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
>>> + * removing dead catalog tuples below that threshold.
>>
>> I think this needs some rephrasing. We're not necessarily about to
>> remove catalog tuples here, nor are we necessarily advancing
>> oldestCatalogXmin.
>
> Agreed

 * We also record the value of procArray->replication_slot_catalog_xmin
 * obtained from GetRunningTransactionData here. We intend to advance
 * ShmemVariableCache->oldestCatalogXmin to it once standbys have been informed
 * of the new value, which will permit removal of previously-protected dead
 * catalog tuples. The standby needs to know about that before any WAL
 * records containing such tuple removals could possibly arrive.

>>> +static void
>>> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
>>> +{
>>> +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>>> +   if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
>>> +       || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) != TransactionIdIsValid(pendingOldestCatalogXmin)))
>>> +       ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
>>> +   LWLockRelease(ProcArrayLock);
>>> +}
>>
>> Doing TransactionIdPrecedes before ensuring
>> ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
>> me as a good idea. Generally, the expression as it stands is hard to
>> understand.
>
> OK.
>
> I found other formulations to be long and hard to read. Expressing it
> as "if validity has changed or value has increased" made more sense.
> Agree order should change.

Re-ordered, otherwise left the same (see the sketch after this mail).
Could add a comment like "we must set oldestCatalogXmin if its validity
has changed or it is advancing", but that seems rather redundant to the
code.

>>> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
>>> index d25a2dd..a4ecfb7 100644
>>> --- a/src/include/access/transam.h
>>> +++ b/src/include/access/transam.h
>>> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>>>                                              * aborted */
>>>
>>>     /*
>>> +    * This field is protected by ProcArrayLock except
>>> +    * during recovery, when it's set unlocked.
>>> +    *
>>> +    * oldestCatalogXmin is the oldest xid it is
>>> +    * guaranteed to be safe to create a historic
>>> +    * snapshot for. See also
>>> +    * procArray->replication_slot_catalog_xmin
>>> +    */
>>> +   TransactionId oldestCatalogXmin;
>>
>> Maybe it'd be better to rephrase that to something like
>> "oldestCatalogXmin guarantees that no valid catalog tuples >= than it
>> are removed. That property is used for logical decoding." or similar?
>
> Fine by me.
>
> I'll adjust this per discussion and per a comment Simon made
> separately. Whether we use it right away or not it's worth having it
> updated while it's still freshly in mind.

OK, updated catalog_xmin logging patch attached.

Important fix included: when faking up a RunningTransactions snapshot
in StartupXLOG for replay of shutdown checkpoints, copy the
checkpoint's oldestCatalogXmin so we apply it instead of clobbering the
replica's value. It's kind of roundabout to set this once when we apply
the checkpoint and again via ProcArrayApplyRecoveryInfo, but it's
necessary if we're using xl_running_xacts to carry oldestCatalogXmin
info.

Found another issue too. We log our intention to increase
oldestCatalogXmin in LogStandbySnapshot when we write xl_running_xacts.
We then release ProcArrayLock and re-acquire it LW_EXCLUSIVE so we can
increment oldestCatalogXmin in shmem. But a checkpoint can run and copy
the old oldestCatalogXmin value after we've written xlog but before
we've updated it in shmem. On the standby, redo will apply the new
value then clobber it with the old one.

To fix this, take CheckpointLock in LogStandbySnapshot (if not called
during a checkpoint) so we can't have the xl_running_xacts with the new
oldestCatalogXmin land up in WAL before a checkpoint with an older
value. Also take oldestCatalogXmin's value after we've forced
LogStandbySnapshot in a checkpoint.

Extended the tests a bit to cover redo on standbys.

Personally I'm not a huge fan of how integrating this with logging
standby snapshots has turned out. It seemed to make sense initially,
but I think the way it works out is more convoluted than necessary for
little benefit. I'll prep an updated version of the
xl_advance_catalog_xmin patch with the same fixes for side by side
comparison.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
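[Editorial note: as referenced in the mail above, here is one way the re-ordered UpdateOldestCatalogXmin() test could be written so the validity check happens before TransactionIdPrecedes() is reached. This is an illustrative sketch only; the v5/v6 patches may express it differently.]

/*
 * Sketch of the re-ordered test: set oldestCatalogXmin if its validity
 * has changed, or if both values are valid and the new value is an
 * advance. Illustrative only, not the posted patch code.
 */
static void
UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
{
	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

	if (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) !=
		TransactionIdIsValid(pendingOldestCatalogXmin))
	{
		/* Validity changed: establish or clear the threshold. */
		ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
	}
	else if (TransactionIdIsValid(pendingOldestCatalogXmin) &&
			 TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
								   pendingOldestCatalogXmin))
	{
		/* Both valid and the pending value is an advance. */
		ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
	}

	LWLockRelease(ProcArrayLock);
}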
On 3 April 2017 at 13:46, Craig Ringer <craig@2ndquadrant.com> wrote:
> OK, updated catalog_xmin logging patch attached.

Ahem, that should be v5.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 3 April 2017 at 15:27, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 3 April 2017 at 13:46, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> OK, updated catalog_xmin logging patch attached.
>
> Ahem, that should be v5.

... and here's v6, which returns to the separate
xl_xact_catalog_xmin_advance approach. pgindented. This is what I
favour proceeding with.

Now updating/amending the recovery conflict patch.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hi all

Here's the final set of three patches on top of what's already committed.

The first is catalog_xmin logging, which is unchanged from the prior post.

The 2nd is support for conflict with recovery, with changes that
should address Andres's concerns there.

The 3rd actually enables decoding on standby. Unlike the prior
version, no attempt is made to check the walsender configuration for
slot use, etc. The ugly code to try to mitigate races is also removed.
Instead, if wal_level is logical the catalog_xmin sent by
hot_standby_feedback is now the same as the xmin if there's no local
slot holding it down. So we're always sending a catalog_xmin in
feedback and we should always expect to have a valid local
oldestCatalogXmin once hot_standby_feedback kicks in. This makes the
race in slot creation no worse than the existing race between
hot_standby_feedback establishment and the first queries run on a
downstream, albeit with more annoying consequences. Apps can still
ensure a slot created on standby is guaranteed safe and conflict-free
by having a slot on the master first.

I'm much happier with this. I'm still fixing some issues in the tests
for 03 and tidying them up, but 03 should allow 01 and 02 to be
reviewed in their proper context now.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
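[Editorial note: to make the feedback rule described above concrete, here is a rough sketch of the hot_standby_feedback xmin/catalog_xmin computation: always report a catalog_xmin when wal_level is logical, falling back to the feedback xmin when no local slot pins a catalog_xmin. This is not the actual walreceiver code; the function name is invented for illustration and GetOldestXmin()'s argument list varies between versions, so treat that call as an assumption.]

/*
 * Illustrative sketch only -- not the walreceiver's real code path.
 * Computes the xmin and catalog_xmin a standby would report in
 * hot_standby_feedback under the rule described above.
 */
static void
ComputeFeedbackXmins(TransactionId *xmin, TransactionId *catalog_xmin)
{
	TransactionId	slot_xmin;

	/* Oldest xmin needed by local snapshots (signature varies by version). */
	*xmin = GetOldestXmin(NULL, false);

	/* xmin / catalog_xmin currently pinned by local replication slots. */
	ProcArrayGetReplicationSlotXmin(&slot_xmin, catalog_xmin);

	if (TransactionIdIsValid(slot_xmin) &&
		TransactionIdPrecedes(slot_xmin, *xmin))
		*xmin = slot_xmin;

	/*
	 * With wal_level = logical, always report a catalog_xmin: if no local
	 * slot is holding one down, use the xmin itself, so the upstream always
	 * has a valid value to record in oldestCatalogXmin.
	 */
	if (wal_level >= WAL_LEVEL_LOGICAL &&
		!TransactionIdIsValid(*catalog_xmin))
		*catalog_xmin = *xmin;
}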
On 4 April 2017 at 22:32, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> Here's the final set of three patches on top of what's already committed.
>
> The first is catalog_xmin logging, which is unchanged from the prior post.
>
> The 2nd is support for conflict with recovery, with changes that
> should address Andres's concerns there.
>
> The 3rd actually enables decoding on standby. Unlike the prior
> version, no attempt is made to check the walsender configuration for
> slot use, etc. The ugly code to try to mitigate races is also removed.
> Instead, if wal_level is logical the catalog_xmin sent by
> hot_standby_feedback is now the same as the xmin if there's no local
> slot holding it down. So we're always sending a catalog_xmin in
> feedback and we should always expect to have a valid local
> oldestCatalogXmin once hot_standby_feedback kicks in. This makes the
> race in slot creation no worse than the existing race between
> hot_standby_feedback establishment and the first queries run on a
> downstream, albeit with more annoying consequences. Apps can still
> ensure a slot created on standby is guaranteed safe and conflict-free
> by having a slot on the master first.
>
> I'm much happier with this. I'm still fixing some issues in the tests
> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> reviewed in their proper context now.

Dammit. Attached.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
> I'm much happier with this. I'm still fixing some issues in the tests
> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> reviewed in their proper context now.

To me this very clearly is too late for v10, and now should be moved to
the next CF.

- Andres
On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
>> I'm much happier with this. I'm still fixing some issues in the tests
>> for 03 and tidying them up, but 03 should allow 01 and 02 to be
>> reviewed in their proper context now.
>
> To me this very clearly is too late for v10, and now should be moved to
> the next CF.

I tend to agree that it's late in the piece. It's still worth cleaning
it up into a state ready for early pg11 though.

I've just fixed an issue where hot_standby_feedback on a physical slot
could cause oldestCatalogXmin to go backwards: when the slot's
catalog_xmin was 0 and was being set for the first time, the standby's
supplied catalog_xmin was simply trusted. To fix it, in
PhysicalReplicationSlotNewXmin, when setting catalog_xmin from 0, clamp
the value to the master's GetOldestSafeDecodingTransactionId().

Tests are cleaned up and fixed.

This series adds full support for logical decoding on a standby.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
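[Editorial note: a rough sketch of the clamp described above, as it might appear where a physical slot's catalog_xmin is first set from standby feedback. Variable names and the surrounding structure are guesses for illustration, not the code of PhysicalReplicationSlotNewXmin(); GetOldestSafeDecodingTransactionId() is called with no arguments as in the mail, which may differ from the tree.]

/*
 * Sketch only: when the physical slot's catalog_xmin is being set for the
 * first time (it was InvalidTransactionId), don't trust the standby's value
 * blindly; clamp it to the master's oldest safe decoding xid so
 * oldestCatalogXmin can never move backwards.
 */
if (!TransactionIdIsValid(slot->data.catalog_xmin) &&
	TransactionIdIsValid(feedbackCatalogXmin))
{
	TransactionId	safe_decoding_xid = GetOldestSafeDecodingTransactionId();

	if (TransactionIdPrecedes(feedbackCatalogXmin, safe_decoding_xid))
		feedbackCatalogXmin = safe_decoding_xid;
}
slot->data.catalog_xmin = feedbackCatalogXmin;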
On 2017-04-05 17:18:24 +0800, Craig Ringer wrote:
> On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
> >> I'm much happier with this. I'm still fixing some issues in the tests
> >> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> >> reviewed in their proper context now.
> >
> > To me this very clearly is too late for v10, and now should be moved to
> > the next CF.
>
> I tend to agree that it's late in the piece. It's still worth cleaning
> it up into a state ready for early pg11 though.

Totally agreed.

- Andres
On Wed, Apr 5, 2017 at 10:32 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-05 17:18:24 +0800, Craig Ringer wrote:
>> On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:
>> > On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
>> >> I'm much happier with this. I'm still fixing some issues in the tests
>> >> for 03 and tidying them up, but 03 should allow 01 and 02 to be
>> >> reviewed in their proper context now.
>> >
>> > To me this very clearly is too late for v10, and now should be moved to
>> > the next CF.
>>
>> I tend to agree that it's late in the piece. It's still worth cleaning
>> it up into a state ready for early pg11 though.
>
> Totally agreed.

Based on this exchange, marked as "Moved to next CF".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company