Thread: Minimal logical decoding on standbys

Minimal logical decoding on standbys

From

Andres Freund

Date:

12 December 2018, 20:41:54

Hi,

Craig previously worked on $subject, see thread [1]. A bunch of the
prerequisite features from that and other related threads have been
integrated into PG. What's missing is actually allowing logical
decoding on a standby. The latest patch from that thread does that [2],
but unfortunately hasn't been updated after slipping v10.

The biggest remaining issue to allow it is that the catalog xmin on the
primary has to be above the catalog xmin horizon of all slots on the
standby. The patch in [2] does so by periodically logging a new record
that announces the current catalog xmin horizon. Additionally it
checks that hot_standby_feedback is enabled when doing logical decoding
from a standby.

I don't like the approach of managing the catalog horizon via those
periodically logged catalog xmin announcements. I think we instead
should build ontop of the records we already have and use to compute
snapshot conflicts. As of HEAD we don't know whether such tables are
catalog tables, but that's just a bool that we need to include in the
records, a basically immeasurable overhead given the size of those
records.

I also don't think we should actually enforce hot_standby_feedback being
enabled - there's use-cases where that's not convenient, and it's not
bullet proof anyway (can be enabled/disabled without using logical
decoding inbetween). I think when there's a conflict we should have the
HINT mention that hs_feedback can be used to prevent such conflicts,
that ought to be enough.

Attached is a rough draft patch. If we were to go for this approach,
we'd obviously need to improve the actual conflict handling against
slots - right now it just logs a WARNING and retries shortly after.

I think there's currently one hole in this approach. Nbtree (and other
index types, which are pretty unlikely to matter here) have this logic
to handle snapshot conflicts for single-page deletions:

/*
* If we have any conflict processing to do, it must happen before we
* update the page.
*
* Btree delete records can conflict with standby queries. You might
* think that vacuum records would conflict as well, but we've handled
* that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
* cleaned by the vacuum of the heap and so we can resolve any conflicts
* just once when that arrives. After that we know that no conflicts
* exist from individual btree vacuum records on that index.
*/
if (InHotStandby)
{
TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
RelFileNode rnode;

XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);

ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
xlrec->onCatalogTable, rnode);
}

I.e. we get the latest removed xid from the heap, which has the
following logic:
/*
* If there's nothing running on the standby we don't need to derive a
* full latestRemovedXid value, so use a fast path out of here. This
* returns InvalidTransactionId, and so will conflict with all HS
* transactions; but since we just worked out that that's zero people,
* it's OK.
*
* XXX There is a race condition here, which is that a new backend might
* start just after we look. If so, it cannot need to conflict, but this
* coding will result in throwing a conflict anyway.
*/
if (CountDBBackends(InvalidOid) == 0)
return latestRemovedXid;

/*
* In what follows, we have to examine the previous state of the index
* page, as well as the heap page(s) it points to. This is only valid if
* WAL replay has reached a consistent database state; which means that
* the preceding check is not just an optimization, but is *necessary*. We
* won't have let in any user sessions before we reach consistency.
*/
if (!reachedConsistency)
elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data");

so we wouldn't get a correct xid when not if nobody is connected to a
database (and by implication when not yet consistent).

I'm wondering if it's time to move the latestRemovedXid computation for
this type of record to the primary - it's likely to be cheaper there and
avoids this kind of complication. Secondarily, it'd have the advantage
of making pluggable storage integration easier - there we have the
problem that we don't know which type of relation we're dealing with
during recovery, so such lookups make pluggability harder (zheap just
adds extra flags to signal that, but that's not extensible).

Another alternative would be to just prevent such index deletions for
catalog tables when wal_level = logical.

If we were to go with this approach, there'd be at least the following
tasks:
- adapt tests from [2]
- enforce hot-standby to be enabled on the standby when logical slots
are created, and at startup if a logical slot exists
- fix issue around btree_xlog_delete_get_latestRemovedXid etc mentioned
above.
- Have a nicer conflict handling than what I implemented here. Craig's
approach deleted the slots, but I'm not sure I like that. Blocking
seems more appropriately here, after all it's likely that the
replication topology would be broken afterwards.
- get_rel_logical_catalog() shouldn't be in lsyscache.[ch], and can be
optimized (e.g. check wal_level before opening rel etc).

Once we have this logic, it can be used to implement something like
failover slots on-top, by having having a mechanism that occasionally
forwards slots on standbys using pg_replication_slot_advance().

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/CAMsr+YEVmBJ=dyLw=+kTihmUnGy5_EW4Mig5T0maieg_Zu=XCg@mail.gmail.com
[2]
https://archives.postgresql.org/message-id/CAMsr%2BYEbS8ZZ%2Bw18j7OPM2MZEeDtGN9wDVF68%3DMzpeW%3DKRZZ9Q%40mail.gmail.com

Thread: Minimal logical decoding on standbys

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment

Attachment