Thread: Logical decoding on standby

Logical decoding on standby

From

Craig Ringer

Date:

21 November 2016, 08:18:14

Hi all

I've prepared a working initial, somewhat raw implementation for
logical decoding on physical standbys. Since it's a series of 20
smallish patches at the moment I won't attach it. You can find the
current version at time of writing here:

https://github.com/postgres/postgres/compare/c5f365f3ab...2ndQuadrant:dev/logical-decoding-on-standby-pg10-v1?expand=1

i.e the tag    dev/logical-decoding-on-standby-pg10-v1   in github
repo 2ndQuadrant/postgres.

and whatever I'm working on (subject to rebase, breakage, etc) lives
in the branch dev/logical-decoding-on-standby-pg10 .

Quickstart
===

Compile and install like usual; make sure to install test_decoding
too. To see the functionality in action, configure with
--enable-tap-tests and:
   make -C src/test/recovery check

To try manually, initdb a master, set pg_hba.conf to 'trust' on all
replication connections, append to postgresql.conf:
   wal_level = 'logical'   max_replication_slots = 4   max_wal_senders = 4   hot_standby_feedback = on

then start the master. Now
   psql -d 'master_connstr' -c "SELECT
pg_create_physical_replication_slot('standby1');"

and
   pg_basebackup -d 'master_connstr' -X stream -R --slot=standby1

and start the replica.

You can now use pg_recvlogical to create a slot on the replica and
decode changes from it, e.g.
   pg_recvlogical -d 'replica_connstr' -S test -P test_decoding --create-slot   pg_recvlogical -d 'replica_connstr' -S
'test'-f - --start
 

and you'll (hopefully) see subsequent changes you make on the master.
If not, tell me.


Patch series contents
===

This patch series incorporates the following changes:

* Timeline following for logical slots, so they can start decoding on
the correct timeline and follow timeline switches (with tests);
originally [3]

* Add --endpos to pg_recvlogical, with tests; originally [3]

* Splitting of xmin and catalog_xmin on hot standby feedback, so
logical slots on a replica only hold down catalog_xmin on the master,
not the xmin for user tables (with tests). Minimises upstream bloat;
originally [4]

* Suppress export of snapshot when starting logical decoding on
replica. Since we cannot allocate an xid, we cannot export snapshots
on standby. Decoding clients can still do their initial setup via a
slot on the master then switch over, do it via physical copy, etc.

* Require hot_standby_feedback to be enabled when starting logical
decoding on a standby.

* Drop any replication slots from a database when redo'ing database
drop, so we don't leave dangling slots on the replica (with tests).

* Make the walsender respect SIGUSR1 and exit via
RecoveryConflictInterrupt() when it gets
PROCSIG_RECOVERY_CONFLICT_DATABASE (with tests); see [6]

* PostgresNode.pm enhancements for the tests

* New test coverage for logical decoding on standby

Remaining issues
===

* The method used to make the walsender respect conflict with recovery
interrupts may not be entirely safe, see walsender
procsignal_sigusr1_handler thread [5];

* We probably terminate walsenders running inside an output plugin
with a virtual xact whose xmin is below the upstream's global xmin,
even though its catalog xmin is fine, in
ResolveRecoveryConflictWithSnapshot(...). Haven't been able to test
this. Need to only terminate them when the conflict affects relations
accessible in logical decoding, which likely needs the upstream to
send more info in WAL;

* logical decoding timeline following needs tests for cascading
physical replication where an intermediate node is promoted per
timeline following thread [3];

* walsender may need to maintain ThisTimeLineID in more places per
decoding timeline following v10 thread [3];

* it may be desirable to refactor the walsender to deliver cleaner
logical decoding timeline following per the decoding timeline
following v10 thread[3]

also:

* Nothing stops the user from disabling hot_standby_feedback on the
standby or dropping and re-creating the physical slot on the master,
causing needed catalog tuples to get vacuumed away. Since it's not
going to be safe to check slot shmem state from the
hot_standby_feedback verify hook and we let hot_standby_feedback
change at runtime this is going to be hard to fix comprehensively, so
we need to cope with what happens when feedback fails, but:

* We don't yet detect when upstream's catalog_xmin increases past our
needed catalog_xmin and needed catalog tuples are vacuumed away by the
upstream. So we don't invalidate the slot or terminate any active
decoding sessions using the slot. Active decoding sessions often won't
have a vtxid to use with ResolveRecoveryConflictWithVirtualXIDs(),
transaction cancel is not going to be sufficient, and anyway it'll
cancel too aggressively since it doesn't know it's safe to apply
changes that affect only (non-user-catalog) heap tables without
conflict with decoding sessions.

... so this is definitely NOT ready for commit. It does, however, make
logical decoding work on standby.

Next steps
===

Since it doesn't look practical to ensure there's never been a gap in
hot standby feedback or detect such a gap directly, I'm currently
looking at ways to reliably detect when the upstream has removed
tuples we need and error out. That means we need a way to tell when
upstream's catalog_xmin has advanced, which we don't currently have
from xlogs. Checkpoint's oldestXID is insufficient since advance
could've happened since last checkpoint.


Related threads
===

This series supercedes:
* Timeline following for logical slots [1]
https://www.postgresql.org/message-id/flat/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com
*  WIP: Failover Slots [2]
https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com


and incorporates the patches in:
* Logical decoding timeline following take II [3]
https://www.postgresql.org/message-id/flat/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com

* Send catalog_xmin separately in hot standby feedback [4]
https://www.postgresql.org/message-id/CAMsr+YFi-LV7S8ehnwUiZnb=1h_14PwQ25d-vyUNq-f5S5r=Zg@mail.gmail.com

*  Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() from
walsender?  [5]
https://www.postgresql.org/message-id/CAMsr+YFb3R-t5O0jPGvz9_nsAt2GwwZiLSnYu3=X6mR9RnrbEw@mail.gmail.com


Also relevant:
* Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() in walsender [6]
https://www.postgresql.org/message-id/CAMsr+YFb3R-t5O0jPGvz9_nsAt2GwwZiLSnYu3=X6mR9RnrbEw@mail.gmail.com


-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Andres Freund

Date:

21 November 2016, 19:14:44

Hi,

On 2016-11-21 16:17:58 +0800, Craig Ringer wrote:
> I've prepared a working initial, somewhat raw implementation for
> logical decoding on physical standbys.

Please attach. Otherwise in a year or two it'll be impossible to look
this up.

Andres

Re: Logical decoding on standby

From

Craig Ringer

Date:

22 November 2016, 02:20:41

On 22 November 2016 at 03:14, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2016-11-21 16:17:58 +0800, Craig Ringer wrote:
>> I've prepared a working initial, somewhat raw implementation for
>> logical decoding on physical standbys.
>
> Please attach. Otherwise in a year or two it'll be impossible to look
> this up.

Fair enough. Attached. Easy to apply with "git am".

I'm currently looking at making detection of replay conflict with a
slot work by separating the current catalog_xmin into two effective
parts - the catalog_xmin currently needed by any known slots
(ProcArray->replication_slot_catalog_xmin, as now), and the oldest
actually valid catalog_xmin where we know we haven't removed anything
yet.

That'll be recorded in a new CheckPoint.oldestCatalogXid field and in
ShmemVariableCache ( i.e. VariableCacheData.oldestCatalogXid ).

Vacuum will be responsible for advancing
VariableCacheData.oldestCatalogXid by writing an expanded
xl_heap_cleanup_info record with a new latestRemovedCatalogXid field
and then advancing the value in the ShmemVariableCache. Vacuum will
only remove rows of catalog or user-catalog tables that are older than
VariableCacheData.oldestCatalogXid.

This allows recovery on a standby to tell, based on the last
checkpoint + any xl_heap_cleanup_info records used to maintain
ShmemVariableCache, whether the upstream has removed catalog or
user-catalog records it needs. We can signal walsenders with running
xacts to terminate if their xmin passes the threshold, and when they
start a new xact they can check to see if they're still valid and bail
out.

xl_heap_cleanup_info isn't emitted much, but if adding a field there
is a problem we can instead add an additional xlog buffer that's only
appended when wal_level = logical.

Doing things this way avoids:

* the need for the standby to be able to tell at redo time whether a
RelFileNode is for a catalog or user-catalog relation without access
to relcache; or
* the need to add info on whether a catalog or user-catalog is being
affected to any xlog record that can cause a snapshot conflict on
standby; or
* a completely reliably way to ensure hot_standby_feedback can never
cease to affect the master even if the user does something dumb

at the cost of sometimes somewhat delaying removal of catalog or
user-catalog tuples when wal_level >= hot_standby, a new CheckPoint
field, and a new field in xl_heap_cleanup_info .

The above is not incorporated in the attached patch series, see the
prior post for status of the attached patches.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From

Craig Ringer

Date:

22 November 2016, 06:49:52

On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:

> I'm currently looking at making detection of replay conflict with a
> slot work by separating the current catalog_xmin into two effective
> parts - the catalog_xmin currently needed by any known slots
> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
> actually valid catalog_xmin where we know we haven't removed anything
> yet.

OK, more detailed plan.

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
are already held down by ProcArray's catalog_xmin. But that doesn't
mean we haven't removed newer tuples from specific relations and
logged that in xl_heap_clean, etc, including catalogs or user
catalogs, it only means the clog still exists for those XIDs. We don't
emit a WAL record when we advance oldestXid in
SetTransactionIdLimit(), and doing so is useless because vacuum will
have already removed needed tuples from needed catalogs before calling
SetTransactionIdLimit() from vac_truncate_clog(). We know that if
oldestXid is n, the true valid catalog_xmin where no needed tuples
have been removed must be >= n. But we need to know the lower bound of
valid catalog_xmin, which oldestXid doesn't give us.

So right now a standby has no way to reliably know if the catalog_xmin
requirement for a given replication slot can be satisfied. A standby
can't tell based on a xl_heap_cleanup_info record, xl_heap_clean
record, etc whether the affected table is a catalog or not, and
shouldn't generate conflicts for non-catalogs since otherwise it'll be
constantly clobbering walsenders.

A 2-phase advance of the global catalog_xmin would mean that
GetOldestXmin() would return a value from ShmemVariableCache, not the
oldest catalog xmin from ProcArray like it does now. (auto)vacuum
would then be responsible for:

* Reading the oldest catalog_xmin from procarray
* If it has advanced vs what's present in ShmemVariableCache, writing
a new xlog record type recording an advance of oldest catalog xmin
* advancing ShmemVariableCache's oldest catalog xmin

and would do so before it called GetOldestXmin via
vacuum_set_xid_limits() to determine what it can remove.

GetOldestXmin would return the ProcArray's copy of the oldest
catalog_xmin when in recovery, so we report it via hot_standby_fedback
to the upstream, it's recorded on our physical slot, and in turn
causes vacuum to advance the master's effective oldest catalog_xmin
for vacuum.

On the standby we'd replay the catalog_xmin advance record, advance
the standby's ShmemVariableCache's oldest catalog xmin, and check to
see if any replication slots, active or not, have a catalog_xmin <
than the new threshold. If none do, there's no conflict and we're
fine. If any do, we wait
max_standby_streaming_delay/max_standby_archive_delay as appropriate,
then generate recovery conflicts against all backends that have an
active replication slot based on the replication slot state in shmem.
Those backends - walsender or normal decoding backend - would promptly
die. New decoding sessions will check the ShmemVariableCache and
refuse to start if their catalog_xmin is < the threshold. Since we
advance it before generating recovery conflicts there's no race with
clients trying to reconnect after their backend is killed with a
conflict.

If we wanted to get fancy we could set the latches of walsender
backends at risk of conflicting and they could check
ShmemVariableCache's oldest valid catalog xmin, so they could send
immediate keepalives with reply_requested set and hopefully get flush
confirmation from the client and advance their catalog_xmin before we
terminate them as conflicting with recovery. But that can IMO be done
later/separately.

Going to prototype this.

Alternate approach:
---------------

The oldest xid in heap_xlog_cleanup_info is relation-specific, but the
standby has no way to know if it's a catalog relation or not during
redo and know whether to kill slots and decoding sessions based on its
latestRemovedXid. Same for xl_heap_clean and the other records that
can cause snapshot conflicts (xl_xlog_visible, xl_heap_freeze_page,
xl_btree_delete xl_btree_reuse_page, spgxlogVacuumRedirect).

Instead of adding a 2-phase advance of the global catalog_xmin, we can
instead add a rider to each of these records that identifies whether
it's a catalog table or not. This would only be emitted when wal_level
>= logical, but it *would* increase WAL sizes a bit when logical
decoding is enabled even if it's not going to be used on a standby.
The rider would be a simple:

typedef struct xl_rel_catalog_info
{ bool rel_accessible_from_logical_decoding;
} xl_catalog_info;

or similar. During redo we call a new
ResolveRecoveryConflictWithLogicalSlot function from each of those
records' redo routines that does what I outlined above.

This way add more info to more xlog records, and the upstream has to
use RelationIsAccessibleInLogicalDecoding() to set up the records when
writing the xlogs. In exchange, we don't have to add a new field to
CheckPoint or ShmemVariableCache or add a new xlog record type. It
seems the worse option to me.

(BTW, as comments on GetOldestSafeDecodingTransactionId() note, we
can't rely on KnownAssignedXidsGetOldestXmin() since it can be
incomplete at least on standby.)

-- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Robert Haas

Date:

22 November 2016, 19:55:27

On Tue, Nov 22, 2016 at 1:49 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:
>> I'm currently looking at making detection of replay conflict with a
>> slot work by separating the current catalog_xmin into two effective
>> parts - the catalog_xmin currently needed by any known slots
>> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
>> actually valid catalog_xmin where we know we haven't removed anything
>> yet.
>
> OK, more detailed plan.
>
> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
> are already held down by ProcArray's catalog_xmin. But that doesn't
> mean we haven't removed newer tuples from specific relations and
> logged that in xl_heap_clean, etc, including catalogs or user
> catalogs, it only means the clog still exists for those XIDs.

Really?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Logical decoding on standby

From

Craig Ringer

Date:

23 November 2016, 13:37:43

On 23 November 2016 at 03:55, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 22, 2016 at 1:49 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> I'm currently looking at making detection of replay conflict with a
>>> slot work by separating the current catalog_xmin into two effective
>>> parts - the catalog_xmin currently needed by any known slots
>>> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
>>> actually valid catalog_xmin where we know we haven't removed anything
>>> yet.
>>
>> OK, more detailed plan.
>>
>> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
>> are already held down by ProcArray's catalog_xmin. But that doesn't
>> mean we haven't removed newer tuples from specific relations and
>> logged that in xl_heap_clean, etc, including catalogs or user
>> catalogs, it only means the clog still exists for those XIDs.
>
> Really?

(Note the double negative above).

Yes, necessarily so. You can't look up xids older than the clog
truncation threshold at oldestXid, per our discussion on txid_status()
and traceable commit. But the tuples from that xact aren't guaranteed
to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)
which calls GetOldestXmin(...); that in turn scans ProcArray to find
the oldest xid any running xact cares about. It might bump it down
further if there's a replication slot requirement or based on
vacuum_defer_cleanup_age, but it doesn't care in the slightest about
oldestXmin.

oldestXmin cannot advance until vacuum has removed all tuples for that
xid and advanced the database's datfrozenxid. But a given oldestXmin
says nothing about which tuples, catalog or otherwise, still exist and
are acessible.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Robert Haas

Date:

26 November 2016, 15:40:10

On Wed, Nov 23, 2016 at 8:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
>>> are already held down by ProcArray's catalog_xmin. But that doesn't
>>> mean we haven't removed newer tuples from specific relations and
>>> logged that in xl_heap_clean, etc, including catalogs or user
>>> catalogs, it only means the clog still exists for those XIDs.
>>
>> Really?
>
> (Note the double negative above).
>
> Yes, necessarily so. You can't look up xids older than the clog
> truncation threshold at oldestXid, per our discussion on txid_status()
> and traceable commit. But the tuples from that xact aren't guaranteed
> to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)
> which calls GetOldestXmin(...); that in turn scans ProcArray to find
> the oldest xid any running xact cares about. It might bump it down
> further if there's a replication slot requirement or based on
> vacuum_defer_cleanup_age, but it doesn't care in the slightest about
> oldestXmin.
>
> oldestXmin cannot advance until vacuum has removed all tuples for that
> xid and advanced the database's datfrozenxid. But a given oldestXmin
> says nothing about which tuples, catalog or otherwise, still exist and
> are acessible.

Right.  Sorry, my mistake.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Logical decoding on standby

From

Craig Ringer

Date:

27 November 2016, 11:39:18

<p dir="ltr"><p dir="ltr">On 26 Nov. 2016 23:40, "Robert Haas" <<a
href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>>wrote:<br /> ><br /> > On Wed, Nov 23, 2016 at
8:37AM, Craig Ringer <<a href="mailto:craig@2ndquadrant.com">craig@2ndquadrant.com</a>> wrote:<br /> >
>>>The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,<br /> > >>> are already
helddown by ProcArray's catalog_xmin. But that doesn't<br /> > >>> mean we haven't removed newer tuples
fromspecific relations and<br /> > >>> logged that in xl_heap_clean, etc, including catalogs or user<br />
>>>> catalogs, it only means the clog still exists for those XIDs.<br /> > >><br /> > >>
Really?<br/> > ><br /> > > (Note the double negative above).<br /> > ><br /> > > Yes,
necessarilyso. You can't look up xids older than the clog<br /> > > truncation threshold at oldestXid, per our
discussionon txid_status()<br /> > > and traceable commit. But the tuples from that xact aren't guaranteed<br />
>> to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)<br /> > > which calls
GetOldestXmin(...);that in turn scans ProcArray to find<br /> > > the oldest xid any running xact cares about. It
mightbump it down<br /> > > further if there's a replication slot requirement or based on<br /> > >
vacuum_defer_cleanup_age,but it doesn't care in the slightest about<br /> > > oldestXmin.<br /> > ><br />
>> oldestXmin cannot advance until vacuum has removed all tuples for that<br /> > > xid and advanced the
database'sdatfrozenxid. But a given oldestXmin<br /> > > says nothing about which tuples, catalog or otherwise,
stillexist and<br /> > > are acessible.<br /> ><br /> > Right.  Sorry, my mistake.<p dir="ltr">Phew. Had me
worriedthere.<p dir="ltr">Thanks for looking over it. Prototype looks promising so far.

Re: Logical decoding on standby

From

Petr Jelinek

Date:

27 November 2016, 21:18:06

Hi,

I did look at the code a bit. The first 6 patches seem reasonable.
I don't understand why some patches are separate tbh (like 7-10, or 11).

About the 0009:
> diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
> index 9985e3e..4fa3ad4 100644
> --- a/contrib/pg_visibility/pg_visibility.c
> +++ b/contrib/pg_visibility/pg_visibility.c
> @@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
>      if (all_visible)
>      {
>          /* Don't pass rel; that will fail in recovery. */
> -        OldestXmin = GetOldestXmin(NULL, true);
> +        OldestXmin = GetOldestXmin(NULL, true, false);
>      }
>  
>      rel = relation_open(relid, AccessShareLock);
> @@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
>                   * a buffer lock. And this shouldn't happen often, so it's
>                   * worth being careful so as to avoid false positives.
>                   */
> -                RecomputedOldestXmin = GetOldestXmin(NULL, true);
> +                RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
>  
>                  if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
>                      record_corrupt_item(items, &tuple.t_self);
> diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
> index f524fc4..5b33c97 100644
> --- a/contrib/pgstattuple/pgstatapprox.c
> +++ b/contrib/pgstattuple/pgstatapprox.c
> @@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
>      TransactionId OldestXmin;
>      uint64        misc_count = 0;
>  
> -    OldestXmin = GetOldestXmin(rel, true);
> +    OldestXmin = GetOldestXmin(rel, true, false);
>      bstrategy = GetAccessStrategy(BAS_BULKREAD);
>  
>      nblocks = RelationGetNumberOfBlocks(rel);

This does not seem correct, you are sending false as pointer parameter.

0012:

I think there should be parameter saying if snapshot should be exported
or not and if user asks for it on standby it should fail.

0014 makes 0011 even more pointless.

Not going into deeper detail as this is still very WIP. I go agree with
the general design though.

This also replaces the previous timeline following and decoding
threads/CF entries so maybe those should be closed in CF?

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

28 November 2016, 03:16:52

>> --- a/contrib/pgstattuple/pgstatapprox.c
>> +++ b/contrib/pgstattuple/pgstatapprox.c
>> @@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
>>       TransactionId OldestXmin;
>>       uint64          misc_count = 0;
>>
>> -     OldestXmin = GetOldestXmin(rel, true);
>> +     OldestXmin = GetOldestXmin(rel, true, false);
>>       bstrategy = GetAccessStrategy(BAS_BULKREAD);
>>
>>       nblocks = RelationGetNumberOfBlocks(rel);
>
> This does not seem correct, you are sending false as pointer parameter.

Thanks. That's an oversight from the GetOldestXmin interface change
per your prior feedback. C doesn't care since null is 0 and false is
0, and I missed it when transforming the patch.

> 0012:
>
> I think there should be parameter saying if snapshot should be exported
> or not and if user asks for it on standby it should fail.

Sounds reasonable. That also means clients can suppress standby export
on master, which as we recently learned can be desirable sometimes.

> 0014 makes 0011 even more pointless.

Yeah, as I said, it's a bit WIP still and needs some rebasing and rearrangement.

> This also replaces the previous timeline following and decoding
> threads/CF entries so maybe those should be closed in CF?

I wasn't sure what to do about that, since it's all a set of related
functionality. I think it's going to get more traction as a single
"logical decoding onstandby" feature though, since the other parts are
hard to test and use in isolation. So yeah, probably, I'll do so
unless someone objects.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

07 December 2016, 06:05:54

On 21 November 2016 at 16:17, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> I've prepared a working initial, somewhat raw implementation for
> logical decoding on physical standbys.

Hi all

I've attached a significantly revised patch, which now incorporates
safeguards to ensure that we prevent decoding if the master has not
retained needed catalogs and cancel decoding sessions that are holding
up apply because they need too-old catalogs

The biggest change in this patch, and the main intrusive part, is that
procArray->replication_slot_catalog_xmin is no longer directly used by
vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
added, with a corresponding CheckPoint field. Vacuum notices if
procArray->replication_slot_catalog_xmin has advanced past
ShmemVariableCache->oldestCatalogXmin and writes a new xact rmgr
record with the new value before it copies it to oldestCatalogXmin.
This means that a standby can now reliably tell when catalogs are
about to be removed or become candidates for removal, so it can pause
redo until logical decoding sessions on the standby advance far enough
that their catalog_xmin passes that point. It also means that if our
hot_standby_feedback somehow fails to lock in the catalogs our slots
need on a standby, we can cancel sessions with a conflict with
recovery.

If wal_level is < logical this won't do anything, since
replication_slot_catalog_xmin and oldestCatalogXmin will both always
be 0.

Because oldestCatalogXmin advances eagerly as soon as vacuum sees the
new replication_slot_catalog_xmin, this won't impact catalog bloat.

Ideally this mechanism won't generally actually be needed, since
hot_standby_feedback stops the master from removing needed catalogs,
and we make an effort to ensure that the standby has
hot_standby_feedback enabled and is using a replication slot. We
cannot prevent the user from dropping and re-creating the physical
slot on the upstream, though, and it doesn't look simple to stop them
turning off hot_standby_feedback or turning off use of a physical slot
after creating logical slots, either. So we try to stop users shooting
themselves in the foot, but if they do it anyway we notice and cope
gracefully. Logging catalog_xmin also helps slots created on standbys
know where to start, and makes sure we can deal gracefully with a race
between hs_feedback and slot creation on a standby.

There can be a significant delay for slot creation on standby since we
have to wait until there's a new xl_running_xacts record logged. I'd
like to extend the hot_standby_feedback protocol a little to address
that and some other issues, but that's a separate step.

I haven't addressed Petr's point yet, that "there should be parameter
saying if snapshot should be exported
or not and if user asks for it on standby it should fail". Otherwise I
think it's looking fairly solid.

Due to the amount of churn I landed up flattening the patchset. It
probably makes sense to split it up, likely into the sequence of
changes listed in the commit message. I'll wait for a general opinion
on the validity of this approach first.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

On 22 December 2016 at 14:21, Craig Ringer <craig@2ndquadrant.com> wrote:

changes-in-0001-v2.diff shows the changes to PostgresNode.pm per
Michael's comments, and applies on top of 0001.

I also attach a patch to add a new CREATE_REPLICATION_SLOT option per
Petr's suggestion, so you can request a slot be created
WITHOUT_SNAPSHOT. This replaces the patch series's behaviour of
silently suppressing snapshot export when a slot was created on a
replica. It'll conflict (easily resolved) if applied on top of the
current series.

I have more to do before re-posting the full series, so waiting on
author at this point. The PostgresNode changes likely break later
tests, I'm just posting them so there's some progress here and so I
don't forget over the next few days' distraction.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Logical decoding on standby

From

Andrew Dunstan

Date:

23 December 2016, 15:48:39


On 12/22/2016 01:21 AM, Craig Ringer wrote:
>> +   my @fields = ('plugin', 'slot_type', 'datoid', 'database',
>> 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
>> +   my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
>> @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
>> '$slot_name'");
>> +   $result = undef if $result eq '';
>> +   # hash slice, see http://stackoverflow.com/a/16755894/398670 .
>> Couldn't this portion be made more generic? Other queries could
>> benefit from that by having a routine that accepts as argument an
>> array of column names for example.
> Yeah, probably. I'm not sure where it should live though - TestLib.pm ?
>
> Not sure if there's an idomatic way to  pass a string (in this case
> queyr) in Perl with a placeholder for interpolation of values (in this
> case columns). in Python you'd pass it with pre-defined
> %(placeholders)s for %.
>
>



For direct interpolation of an expression, there is this slightly 
baroque gadget:
   my $str = "here it is @{[ arbitrary expression here ]}";

For indirect interpolation using placeholders, there is
   my $str = sprintf("format string",...);

which works much like C except that the string is returned as a result 
instead of being the first argument.

And as we always say, TIMTOWTDI.


cheers

andrew (japh)

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

04 January 2017, 07:08:55

On 23 December 2016 at 18:11, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 22 December 2016 at 14:21, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> changes-in-0001-v2.diff shows the changes to PostgresNode.pm per
> Michael's comments, and applies on top of 0001.
>
> I also attach a patch to add a new CREATE_REPLICATION_SLOT option per
> Petr's suggestion, so you can request a slot be created
> WITHOUT_SNAPSHOT. This replaces the patch series's behaviour of
> silently suppressing snapshot export when a slot was created on a
> replica. It'll conflict (easily resolved) if applied on top of the
> current series.

OK, patch series updated.

0001 incorporates the changes requested by Michael Paquier. Simon
expressed his intention to commit this after updates, in the separate
thread for

The pg_lsn patch is gone; I worked around it using the server to work with LSNs.

0002 (endpos) is unchanged.

0003 is new, some minimal tests for pg_recvlogical. It can be squashed
with 0002 (pg_recvlogical --endpos) if desired.

0004 (pg_recvlogical wrapper) is unchanged.

0005 (new streaming rep tests) is updated for the changes in 0001,
otherwise unchanged. Simon said he wanted to commit this soon.

0006 (timeline following) is unchanged except for updates to be
compatible with 0001.

0007 is the optional snapshot export requested by Petr.

0008 is unchanged.

0009 is unchanged except for updates vs 0001 and use of the
WITHOUT_SNAPSHOT option added in 0007.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:

> so here's a rebased series on top of master. No other changes.

Now with actual patches.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Rebased series attached, on top of current master (which includes
logical replicaiton).

I'm inclined to think I should split out a few of the changes from
0005, roughly along the lines of the bullet points in its commit
message. Anyone feel strongly about how granular this should be?

This patch series is a pre-requisite for supporting logical
replication using a physical standby as a source, but does not its
self enable logical replication from a physical standby.



On 23 January 2017 at 23:03, Thom Brown <thom@linux.com> wrote:
> On 5 January 2017 at 01:21, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>>> so here's a rebased series on top of master. No other changes.
>>
>> Now with actual patches.
>
> Patch 5 no longer applies:
>
> patching file src/include/pgstat.h
> Hunk #1 FAILED at 745.
> 1 out of 1 hunk FAILED -- saving rejects to file src/include/pgstat.h.rej
>
> --- src/include/pgstat.h
> +++ src/include/pgstat.h
> @@ -745,7 +745,8 @@ typedef enum
>         WAIT_EVENT_SYSLOGGER_MAIN,
>         WAIT_EVENT_WAL_RECEIVER_MAIN,
>         WAIT_EVENT_WAL_SENDER_MAIN,
> -       WAIT_EVENT_WAL_WRITER_MAIN
> +       WAIT_EVENT_WAL_WRITER_MAIN,
> +       WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
>  } WaitEventActivity;
>
>  /* ----------
>
> Could you rebase?
>
> Thanks
>
> Thom



-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

> Patch 4 committed. Few others need rebase.

Since this patch series and initial data copy for logical replication
both add a facility for suppressing initial snapshot export on a
logical slot, I've dropped patch 0003 (make snapshot export on logical
slot creation) in favour of Petr's similar patch.

I will duplicate it in this patch series for ease of application. (The
version here is slightly extended over Petr's so I'll re-post the
modified version on the logical replication initial data copy thread
too).

The main thing I want to direct attention to for Simon, as committer,
is the xlog'ing of VACUUM's xid threshold before we advance it and
start removing tuples. This is necessary for the standby to know
whether a given replication slot is safe to use and fail with conflict
with recovery if it is not, or if it becomes unsafe due to master
vacuum activity. Note that we can _not_ use the various vacuum records
for this because we don't know which are catalogs and which aren't;
we'd have to add a separate "is catalog" field to each vacuum xlog
record, which is undesirable. The downstream can't look up whether
it's a catalog or not because it doesn't have relcache/syscache access
during decoding.

This change might look a bit similar to the vac_truncate_clog change
in the txid_status patch, but it isn't really related. The txid_status
change is about knowing when we can safely look up xids in clog and
preventing a race with clog truncation. This change is about knowing
when we can know all catalog tuples for a given xid will still be in
the heap, not vacuumed away. Both are about making sure standbys know
more about the state of the system in a low-cost way, though.

WaitForMasterCatalogXminReservation(...) in logical.c is also worth
looking more closely at.



-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 19 March 2017 at 18:02, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

> Patch 1 fails since feature has already been applied. If other reason,
> let me know.

Nope, that's fine.

Rebased attached.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 20 March 2017 at 17:03, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 20 March 2017 at 14:57, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
>
>> 2.1 Why does call to ReplicationSlotAcquire() move earlier in
>> pg_logical_slot_get_changes_guts()?
>
> That appears to be an oversight from an earlier version where it
> looped over timelines in pg_logical_slot_get_changes_guts . Reverted.
>
>> 2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really
>> documented well.
>> The setting
>>   sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
>> should be
>>   sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
>
> Definitely wrong. Fixed.

Attached.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Hi all

Updated timeline following patch attached.

There's a change in read_local_xlog_page to ensure we maintain
ThisTimeLineID properly, otherwise it's just comment changes.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Teach-xlogreader-to-follow-timeline-switches.patch

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

22 March 2017, 11:53:14

On 22 March 2017 at 10:51, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> Updated timeline following patch attached.
>
> There's a change in read_local_xlog_page to ensure we maintain
> ThisTimeLineID properly, otherwise it's just comment changes.

OK, so we're looking OK with the TL following.

I'm splitting up the rest of the decoding on standby patch set with
the goal of getting minimal functionality for creating and managing
slots on standbys in, so we can maintain slots on standbys and use
them when the standby is promoted to master.

The first, to send catalog_xmin separately to the global xmin on
hot_standby_feedback and store it in the upstream physical slot's
catalog_xmin, is attached.

These are extracted directly from the logical decoding on standby
patch, with comments by Petr and Andres made re the relevant code
addressed.

I will next be working on a bare-minimum facility for creating and
advancing logical slots on a replica without support for buffering
changes, creating historic snapshots or invoking output plugin. The
slots will become usable after the replica is promoted. They'll track
their own restart_lsn, etc, and will keep track of xids so they can
manage their catalog_xmin, so there'll be no need for dodgy slot
syncing from the master, but they'll avoid most of the complex and
messy bits. The application will be expected to make sure a slot on
the master exists and is advanced before the corresponding slot on the
replica to protect required catalogs.

Then if there's agreement that it's the right way forward I can add
the catalog_xmin xlog'ing stuff as the next patch.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote:

> I'm splitting up the rest of the decoding on standby patch set with
> the goal of getting minimal functionality for creating and managing
> slots on standbys in, so we can maintain slots on standbys and use
> them when the standby is promoted to master.
>
> The first, to send catalog_xmin separately to the global xmin on
> hot_standby_feedback and store it in the upstream physical slot's
> catalog_xmin, is attached.
>
> These are extracted directly from the logical decoding on standby
> patch, with comments by Petr and Andres made re the relevant code
> addressed.

I've reduced your two patches back to one with a smaller blast radius.

I'll commit this tomorrow morning, barring objections.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Report-catalog_xmin-separately-to-xmin-in-hot-standb.v2.patch

Re: [HACKERS] Logical decoding on standby

From

Andres Freund

Date:

22 March 2017, 19:17:36

On 2017-03-22 15:59:42 +0000, Simon Riggs wrote:
> On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:
> 
> >> The parts I think are important for Pg10 are:
> >
> >> * Ability to create logical slots on replicas
> >
> > Doesn't this also imply recovery conflicts on DROP DATABASE?
> 
> Not needed until the slot is in use, which is a later patch.

Hm? We need to drop slots, if they can exist / be created, on a standby,
and they're on a dropped database.  Otherwise we'll reserve resources,
while anyone connecting to the slot will likely just receive errors
because the database doesn't exist anymore.  It's also one of the
patches that can quite easily be developed / reviewed, because there
really isn't anything complicated about it.  Most of the code is already
in Craig's patch, it just needs some adjustments.

> > Besides,
> > allowing to drop all slots using a database upon DROP DATABASE, is a
> > useful thing on its own.
> 
> Sure but that's a separate feature unrelated to this patch and we're
> running out of time.

Hm? The patch implemented it.

> >> * Ability to advance (via feedback or via SQL function) - no need to
> >> actually decode and call output plugins at al
> >
> > That pretty much requires decoding, otherwise you really don't know how
> > much WAL you have to retain.
> 
> Knowing how much WAL to retain is key.
> 
> Why would decoding tell you how much WAL to retain?

Because decoding already has the necessary logic?  (You need to retain
enough WAL to restart decoding for all in-progress transactions etc).

> We tried to implement this automatically from the master, which was
> rejected. So the only other way is manually. We need one or the other.

I think the current approach is roughly the right way - but that doesn't
make the patch ready.

Greetings,

Andres Freund

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

23 March 2017, 01:55:53

On 22 March 2017 at 21:06, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:
>> > 0002 should be doable as a whole this release, I have severe doubts that
>> > 0003 as a whole has a chance for 10 - the code is in quite a raw shape,
>> > there's a significant number of open ends.  I'd suggest breaking of bits
>> > that are independently useful, and work on getting those committed.
>>
>> That would be my preference too.
>
>
>> The parts I think are important for Pg10 are:
>
>> * Ability to create logical slots on replicas
>
> Doesn't this also imply recovery conflicts on DROP DATABASE?  Besides,
> allowing to drop all slots using a database upon DROP DATABASE, is a
> useful thing on its own.

Definitely beneficial, otherwise recovery will stop until you drop
slots, which isn't ideal.

>> * Ability to advance (via feedback or via SQL function) - no need to
>> actually decode and call output plugins at al
>
> That pretty much requires decoding, otherwise you really don't know how
> much WAL you have to retain.

Yes, and to update restart_lsn and catalog_xmin correctly.

I was thinking that by disallowing snapshot use and output plugin
invocation we'd avoid the need to support cancellation on recovery
conflicts, etc, simplifying things considerably.

>> * Ability to drop logical slots on replicas
>
> That shouldn't actually require any changes, no?

It doesn't, it works as-is. I have NFI why I wrote that.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Logical decoding on standby

From

Andres Freund

Date:

23 March 2017, 02:31:31

On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:
> On 22 March 2017 at 21:06, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> > On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:
> >> > 0002 should be doable as a whole this release, I have severe doubts that
> >> > 0003 as a whole has a chance for 10 - the code is in quite a raw shape,
> >> > there's a significant number of open ends.  I'd suggest breaking of bits
> >> > that are independently useful, and work on getting those committed.
> >>
> >> That would be my preference too.
> >
> >
> >> The parts I think are important for Pg10 are:
> >
> >> * Ability to create logical slots on replicas
> >
> > Doesn't this also imply recovery conflicts on DROP DATABASE?  Besides,
> > allowing to drop all slots using a database upon DROP DATABASE, is a
> > useful thing on its own.
> 
> Definitely beneficial, otherwise recovery will stop until you drop
> slots, which isn't ideal.

s/isn't ideal/not acceptable/ ;)


> >> * Ability to advance (via feedback or via SQL function) - no need to
> >> actually decode and call output plugins at al
> >
> > That pretty much requires decoding, otherwise you really don't know how
> > much WAL you have to retain.
> 
> Yes, and to update restart_lsn and catalog_xmin correctly.

> I was thinking that by disallowing snapshot use and output plugin
> invocation we'd avoid the need to support cancellation on recovery
> conflicts, etc, simplifying things considerably.

That seems like it'd end up being pretty hacky - the likelihood that
we'd run into snapbuild error cross-checks seems very high.

Greetings,

Andres Freund

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

23 March 2017, 04:05:16

On 23 March 2017 at 00:17, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-22 15:59:42 +0000, Simon Riggs wrote:
>> On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:
>>
>> >> The parts I think are important for Pg10 are:
>> >
>> >> * Ability to create logical slots on replicas
>> >
>> > Doesn't this also imply recovery conflicts on DROP DATABASE?
>>
>> Not needed until the slot is in use, which is a later patch.
>
> Hm? We need to drop slots, if they can exist / be created, on a standby,
> and they're on a dropped database.  Otherwise we'll reserve resources,
> while anyone connecting to the slot will likely just receive errors
> because the database doesn't exist anymore.  It's also one of the
> patches that can quite easily be developed / reviewed, because there
> really isn't anything complicated about it.  Most of the code is already
> in Craig's patch, it just needs some adjustments.

Right, I'm not too concerned about doing that, and it's next on my
TODO as I clean up the split patch series.

>> >> * Ability to advance (via feedback or via SQL function) - no need to
>> >> actually decode and call output plugins at al
>> >
>> > That pretty much requires decoding, otherwise you really don't know how
>> > much WAL you have to retain.
>>
>> Knowing how much WAL to retain is key.
>>
>> Why would decoding tell you how much WAL to retain?
>
> Because decoding already has the necessary logic?  (You need to retain
> enough WAL to restart decoding for all in-progress transactions etc).

Indeed; after all, standby status updates from the decoding client
only contain the *flushed* LSN. The downstream doesn't know the
restartpoint LSN, it must be tracked by the upstream.

It's also necessary to maintain our catalog_xmin correctly on the
standby so we can send it via hot_standby_feedback to a physical
replication slot used on the master, ensuring the master doesn't
remove catalog tuples we may still need.

> I don't know what you're on about with that statement.  I've spent a
> good chunk of time looking at the 0003 patch, even though it's large
> and contains a lot of different things.  I suggested splitting things up.
> I even suggested what to move earlier after Craig agreed with that
> sentiment, in the mail you're replying to, because it seems
> independently doable.

I really appreciate the review, as I'm all too aware of how time
consuming it can be.

From my PoV, the difficulty I'm in is that this patch series has
languished for most of the Pg 10 release cycle with no real input from
stakeholders in the logical decoding area, so while the review is
important, the fact that it's now means that it pretty comprehensively
blocks the patch for Pg 10. I asked on list for input on structure
(if/how to split it up) literally months ago, for example.

I've been struggling to get some kind of support for logical decoding
on standbys for most of two release cycles, and there are people
climbing the walls wanting it. I'm trying to make sure it's done
right, but I can't do that alone, and it's hard to progress when I
don't know what will be expected until it's too late to do anything
about it.

I guess all we can do at this point is get the foundations in place
and carry on for Pg 11, where the presence of in-core logical
replication will offer a lever to actually push this in. In the mean
time I'll have to continue carrying the out-of-tree failover slots
patch for people who use logical decoding and want it to be reliable.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

23 March 2017, 04:14:07

On 23 March 2017 at 07:31, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:

>> I was thinking that by disallowing snapshot use and output plugin
>> invocation we'd avoid the need to support cancellation on recovery
>> conflicts, etc, simplifying things considerably.
>
> That seems like it'd end up being pretty hacky - the likelihood that
> we'd run into snapbuild error cross-checks seems very high.

TBH I'm not following this. But I haven't touched snapbuild much yet,
Petr's done much more with snapbuild than I have.

We're not going to have robust logical replication that's suitable for
HA and failover use on high load systems until 2020 or so, with Pg 12.
We'll need concurrent decoding and apply, which nobody's even started
on AFAIK, we'll need sequence replication, and more.

So I'd really, really like to get some kind of HA picture other than
"none" in for logical decoding based systems. If it's imperfect, it's
still something.

I wish we'd just proceeded with failover slots. They were blocked in
favour of decoding on standby, and HA is possible if we have decoding
on standby with some more work by the application. But now we'll have
neither. If we'd just done failover slots we could've had logical
replication able to follow failover in Pg 10.

What do _you_ see as the minimum acceptable way to achieve the ability
for a logical decoding client to follow failover of an upstream to a
physical standby? In the end, you're one of the main people whose view
carries weight in this area, and I don't want to develop yet another
approach only to have it knocked back once the work is done.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Logical decoding on standby

From

Andres Freund

Date:

23 March 2017, 04:39:45

On 2017-03-23 09:14:07 +0800, Craig Ringer wrote:
> On 23 March 2017 at 07:31, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:
> 
> >> I was thinking that by disallowing snapshot use and output plugin
> >> invocation we'd avoid the need to support cancellation on recovery
> >> conflicts, etc, simplifying things considerably.
> >
> > That seems like it'd end up being pretty hacky - the likelihood that
> > we'd run into snapbuild error cross-checks seems very high.
> 
> TBH I'm not following this. But I haven't touched snapbuild much yet,
> Petr's done much more with snapbuild than I have.

We can't just assume that snapbuild is going to work correctly when it's
prerequisites - pinned xmin horizon - isn't working.

> We're not going to have robust logical replication that's suitable for
> HA and failover use on high load systems until 2020 or so, with Pg 12.
> We'll need concurrent decoding and apply, which nobody's even started
> on AFAIK, we'll need sequence replication, and more.

These seem largely unrelated to the topic at hand(nor do I agree on all
of them).

> So I'd really, really like to get some kind of HA picture other than
> "none" in for logical decoding based systems. If it's imperfect, it's
> still something.

I still think decoding-on-standby is simply not the right approach as
the basic/first HA approach for logical rep.  It's a nice later-on
feature.  But that's an irrelevant aside.

I don't understand why you're making a "fundamental" argument here - I'm
not arguing against the goals of the patch at all. I want as much stuff
committed as we can in a good shape.

> What do _you_ see as the minimum acceptable way to achieve the ability
> for a logical decoding client to follow failover of an upstream to a
> physical standby? In the end, you're one of the main people whose view
> carries weight in this area, and I don't want to develop yet another

I think your approach here wasn't that bad? There's a lot of cleaning
up/shoring up needed, and we probably need a smarter feedback system.  I
don't think anybody here has objected to the fundamental approach?

Greetings,

Andres Freund

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

23 March 2017, 07:14:02

On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote:

> We can't just assume that snapbuild is going to work correctly when it's
> prerequisites - pinned xmin horizon - isn't working.

Makes sense.

>> What do _you_ see as the minimum acceptable way to achieve the ability
>> for a logical decoding client to follow failover of an upstream to a
>> physical standby? In the end, you're one of the main people whose view
>> carries weight in this area, and I don't want to develop yet another
>
> I think your approach here wasn't that bad? There's a lot of cleaning
> up/shoring up needed, and we probably need a smarter feedback system.  I
> don't think anybody here has objected to the fundamental approach?

That's useful, thanks.

I'm not arguing that the patch as it stands is ready, and appreciate
the input re the general design.

> I still think decoding-on-standby is simply not the right approach as
> the basic/first HA approach for logical rep.  It's a nice later-on
> feature.  But that's an irrelevant aside.

I don't really agree that it's irrelevant.

Right now Pg has no HA capability for logical decoding clients. We've
now added logical replication, but it has no way to provide for
upstream node failure and ensure a consistent switch-over, whether to
a logical or physical replica. Since real world servers fail or need
maintenance, this is kind of a problem for practical production use.

Because of transaction serialization for commit-time order replay,
logical replication experiences saw-tooth replication lag, where large
or long xacts such as batch jobs effectively stall all later xacts
until they are fully replicated. We cannot currently start replicating
a big xact until it commits on the upstream, so that lag can easily be
~2x the runtime on the upstream.

So while you can do sync rep on a logical standby, it tends to result
in big delays on COMMITs relative to physical rep, even if app are
careful to keep transactions small. When the app DR planning people
come and ask you what the max data loss window / max sync rep lag is,
you have to say ".... dunno? depends on what else was running on the
server at the time."

AFAICT, changing those things will require the ability to begin
streaming reorder buffers for big xacts before commit, which as the
logical decoding on 2PC thread shows is not exactly trivial. We'll
also need to be able to apply them concurrently with other xacts on
the other end. Those are both big and complex things IMO, and I'll be
surprised if we can do either in Pg11 given that AFAIK nobody has even
started work on either of them or has a detailed plan.

Presuming we get some kind of failover to logical replica upstreams
into Pg11, it'll have significant limitations relative to what we can
deliver to users by using physical replication. Especially when it
comes to bounded-time lag for HA, sync rep, etc. And I haven't seen a
design for it, though Petr and I have discussed some with regards to
pglogical.

That's why I think we need to do HA on the physical side first.
Because it's going to take a long time to get equivalent functionality
for logical rep based upstreams, and when it is we'll still have to
teach management tools and other non-logical-rep logical decoding
clients about the new way of doing things.  Wheras for physical HA
setups to support logical downstreams requires only relatively minor
changes and gets us all the physical HA features _now_.

That's why we pursued failover slots - as a simple, minimal solution
to allowing logical decoding clients to inter-operate with Pg in a
physical HA configuration. TBH, I still think we should just add them.
Sure, they don't help us achieve decoding on standby, but they're a
lot simpler and they help Pg's behaviour with slots match user
expectations for how the rest of Pg behaves, i.e. if it's on the
master it'll be on the replica too.  And as you've said, decoding on
standby is a nice-to-have, wheras I think some kind of HA support is
rather more important.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

23 March 2017, 07:30:42

On 23 March 2017 at 00:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
> On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> I'm splitting up the rest of the decoding on standby patch set with
>> the goal of getting minimal functionality for creating and managing
>> slots on standbys in, so we can maintain slots on standbys and use
>> them when the standby is promoted to master.
>>
>> The first, to send catalog_xmin separately to the global xmin on
>> hot_standby_feedback and store it in the upstream physical slot's
>> catalog_xmin, is attached.
>>
>> These are extracted directly from the logical decoding on standby
>> patch, with comments by Petr and Andres made re the relevant code
>> addressed.
>
> I've reduced your two patches back to one with a smaller blast radius.
>
> I'll commit this tomorrow morning, barring objections.

Thanks. I was tempted to refactor GetOldestXmin to use flags myself,
but thought it might be at higher risk of objections. Since Eiji Seki
has shown that there are other uses for excluding particular things
from GetOldestXmin it and that's committed now, it's nice to have the
impact of this patch reduced.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Logical decoding on standby

From

Andres Freund

Date:

23 March 2017, 07:41:26

On 2017-03-23 12:14:02 +0800, Craig Ringer wrote:
> On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote:
> > I still think decoding-on-standby is simply not the right approach as
> > the basic/first HA approach for logical rep.  It's a nice later-on
> > feature.  But that's an irrelevant aside.
> 
> I don't really agree that it's irrelevant.

I'm not sure we have enough time for either getting some parts of your
patch in, or for figuring out long term goals. But we definitely don't
have time for both.

- Andres

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

23 March 2017, 08:11:13

On 23 March 2017 at 12:41, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-23 12:14:02 +0800, Craig Ringer wrote:
>> On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote:
>> > I still think decoding-on-standby is simply not the right approach as
>> > the basic/first HA approach for logical rep.  It's a nice later-on
>> > feature.  But that's an irrelevant aside.
>>
>> I don't really agree that it's irrelevant.
>
> I'm not sure we have enough time for either getting some parts of your
> patch in, or for figuring out long term goals. But we definitely don't
> have time for both.

Fair.



-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

23 March 2017, 11:07:37

On 23 March 2017 at 00:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
> On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> I'm splitting up the rest of the decoding on standby patch set with
>> the goal of getting minimal functionality for creating and managing
>> slots on standbys in, so we can maintain slots on standbys and use
>> them when the standby is promoted to master.
>>
>> The first, to send catalog_xmin separately to the global xmin on
>> hot_standby_feedback and store it in the upstream physical slot's
>> catalog_xmin, is attached.
>>
>> These are extracted directly from the logical decoding on standby
>> patch, with comments by Petr and Andres made re the relevant code
>> addressed.
>
> I've reduced your two patches back to one with a smaller blast radius.
>
> I'll commit this tomorrow morning, barring objections.

This needs rebasing on top of

commit af4b1a0869bd3bb52e5f662e4491554b7f611489
Author: Simon Riggs <simon@2ndQuadrant.com>
Date:   Wed Mar 22 16:51:01 2017 +0000

    Refactor GetOldestXmin() to use flags

    Replace ignoreVacuum parameter with more flexible flags.

    Author: Eiji Seki
    Review: Haribabu Kommi


That patch landed up using PROCARRAY flags directly as flags to
GetOldestXmin, so it doesn't make much sense to add a flag like
PROCARRAY_REPLICATION_SLOTS . There won't be any corresponding PROC_
flag for PGXACT->vacuumFlags, replication slot xmin and catalog_xmin
are global state not tracked in individual proc entries.

Rather than add some kind of "PROC_RESERVED" flag in proc.h that would
never be used and only exist to reserve a bit for use for
PROCARRAY_REPLICATION_SLOTS, which we'd use a flag to GetOldestXmin, I
added a new argument to GetOldestXmin like the prior patch did.

If preferred I can instead add

proc.h:

#define PROC_RESERVED 0x20

procarray.h:

#define PROCARRAY_REPLICATION_SLOTS 0x20

and then test for (flags & PROCARRAY_REPLICATION_SLOTS)

but that's kind of ugly to say the least, I'd rather just add another argument.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Report-catalog_xmin-separately-to-xmin-in-hot-standb.patch

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

23 March 2017, 12:44:47

On 23 March 2017 at 16:07, Craig Ringer <craig@2ndquadrant.com> wrote:

> If preferred I can instead add
>
> proc.h:
>
> #define PROC_RESERVED 0x20
>
> procarray.h:
>
> #define PROCARRAY_REPLICATION_SLOTS 0x20
>
> and then test for (flags & PROCARRAY_REPLICATION_SLOTS)

Attached done that way.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Report-catalog_xmin-separately-to-xmin-in-hot-standb.patch

Re: [HACKERS] Logical decoding on standby

From

Craig Ringer

Date:

24 March 2017, 09:23:11

On 23 March 2017 at 17:44, Craig Ringer <craig@2ndquadrant.com> wrote:

Minor update to catalog_xmin walsender patch to fix failure to
parenthesize definition of PROCARRAY_PROC_FLAGS_MASK .

This one's ready to go. Working on drop slots on DB drop now.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0001-Report-catalog_xmin-separately-to-xmin-in-hot-standb.patch

Re: Logical decoding on standby

From

Simon Riggs

Date:

25 March 2017, 17:10:38

On 24 March 2017 at 06:23, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 23 March 2017 at 17:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> Minor update to catalog_xmin walsender patch to fix failure to
> parenthesize definition of PROCARRAY_PROC_FLAGS_MASK .
>
> This one's ready to go. Working on drop slots on DB drop now.

Committed. Next!

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

27 March 2017, 05:01:42

On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote:

> Have you checked how high the overhead of XLogReadDetermineTimeline is?
> A non-local function call, especially into a different translation-unit
> (no partial inlining), for every single page might end up being
> noticeable.  That's fine in the cases it actually adds functionality,
> but for a master streaming out data, that's not actually adding
> anything.

I haven't been able to measure any difference. But, since we require
the caller to ensure a reasonably up to date ThisTimeLineID, maybe
it's worth adding an inlineable function for the fast-path that tests
the "page cached" and "timeline is current and unchanged" conditions?

//xlogutils.h:
static inline void XLogReadDetermineTimeline(...)
{  ... first test for page already read-in and valid ...  ... second test for ThisTimeLineId ...
XLogReadCheckTimeLineChange(...)
}

XLogReadCheckTimeLineChange(...)
{  ... rest of function
}

(Yes, I know "inline" means little, but it's a hint for readers)

I'd rather avoid using a macro since it'd be pretty ugly, but it's
also an option if an inline func is undesirable.

#define XLOG_READ_DETERMINE_TIMELINE \ do { \   ... same as above ... } while (0);

Can be done after CF if needed anyway, it's just fiddling some code
around. Figured I'd mention though.

>> +             /*
>> +              * To avoid largely duplicating ReplicationSlotDropAcquired() or
>> +              * complicating it with already_locked flags for ProcArrayLock,
>> +              * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
>> +              * just release our ReplicationSlotControlLock to drop the slot.
>> +              *
>> +              * There's no race here: we acquired this slot, and no slot "behind"
>> +              * our scan can be created or become active with our target dboid due
>> +              * to our exclusive lock on the DB.
>> +              */
>> +             LWLockRelease(ReplicationSlotControlLock);
>> +             ReplicationSlotDropAcquired();
>> +             LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
>
> I don't see much problem with this, but I'd change the code so you
> simply do a goto restart; if you released the slot.  Then there's a lot
> less chance / complications around temporarily releasing
> ReplicationSlotControlLock

I don't quite get this. I suspect I'm just not seeing the implications
as clearly as you do.

Do you mean we should restart the whole scan of the slot array if we
drop any slot? That'll be O(n log m) but since we don't expect to be
working on a big array or a lot of slots it's unlikely to matter.

The patch coming soon will assume we'll restart the whole scan, anyway.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

27 March 2017, 09:08:08

Hi

Here's the next patch in the split-up series, drop db-specific
(logical) replication slots on DROP DATABASE.

Current behaviour is to ERROR if logical slots exist on the DB,
whether in-use or not.

With this patch we can DROP a database if it has logical slots so long
as they are not active. I haven't added any sort of syntax for this,
it's just done unconditionally.

I don't see any sensible way to stop a slot becoming active after we
check for active slots and begin the actual database DROP, since
ReplicationSlotAcquire will happily acquire a db-specific slot for a
different DB and the only lock it takes is a shared lock on
ReplicationSlotControlLock, which we presumably don't want to hold
throughout DROP DATABASE.

So this patch makes ReplicationSlotAcquire check that the slot
database matches the current database and refuse to acquire the slot
if it does not. The only sensible reason to acquire a slot from a
different DB is to drop it, and then it's only a convenience at best.
Slot drop is the only user-visible behaviour change, since all other
activity on logical slots happened when the backend was already
connected to the slot's DB. Appropriate docs changes have been made
and tests added.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

0001-Make-DROP-DATABASE-drop-logical-slots-for-the-DB.patch

Re: Logical decoding on standby

From

Craig Ringer

Date:

27 March 2017, 11:03:48

On 27 March 2017 at 14:08, Craig Ringer <craig@2ndquadrant.com> wrote:

> So this patch makes ReplicationSlotAcquire check that the slot
> database matches the current database and refuse to acquire the slot
> if it does not.

New patch attached that drops above requirement, so slots can still be
dropped from any DB.

This introduces a narrow race window where DROP DATABASE may ERROR if
somebody connects to a different database and runs a
pg_drop_replication_slot(...) for one of the slots being dropped by
DROP DATABASE after we check for active slots but before we've dropped
the slot. But it's hard to hit and it's pretty harmless; the worst
possible result is dropping one or more of the slots before we ERROR
out of the DROP. But you clearly didn't want them anyway, since you
were dropping the DB and dropping some slots at the same time.

I think this one's ready to go.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

drop-database-drop-slots-v2.patch

Re: Logical decoding on standby

From

Simon Riggs

Date:

27 March 2017, 11:20:58

On 27 March 2017 at 09:03, Craig Ringer <craig@2ndquadrant.com> wrote:

> I think this one's ready to go.

Looks like something I could commit. Full review by me while offline
today, aiming to commit tomorrow barring issues raised.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

27 March 2017, 16:58:37

On 27 March 2017 at 16:20, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
> On 27 March 2017 at 09:03, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> I think this one's ready to go.
>
> Looks like something I could commit. Full review by me while offline
> today, aiming to commit tomorrow barring issues raised.

Great.

Meanwhile I'm going to be trying to work with Stas on 2PC logical
decoding, while firming up the next patches in this series to see if
we can progress a bit further.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Andres Freund

Date:

28 March 2017, 18:22:38

Hi,

On 2017-03-27 16:03:48 +0800, Craig Ringer wrote:
> On 27 March 2017 at 14:08, Craig Ringer <craig@2ndquadrant.com> wrote:
> 
> > So this patch makes ReplicationSlotAcquire check that the slot
> > database matches the current database and refuse to acquire the slot
> > if it does not.
> 
> New patch attached that drops above requirement, so slots can still be
> dropped from any DB.
> 
> This introduces a narrow race window where DROP DATABASE may ERROR if
> somebody connects to a different database and runs a
> pg_drop_replication_slot(...) for one of the slots being dropped by
> DROP DATABASE after we check for active slots but before we've dropped
> the slot. But it's hard to hit and it's pretty harmless; the worst
> possible result is dropping one or more of the slots before we ERROR
> out of the DROP. But you clearly didn't want them anyway, since you
> were dropping the DB and dropping some slots at the same time.
> 
> I think this one's ready to go.
> 
> -- 
>  Craig Ringer                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

> From 99d5313d3a265bcc57ca6845230b9ec49d188710 Mon Sep 17 00:00:00 2001
> From: Craig Ringer <craig@2ndquadrant.com>
> Date: Wed, 22 Mar 2017 13:21:09 +0800
> Subject: [PATCH] Make DROP DATABASE drop logical slots for the DB
> 
> Automatically drop all logical replication slots associated with a
> database when the database is dropped.
> ---
>  doc/src/sgml/func.sgml                             |  3 +-
>  doc/src/sgml/protocol.sgml                         |  2 +
>  src/backend/commands/dbcommands.c                  | 32 +++++---
>  src/backend/replication/slot.c                     | 88 ++++++++++++++++++++++
>  src/include/replication/slot.h                     |  1 +
>  src/test/recovery/t/006_logical_decoding.pl        | 40 +++++++++-
>  .../recovery/t/010_logical_decoding_timelines.pl   | 30 +++++++-
>  7 files changed, 182 insertions(+), 14 deletions(-)
> 
> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> index ba6f8dd..78508d7 100644
> --- a/doc/src/sgml/func.sgml
> +++ b/doc/src/sgml/func.sgml
> @@ -18876,7 +18876,8 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
>         <entry>
>          Drops the physical or logical replication slot
>          named <parameter>slot_name</parameter>. Same as replication protocol
> -        command <literal>DROP_REPLICATION_SLOT</>.
> +        command <literal>DROP_REPLICATION_SLOT</>. For logical slots, this must
> +        be called when connected to the same database the slot was created on.
>         </entry>
>        </row>
>  
> diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
> index b3a5026..5f97141 100644
> --- a/doc/src/sgml/protocol.sgml
> +++ b/doc/src/sgml/protocol.sgml
> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
>       <para>
>        Drops a replication slot, freeing any reserved server-side resources. If
>        the slot is currently in use by an active connection, this command fails.
> +      If the slot is a logical slot that was created in a database other than
> +      the database the walsender is connected to, this command fails.
>       </para>
>       <variablelist>
>        <varlistentry>

Shouldn't the docs in the drop database section about this?


> +void
> +ReplicationSlotsDropDBSlots(Oid dboid)
> +{
> +    int            i;
> +
> +    if (max_replication_slots <= 0)
> +        return;
> +
> +restart:
> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +    for (i = 0; i < max_replication_slots; i++)
> +    {
> +        ReplicationSlot *s;
> +        NameData slotname;
> +        int active_pid;
> +
> +        s = &ReplicationSlotCtl->replication_slots[i];
> +
> +        /* cannot change while ReplicationSlotCtlLock is held */
> +        if (!s->in_use)
> +            continue;
> +
> +        /* only logical slots are database specific, skip */
> +        if (!SlotIsLogical(s))
> +            continue;
> +
> +        /* not our database, skip */
> +        if (s->data.database != dboid)
> +            continue;
> +
> +        /* Claim the slot, as if ReplicationSlotAcquire()ing. */
> +        SpinLockAcquire(&s->mutex);
> +        strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
> +        NameStr(slotname)[NAMEDATALEN-1] = '\0';
> +        active_pid = s->active_pid;
> +        if (active_pid == 0)
> +        {
> +            MyReplicationSlot = s;
> +            s->active_pid = MyProcPid;
> +        }
> +        SpinLockRelease(&s->mutex);
> +
> +        /*
> +         * We might fail here if the slot was active. Even though we hold an
> +         * exclusive lock on the database object a logical slot for that DB can
> +         * still be active if it's being dropped by a backend connected to
> +         * another DB or is otherwise acquired.
> +         *
> +         * It's an unlikely race that'll only arise from concurrent user action,
> +         * so we'll just bail out.
> +         */
> +        if (active_pid)
> +            elog(ERROR, "replication slot %s is in use by pid %d",
> +                  NameStr(slotname), active_pid);
> +
> +        /*
> +         * To avoid largely duplicating ReplicationSlotDropAcquired() or
> +         * complicating it with already_locked flags for ProcArrayLock,
> +         * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
> +         * just release our ReplicationSlotControlLock to drop the slot.
> +         *
> +         * For safety we'll restart our scan from the beginning each
> +         * time we release the lock.
> +         */
> +        LWLockRelease(ReplicationSlotControlLock);
> +        ReplicationSlotDropAcquired();
> +        goto restart;
> +    }
> +    LWLockRelease(ReplicationSlotControlLock);
> +
> +    /* recompute limits once after all slots are dropped */
> +    ReplicationSlotsComputeRequiredXmin(false);
> +    ReplicationSlotsComputeRequiredLSN();

I was concerned for a second that we'd skip doing
ReplicationSlotsComputeRequired* if we ERROR out above - but
ReplicationSlotDropAcquired already calls these as necessary. I.e. they
should be dropped from here.


- Andres

Re: Logical decoding on standby

From

Craig Ringer

Date:

29 March 2017, 03:01:34

On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:

>> --- a/doc/src/sgml/protocol.sgml
>> +++ b/doc/src/sgml/protocol.sgml
>> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
>>       <para>
>>        Drops a replication slot, freeing any reserved server-side resources. If
>>        the slot is currently in use by an active connection, this command fails.
>> +      If the slot is a logical slot that was created in a database other than
>> +      the database the walsender is connected to, this command fails.
>>       </para>
>>       <variablelist>
>>        <varlistentry>
>
> Shouldn't the docs in the drop database section about this?

DROP DATABASE doesn't really discuss all the resources it drops, but
I'm happy to add mention of replication slots handling.

I just notice that I failed to remove the docs changes regarding
dropping slots becoming db-specific, so I'll post a follow-up for that
in a sec.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

29 March 2017, 03:11:26

On 29 March 2017 at 08:01, Craig Ringer <craig@2ndquadrant.com> wrote:

> I just notice that I failed to remove the docs changes regarding
> dropping slots becoming db-specific, so I'll post a follow-up for that
> in a sec.

Attached.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

fix-slot-drop-docs.patch

Re: Logical decoding on standby

From

Craig Ringer

Date:

29 March 2017, 11:44:19

On 29 March 2017 at 08:11, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 29 March 2017 at 08:01, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> I just notice that I failed to remove the docs changes regarding
>> dropping slots becoming db-specific, so I'll post a follow-up for that
>> in a sec.
>
> Attached.

... and here's the next in the patch series. Both this and the
immediately prior minor patch fix-drop-slot-docs.patch are pending
now.

Notable changes in this patch since review:

* Split oldestCatalogXmin tracking into separate patch

* Critically, fix use of procArray->replication_slot_catalog_xmin in
GetSnapshotData's setting of RecentGlobalXmin and RecentGlobalDataXmin
so it instead uses ShmemVariableCache->oldestCatalogXmin . This
could've led to tuples newer than oldestCatalogXmin being removed.

* Memory barrier in UpdateOldestCatalogXmin and SetOldestCatalogXmin.
It still does a pre-check before deciding if it needs to take
ProcArrayLock, recheck, and advance, since we don't want to
unnecessarily contest ProcArrayLock.

* Remove unnecessary volatile usage (retained in
UpdateOldestCatalogXmin due to barrier)

* Remove unnecessary test for XLogInsertAllowed() in XactLogCatalogXminUpdate

* EnsureActiveLogicalSlotValid(void)  - add (void)

* pgidented changes in this diff; have left unrelated changes alone

Re:

> what does
>
> +       TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
> +                                                                         * is guaranteed to still exist */
>
> mean?  I complained about the overall justification in the commit
> already, but looking at this commit alone, the justification for this
> part of the change is quite hard to understand.

The patch now contains

    TransactionId oldestCatalogXmin; /* oldest xid it is guaranteed to be safe
                                      * to create a historic snapshot for; see
                                      * also
                                      * procArray->replication_slot_catalog_xmin
                                      * */

which I think is an improvement.

I've also sought to explain the purpose of this change better with

/*
 * If necessary, copy the current catalog_xmin needed by replication slots to
 * the effective catalog_xmin used for dead tuple removal and write a WAL
 * record recording the change.
 *
 * This allows standbys to know the oldest xid for which it is safe to create
 * a historic snapshot for logical decoding. VACUUM or other cleanup may have
 * removed catalog tuple versions needed to correctly decode transactions older
 * than this threshold. Standbys can use this information to cancel conflicting
 * decoding sessions and invalidate slots that need discarded information.
 *
 * (We can't use the transaction IDs in WAL records emitted by VACUUM etc for
 * this, since they don't identify the relation as a catalog or not.  Nor can a
 * standby look up the relcache to get the Relation for the affected
 * relfilenode to check if it is a catalog. The standby would also have no way
 * to know the oldest safe position at startup if it wasn't in the control
 * file.)
 */
void
UpdateOldestCatalogXmin(void)
{
...

Does that help?

(Sidenote for later: ResolveRecoveryConflictWithLogicalDecoding will
need a read barrier too, when the next patch adds it.)

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

log-catalog-xmin-advances-v2.patch

Re: Logical decoding on standby

From

Craig Ringer

Date:

29 March 2017, 12:17:07

On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:

> * Split oldestCatalogXmin tracking into separate patch

Regarding this, Simon raised concerns about xlog volume here.

It's pretty negligible.

We only write a new record when a vacuum runs after catalog_xmin
advances on the slot with the currently-lowest catalog_xmin (or, if
vacuum doesn't run reasonably soon, when the bgworker next looks).

So at worst on a fairly slow moving system or one with a super high
vacuum rate we'll write one per commit. But in most cases we'll write
a lot fewer than that. When running t/006_logical_decoding.pl for
example:

$ ../../../src/bin/pg_waldump/pg_waldump
tmp_check/data_master_daPa/pgdata/pg_wal/000000010000000000000001  |
grep CATALOG
rmgr: Transaction len (rec/tot):      4/    30, tx:          0, lsn:
0/01648D50, prev 0/01648D18, desc: CATALOG_XMIN catalog_xmin 555
rmgr: Transaction len (rec/tot):      4/    30, tx:          0, lsn:
0/0164C840, prev 0/0164C378, desc: CATALOG_XMIN catalog_xmin 0
pg_waldump: FATAL:  error in WAL record at 0/16BBF10: invalid record
length at 0/16BBF88: wanted 24, got 0


and of course, none at all unless you use logical decoding.



-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Simon Riggs

Date:

29 March 2017, 18:13:35

On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> * Split oldestCatalogXmin tracking into separate patch
>
> Regarding this, Simon raised concerns about xlog volume here.
>
> It's pretty negligible.
>
> We only write a new record when a vacuum runs after catalog_xmin
> advances on the slot with the currently-lowest catalog_xmin (or, if
> vacuum doesn't run reasonably soon, when the bgworker next looks).

I'd prefer to slow things down a little, not be so eager.

If we hold back update of the catalog_xmin until when we run
GetRunningTransactionData() we wouldn't need to produce any WAL
records at all AND we wouldn't need to have VACUUM do
UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra
task.

That would also make this patch about half the length it is.

Let me know what you think.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

30 March 2017, 06:34:01

On 29 March 2017 at 23:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
> On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>>> * Split oldestCatalogXmin tracking into separate patch
>>
>> Regarding this, Simon raised concerns about xlog volume here.
>>
>> It's pretty negligible.
>>
>> We only write a new record when a vacuum runs after catalog_xmin
>> advances on the slot with the currently-lowest catalog_xmin (or, if
>> vacuum doesn't run reasonably soon, when the bgworker next looks).
>
> I'd prefer to slow things down a little, not be so eager.
>
> If we hold back update of the catalog_xmin until when we run
> GetRunningTransactionData() we wouldn't need to produce any WAL
> records at all AND we wouldn't need to have VACUUM do
> UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra
> task.
>
> That would also make this patch about half the length it is.
>
> Let me know what you think.

Good idea.

We can always add a heuristic later to make xl_running_xacts get
emitted more often at high transaction rates if it's necessary.

Patch coming soon.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

30 March 2017, 11:07:05

On 30 March 2017 at 11:34, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 29 March 2017 at 23:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
>> On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>>>
>>>> * Split oldestCatalogXmin tracking into separate patch
>>>
>>> Regarding this, Simon raised concerns about xlog volume here.
>>>
>>> It's pretty negligible.
>>>
>>> We only write a new record when a vacuum runs after catalog_xmin
>>> advances on the slot with the currently-lowest catalog_xmin (or, if
>>> vacuum doesn't run reasonably soon, when the bgworker next looks).
>>
>> I'd prefer to slow things down a little, not be so eager.
>>
>> If we hold back update of the catalog_xmin until when we run
>> GetRunningTransactionData() we wouldn't need to produce any WAL
>> records at all AND we wouldn't need to have VACUUM do
>> UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra
>> task.
>>
>> That would also make this patch about half the length it is.
>>
>> Let me know what you think.
>
> Good idea.
>
> We can always add a heuristic later to make xl_running_xacts get
> emitted more often at high transaction rates if it's necessary.
>
> Patch coming soon.

Attached.

A bit fiddlier than expected, but I like the result more.

In the process I identified an issue with both the prior patch and
this one where we don't check slot validity for slots that existed on
standby prior to promotion of standby to master. We were just assuming
that being the master was good enough, since it controls
replication_slot_catalog_xmin, but that's not true for pre-existing
slots.

Fixed by forcing update of the persistent safe catalog xmin after the
first slot is created on the master - which is now done by doing an
immediate LogStandbySnapshot() after assigning the slot's
catalog_xmin.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

log-catalog-xmin-advances-v3.patch

Re: Logical decoding on standby

From

Simon Riggs

Date:

30 March 2017, 17:26:02

On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote:

> Attached.

* Cleaned up in 3 places
* Added code for faked up RunningTransactions in xlog.c
* Ensure catalog_xmin doesn't go backwards

All else looks good. Comments before commit?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

log-catalog-xmin-advances-v4.patch

Re: Logical decoding on standby

From

Andres Freund

Date:

30 March 2017, 17:27:52

On 2017-03-30 15:26:02 +0100, Simon Riggs wrote:
> On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote:
> 
> > Attached.
> 
> * Cleaned up in 3 places
> * Added code for faked up RunningTransactions in xlog.c
> * Ensure catalog_xmin doesn't go backwards
> 
> All else looks good. Comments before commit?

Can you give me till after lunch?

- Andres

Re: Logical decoding on standby

From

Simon Riggs

Date:

30 March 2017, 17:30:56

On 30 March 2017 at 15:27, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-30 15:26:02 +0100, Simon Riggs wrote:
>> On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>> > Attached.
>>
>> * Cleaned up in 3 places
>> * Added code for faked up RunningTransactions in xlog.c
>> * Ensure catalog_xmin doesn't go backwards
>>
>> All else looks good. Comments before commit?
>
> Can you give me till after lunch?

Sure, np

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical decoding on standby

From

Andres Freund

Date:

30 March 2017, 20:16:07

> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>          SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>  
>          /*
> +         * There can be no concurrent writers to oldestCatalogXmin during
> +         * recovery, so no need to take ProcArrayLock.
> +         */
> +        ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;

s/writers/writes/?

> @@ -9731,6 +9748,15 @@ xlog_redo(XLogReaderState *record)
>                                    checkPoint.oldestXid))
>              SetTransactionIdLimit(checkPoint.oldestXid,
>                                    checkPoint.oldestXidDB);
> +
> +        /*
> +         * There can be no concurrent writers to oldestCatalogXmin during
> +         * recovery, so no need to take ProcArrayLock.
> +         */
> +        if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
> +                                    checkPoint.oldestCatalogXmin)
> +            ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;

dito.



> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>  
>      ReplicationSlotsComputeRequiredXmin(true);
>  
> +    /*
> +     * If this is the first slot created on the master we won't have a
> +     * persistent record of the oldest safe xid for historic snapshots yet.
> +     * Force one to be recorded so that when we go to replay from this slot we
> +     * know it's safe.
> +     */
> +    force_standby_snapshot =
> +        !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);

s/first slot/first logical slot/. Also, the reference to master doesn't
seem right.


>      LWLockRelease(ProcArrayLock);
>  
> +    /* Update ShmemVariableCache->oldestCatalogXmin */
> +    if (force_standby_snapshot)
> +        LogStandbySnapshot();

The comment and code don't quite square to me - it's far from obvious
that LogStandbySnapshot does something like that. I'd even say it's a
bad idea to have it do that.


>      /*
>       * tell the snapshot builder to only assemble snapshot once reaching the
>       * running_xact's record with the respective xmin.
> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>          start_lsn = slot->data.confirmed_flush;
>      }
>  
> +    EnsureActiveLogicalSlotValid();

This seems like it should be in a separate patch, and seperately
reviewed. It's code that's currently unreachable (and thus untestable).


> +/*
> + * Test to see if the active logical slot is usable.
> + */
> +static void
> +EnsureActiveLogicalSlotValid(void)
> +{
> +    TransactionId shmem_catalog_xmin;
> +
> +    Assert(MyReplicationSlot != NULL);
> +
> +    /*
> +     * A logical slot can become unusable if we're doing logical decoding on a
> +     * standby or using a slot created before we were promoted from standby
> +     * to master.

Neither of those is currently possible...


> If the master advanced its global catalog_xmin past the
> +     * threshold we need it could've removed catalog tuple versions that
> +     * we'll require to start decoding at our restart_lsn.
> +     *
> +     * We need a barrier so that if we decode in recovery on a standby we
> +     * don't allow new decoding sessions to start after redo has advanced
> +     * the threshold.
> +     */
> +    if (RecoveryInProgress())
> +        pg_memory_barrier();

I don't think this is a meaningful locking protocol.  It's a bad idea to
use lock-free programming without need, especially when the concurrency
protocol isn't well defined.  With what other barrier does this pair
with?  What prevents the data being out of date by the time we actually
do the check?


> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
> index cfc3fba..cdc5f95 100644
> --- a/src/backend/replication/walsender.c
> +++ b/src/backend/replication/walsender.c
> @@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
>       * be energy wasted - the worst lost information can do here is give us
>       * wrong information in a statistics view - we'll just potentially be more
>       * conservative in removing files.
> +     *
> +     * We don't have to do any effective_xmin / effective_catalog_xmin testing
> +     * here either, like for LogicalConfirmReceivedLocation. If we received
> +     * the xmin and catalog_xmin from downstream replication slots we know they
> +     * were already confirmed there,
>       */
>  }

This comment reads as if LogicalConfirmReceivedLocation had
justification for not touching / checking catalog_xmin. But it does.



>      /*
> +     * Update our knowledge of the oldest xid we can safely create historic
> +     * snapshots for.
> +     *
> +     * There can be no concurrent writers to oldestCatalogXmin during
> +     * recovery, so no need to take ProcArrayLock.

By now I think is pretty flawed logic, because there can be concurrent
readers, that need to be safe against oldestCatalogXmin advancing
concurrently.


>      /*
> -     * It's important *not* to include the limits set by slots here because
> +     * It's important *not* to include the xmin set by slots here because
>       * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>       * were to be included here the initial value could never increase because
>       * of a circular dependency where slots only increase their limits when
>       * running xacts increases oldestRunningXid and running xacts only
>       * increases if slots do.
> +     *
> +     * We can include the catalog_xmin limit here; there's no similar
> +     * circularity, and we need it to log xl_running_xacts records for
> +     * standbys.
>       */

Those comments seem to need some more heavyhanded reconciliation.

>   *
>   * Return the current slot xmin limits. That's useful to be able to remove
>   * data that's older than those limits.
> + *
> + * For logical replication slots' catalog_xmin, we return both the effective

This seems to need some light editing.


>  /*
>   * Record an enhanced snapshot of running transactions into WAL.
>   *
> + * We also record the value of procArray->replication_slot_catalog_xmin
> + * obtained from GetRunningTransactionData here, so standbys know we're about
> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
> + * removing dead catalog tuples below that threshold.

I think needs some rephrasing. We're not necessarily about to remove
catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.

> +static void
> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
> +{
> +    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> +    if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
> +        || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) !=
TransactionIdIsValid(pendingOldestCatalogXmin)))
> +        ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
> +    LWLockRelease(ProcArrayLock);
> +}

Doing TransactionIdPrecedes before ensuring
ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
me as a good idea.  Generally, the expression as it stands is hard to
understand.

> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
> index d25a2dd..a4ecfb7 100644
> --- a/src/include/access/transam.h
> +++ b/src/include/access/transam.h
> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>                                           * aborted */
>  
>      /*
> +     * This field is protected by ProcArrayLock except
> +     * during recovery, when it's set unlocked.
> +     *
> +     * oldestCatalogXmin is the oldest xid it is
> +     * guaranteed to be safe to create a historic
> +     * snapshot for. See also
> +     * procArray->replication_slot_catalog_xmin
> +     */
> +    TransactionId oldestCatalogXmin;

Maybe it'd be better to rephrase that do something like
"oldestCatalogXmin guarantees that no valid catalog tuples >= than it
are removed. That property is used for logical decoding.". or similar?

>  /*
>   * Each page of XLOG file has a header like this:
>   */
> -#define XLOG_PAGE_MAGIC 0xD097    /* can be used as WAL version indicator */
> +#define XLOG_PAGE_MAGIC 0xD100    /* can be used as WAL version indicator */

We normally only advance this by one, it's not tied to the poistgres version.


I'm sorry, but to me this patch isn't ready.  I'm also doubtful that it
makes a whole lot of sense on its own, without having finished the
design for decoding on a standby - it seems quite likely that we'll have
to redesign the mechanisms in here a bit for that.  For 10 this seems to
do nothing but add overhead?

Greetings,

Andres Freund

Re: Logical decoding on standby

From

Andres Freund

Date:

30 March 2017, 20:16:53

On 2017-03-29 08:01:34 +0800, Craig Ringer wrote:
> On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:
> 
> >> --- a/doc/src/sgml/protocol.sgml
> >> +++ b/doc/src/sgml/protocol.sgml
> >> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
> >>       <para>
> >>        Drops a replication slot, freeing any reserved server-side resources. If
> >>        the slot is currently in use by an active connection, this command fails.
> >> +      If the slot is a logical slot that was created in a database other than
> >> +      the database the walsender is connected to, this command fails.
> >>       </para>
> >>       <variablelist>
> >>        <varlistentry>
> >
> > Shouldn't the docs in the drop database section about this?
> 
> DROP DATABASE doesn't really discuss all the resources it drops, but
> I'm happy to add mention of replication slots handling.

I don't think that's really comparable, because the other things aren't
global objects, which replication slots are.

- Andres

Re: Logical decoding on standby

From

Simon Riggs

Date:

30 March 2017, 21:40:08

On 30 March 2017 at 18:16, Andres Freund <andres@anarazel.de> wrote:

>>  /*
>>   * Each page of XLOG file has a header like this:
>>   */
>> -#define XLOG_PAGE_MAGIC 0xD097       /* can be used as WAL version indicator */
>> +#define XLOG_PAGE_MAGIC 0xD100       /* can be used as WAL version indicator */
>
> We normally only advance this by one, it's not tied to the poistgres version.

That was my addition. I rounded it up cos this is release 10. No biggie.

(Poistgres? Is that the Manhattan spelling?)

> I'm sorry, but to me this patch isn't ready.  I'm also doubtful that it
> makes a whole lot of sense on its own, without having finished the
> design for decoding on a standby - it seems quite likely that we'll have
> to redesign the mechanisms in here a bit for that.  For 10 this seems to
> do nothing but add overhead?

I'm sure we can reword the comments.

We've been redesigning the mechanisms for 2 years now, so it seems
unlikely that further redesign can be required. If it is required,
this patch is fairly low touch and change is possible later,
incremental development etc. As regards overhead, this adds a small
amount of time to a background process executed every 10 secs,
generates no new WAL records.

So I don't see any reason not to commit this feature, after the minor
corrections.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Logical decoding on standby

From

Andres Freund

Date:

30 March 2017, 23:49:58

On 2017-03-30 19:40:08 +0100, Simon Riggs wrote:
> On 30 March 2017 at 18:16, Andres Freund <andres@anarazel.de> wrote:
> 
> >>  /*
> >>   * Each page of XLOG file has a header like this:
> >>   */
> >> -#define XLOG_PAGE_MAGIC 0xD097       /* can be used as WAL version indicator */
> >> +#define XLOG_PAGE_MAGIC 0xD100       /* can be used as WAL version indicator */
> >
> > We normally only advance this by one, it's not tied to the poistgres version.
> 
> That was my addition. I rounded it up cos this is release 10. No biggie.

We'll probably upgrade that more than once again this release...

> (Poistgres? Is that the Manhattan spelling?)

Tiredness spelling ;)

> We've been redesigning the mechanisms for 2 years now, so it seems
> unlikely that further redesign can be required.

I don't think that's true *at all* - the mechanism previously
fundamentally different.

The whole topic has largely seen activity shortly before the code
freeze, both last time round and now.  I don't think it's surprising
that it thus doesn't end up being ready.

> If it is required,
> this patch is fairly low touch and change is possible later,
> incremental development etc. As regards overhead, this adds a small
> amount of time to a background process executed every 10 secs,
> generates no new WAL records.
> 
> So I don't see any reason not to commit this feature, after the minor
> corrections.

It doesn't have any benefit on its own, the locking model doesn't seem
fully there.  I don't see much reason to get this in before the release.

- Andres

Re: Logical decoding on standby

From

Craig Ringer

Date:

31 March 2017, 06:52:39

On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-29 08:01:34 +0800, Craig Ringer wrote:
>> On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:
>>
>> >> --- a/doc/src/sgml/protocol.sgml
>> >> +++ b/doc/src/sgml/protocol.sgml
>> >> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
>> >>       <para>
>> >>        Drops a replication slot, freeing any reserved server-side resources. If
>> >>        the slot is currently in use by an active connection, this command fails.
>> >> +      If the slot is a logical slot that was created in a database other than
>> >> +      the database the walsender is connected to, this command fails.
>> >>       </para>
>> >>       <variablelist>
>> >>        <varlistentry>
>> >
>> > Shouldn't the docs in the drop database section about this?
>>
>> DROP DATABASE doesn't really discuss all the resources it drops, but
>> I'm happy to add mention of replication slots handling.
>
> I don't think that's really comparable, because the other things aren't
> global objects, which replication slots are.

Fine by me.

Patch fix-slot-drop-docs.patch, upthread, adds the passage

+
+  <para>
+   Active <link linkend="logicaldecoding-replication-slots">logical
+   replication slots</> count as connections and will prevent a
+   database from being dropped. Inactive slots will be automatically
+   dropped when the database is dropped.
+  </para>

to the notes section of the DROP DATABASE docs; that should do the
trick. I'm not convinced it's worth going into the exceedingly
unlikely race with concurrent slot drop, and we don't seem to in other
places in the docs, like the race you mentioned with connecting to a
db as it's being dropped.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

31 March 2017, 07:49:59

On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:
>> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>>               SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>>
>>               /*
>> +              * There can be no concurrent writers to oldestCatalogXmin during
>> +              * recovery, so no need to take ProcArrayLock.
>> +              */
>> +             ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
>
> s/writers/writes/?

I meant writers, i.e. nothing else can be writing to it. But writes works too.


>> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>>
>>       ReplicationSlotsComputeRequiredXmin(true);
>>
>> +     /*
>> +      * If this is the first slot created on the master we won't have a
>> +      * persistent record of the oldest safe xid for historic snapshots yet.
>> +      * Force one to be recorded so that when we go to replay from this slot we
>> +      * know it's safe.
>> +      */
>> +     force_standby_snapshot =
>> +             !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
>
> s/first slot/first logical slot/. Also, the reference to master doesn't
> seem right.

Unsure what you mean re reference to master not seeming right.

If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
from the new slot so we need to make sure it gets advanced one we've
decided on our starting catalog_xmin.

>>       LWLockRelease(ProcArrayLock);
>>
>> +     /* Update ShmemVariableCache->oldestCatalogXmin */
>> +     if (force_standby_snapshot)
>> +             LogStandbySnapshot();
>
> The comment and code don't quite square to me - it's far from obvious
> that LogStandbySnapshot does something like that. I'd even say it's a
> bad idea to have it do that.

So you prefer the prior approach with separate xl_catalog_xmin advance records?

I don't have much preference; I liked the small code reduction of
Simon's proposed approach, but it landed up being a bit awkward in
terms of ordering and locking. I don't think catalog_xmin tracking is
really closely related to the standby snapshot stuff and it feels a
bit like it's a tacked-on afterthought where it is now.

>>       /*
>>        * tell the snapshot builder to only assemble snapshot once reaching the
>>        * running_xact's record with the respective xmin.
>> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>>               start_lsn = slot->data.confirmed_flush;
>>       }
>>
>> +     EnsureActiveLogicalSlotValid();
>
> This seems like it should be in a separate patch, and seperately
> reviewed. It's code that's currently unreachable (and thus untestable).

It is reached and is run, those checks run whenever creating a
non-initial decoding context on master or replica.

The failure case is reachable if a replica has hot_standby_feedback
off or it's not using a physical slot and loses its connection. If
promoted, any slot existing on that replica (from a file system level
copy when the replica was created) will fail. I agree it's contrived
since we can't create and maintain slots on replicas, which is the
main point of it.


>> +/*
>> + * Test to see if the active logical slot is usable.
>> + */
>> +static void
>> +EnsureActiveLogicalSlotValid(void)
>> +{
>> +     TransactionId shmem_catalog_xmin;
>> +
>> +     Assert(MyReplicationSlot != NULL);
>> +
>> +     /*
>> +      * A logical slot can become unusable if we're doing logical decoding on a
>> +      * standby or using a slot created before we were promoted from standby
>> +      * to master.
>
> Neither of those is currently possible...

Right. Because it's foundations for decoding on standby.

>> If the master advanced its global catalog_xmin past the
>> +      * threshold we need it could've removed catalog tuple versions that
>> +      * we'll require to start decoding at our restart_lsn.
>> +      *
>> +      * We need a barrier so that if we decode in recovery on a standby we
>> +      * don't allow new decoding sessions to start after redo has advanced
>> +      * the threshold.
>> +      */
>> +     if (RecoveryInProgress())
>> +             pg_memory_barrier();
>
> I don't think this is a meaningful locking protocol.  It's a bad idea to
> use lock-free programming without need, especially when the concurrency
> protocol isn't well defined.

Yeah. The intended interaction is with recovery conflict on standby
which doesn't look likely to land in this release due to patch
split/cleanup etc. (Not a complaint).

Better to just take a brief shared ProcArrayLock.

>> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
>> index cfc3fba..cdc5f95 100644
>> --- a/src/backend/replication/walsender.c
>> +++ b/src/backend/replication/walsender.c
>> @@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
>>        * be energy wasted - the worst lost information can do here is give us
>>        * wrong information in a statistics view - we'll just potentially be more
>>        * conservative in removing files.
>> +      *
>> +      * We don't have to do any effective_xmin / effective_catalog_xmin testing
>> +      * here either, like for LogicalConfirmReceivedLocation. If we received
>> +      * the xmin and catalog_xmin from downstream replication slots we know they
>> +      * were already confirmed there,
>>        */
>>  }
>
> This comment reads as if LogicalConfirmReceivedLocation had
> justification for not touching / checking catalog_xmin. But it does.

It touches it, what it doesn't do is test and only advance if the new
value is greater, like for xmin as referenced in the prior par. Will
clarify.

>>       /*
>> +      * Update our knowledge of the oldest xid we can safely create historic
>> +      * snapshots for.
>> +      *
>> +      * There can be no concurrent writers to oldestCatalogXmin during
>> +      * recovery, so no need to take ProcArrayLock.
>
> By now I think is pretty flawed logic, because there can be concurrent
> readers, that need to be safe against oldestCatalogXmin advancing
> concurrently.

You're right, we'll need a lock or suitable barriers here to ensure
that slot conflict with recovery and startup of new decoding sessions
doesn't see outdated values. This would be the peer of the
pg_memory_barrier() above. Or could just take a lock; there's enough
other locking activity in redo that it should be fine.

>>       /*
>> -      * It's important *not* to include the limits set by slots here because
>> +      * It's important *not* to include the xmin set by slots here because
>>        * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>>        * were to be included here the initial value could never increase because
>>        * of a circular dependency where slots only increase their limits when
>>        * running xacts increases oldestRunningXid and running xacts only
>>        * increases if slots do.
>> +      *
>> +      * We can include the catalog_xmin limit here; there's no similar
>> +      * circularity, and we need it to log xl_running_xacts records for
>> +      * standbys.
>>        */
>
> Those comments seem to need some more heavyhanded reconciliation.

OK. To me it seems clear that the first refers to xmin, the second to
catalog_xmin. But after all I wrote it, and the important thing is
what it says to people who are not me. Will adjust.

>>   *
>>   * Return the current slot xmin limits. That's useful to be able to remove
>>   * data that's older than those limits.
>> + *
>> + * For logical replication slots' catalog_xmin, we return both the effective
>
> This seems to need some light editing.

catalog_xmin => catalog_xmins I guess.

>>  /*
>>   * Record an enhanced snapshot of running transactions into WAL.
>>   *
>> + * We also record the value of procArray->replication_slot_catalog_xmin
>> + * obtained from GetRunningTransactionData here, so standbys know we're about
>> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
>> + * removing dead catalog tuples below that threshold.
>
> I think needs some rephrasing. We're not necessarily about to remove
> catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.

Agreed

>> +static void
>> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
>> +{
>> +     LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>> +     if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
>> +             || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) !=
TransactionIdIsValid(pendingOldestCatalogXmin)))
>> +             ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
>> +     LWLockRelease(ProcArrayLock);
>> +}
>
> Doing TransactionIdPrecedes before ensuring
> ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
> me as a good idea.  Generally, the expression as it stands is hard to
> understand.

OK.

I found other formulations to be long and hard to read. Expressing it
as "if validity has changed or value has increased" made more sense.
Agree order should change.

>> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
>> index d25a2dd..a4ecfb7 100644
>> --- a/src/include/access/transam.h
>> +++ b/src/include/access/transam.h
>> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>>                                                                                * aborted */
>>
>>       /*
>> +      * This field is protected by ProcArrayLock except
>> +      * during recovery, when it's set unlocked.
>> +      *
>> +      * oldestCatalogXmin is the oldest xid it is
>> +      * guaranteed to be safe to create a historic
>> +      * snapshot for. See also
>> +      * procArray->replication_slot_catalog_xmin
>> +      */
>> +     TransactionId oldestCatalogXmin;
>
> Maybe it'd be better to rephrase that do something like
> "oldestCatalogXmin guarantees that no valid catalog tuples >= than it
> are removed. That property is used for logical decoding.". or similar?

Fine by me.

I'll adjust this per discussion and per a comment Simon made
separately. Whether we use it right away or not it's worth having it
updated while it's still freshly in mind.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

31 March 2017, 08:38:19

On 31 March 2017 at 12:49, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:

>> The comment and code don't quite square to me - it's far from obvious
>> that LogStandbySnapshot does something like that. I'd even say it's a
>> bad idea to have it do that.
>
> So you prefer the prior approach with separate xl_catalog_xmin advance records?

Alternately, we can record the creation timeline on slots, so we know
if there's been a promotion. It wouldn't make sense to do this if that
were the only use of timelines on slots. But I'm aware you'd rather
keep slots timeline-agnostic and I tend to agree.

Anyway, per your advice will split out the validation step.

(I'd like replication origins to be able to track time alongside lsn,
and to pass the timeline of each LSN to output plugin callbacks so we
can detect if a physical promotion results in us backtracking down a
fork in history, but that doesn't affect slots.)

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

03 April 2017, 08:46:08

On 31 March 2017 at 12:49, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:
>>> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>>>               SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>>>
>>>               /*
>>> +              * There can be no concurrent writers to oldestCatalogXmin during
>>> +              * recovery, so no need to take ProcArrayLock.
>>> +              */
>>> +             ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
>>
>> s/writers/writes/?
>
> I meant writers, i.e. nothing else can be writing to it. But writes works too.
>

Fixed.

>>> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>>>
>>>       ReplicationSlotsComputeRequiredXmin(true);
>>>
>>> +     /*
>>> +      * If this is the first slot created on the master we won't have a
>>> +      * persistent record of the oldest safe xid for historic snapshots yet.
>>> +      * Force one to be recorded so that when we go to replay from this slot we
>>> +      * know it's safe.
>>> +      */
>>> +     force_standby_snapshot =
>>> +             !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
>>
>> s/first slot/first logical slot/. Also, the reference to master doesn't
>> seem right.
>
> Unsure what you mean re reference to master not seeming right.
>
> If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
> from the new slot so we need to make sure it gets advanced one we've
> decided on our starting catalog_xmin.

Moved to next patch, will address there.

>>>       LWLockRelease(ProcArrayLock);
>>>
>>> +     /* Update ShmemVariableCache->oldestCatalogXmin */
>>> +     if (force_standby_snapshot)
>>> +             LogStandbySnapshot();
>>
>> The comment and code don't quite square to me - it's far from obvious
>> that LogStandbySnapshot does something like that. I'd even say it's a
>> bad idea to have it do that.
>
> So you prefer the prior approach with separate xl_catalog_xmin advance records?
>
> I don't have much preference; I liked the small code reduction of
> Simon's proposed approach, but it landed up being a bit awkward in
> terms of ordering and locking. I don't think catalog_xmin tracking is
> really closely related to the standby snapshot stuff and it feels a
> bit like it's a tacked-on afterthought where it is now.

This code moved to next patch. But we do need to agree on the best approach.

If we're not going to force a standby snapshot here, then it's
probably better to use the separate xl_catalog_xmin design instead.

>>>       /*
>>>        * tell the snapshot builder to only assemble snapshot once reaching the
>>>        * running_xact's record with the respective xmin.
>>> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>>>               start_lsn = slot->data.confirmed_flush;
>>>       }
>>>
>>> +     EnsureActiveLogicalSlotValid();
>>
>> This seems like it should be in a separate patch, and seperately
>> reviewed. It's code that's currently unreachable (and thus untestable).
>
> It is reached and is run, those checks run whenever creating a
> non-initial decoding context on master or replica.

Again, moved to next patch.

>>>       /*
>>> +      * Update our knowledge of the oldest xid we can safely create historic
>>> +      * snapshots for.
>>> +      *
>>> +      * There can be no concurrent writers to oldestCatalogXmin during
>>> +      * recovery, so no need to take ProcArrayLock.
>>
>> By now I think is pretty flawed logic, because there can be concurrent
>> readers, that need to be safe against oldestCatalogXmin advancing
>> concurrently.
>
> You're right, we'll need a lock or suitable barriers here to ensure
> that slot conflict with recovery and startup of new decoding sessions
> doesn't see outdated values. This would be the peer of the
> pg_memory_barrier() above. Or could just take a lock; there's enough
> other locking activity in redo that it should be fine.

Now takes ProcArrayLock briefly.

oldestCatalogXmin is also used in GetOldestSafeDecodingTransactionId,
and there we want to prevent it from being advanced. But on further
thought, relying on oldestCatalogXmin there is actually unsafe; on the
master, we might've already logged our intent to advance it to some
greater value of procArray->replication_slot_catalog_xmin and will do
so as ProcArrayLock is released. On standby we're also better off
relying on procArray->replication_slot_catalog_xmin since that's what
we'll be sending in feedback.

Went back to using replication_slot_catalog_xmin here, with comment

     *
     * We don't use ShmemVariableCache->oldestCatalogXmin here because another
     * backend may have already logged its intention to advance it to a higher
     * value (still <= replication_slot_catalog_xmin) and just be waiting on
     * ProcArrayLock to actually apply the change. On a standby
     * replication_slot_catalog_xmin is what the walreceiver will be sending in
     * hot_standby_feedback, not oldestCatalogXmin.
     */


>>>       /*
>>> -      * It's important *not* to include the limits set by slots here because
>>> +      * It's important *not* to include the xmin set by slots here because
>>>        * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>>>        * were to be included here the initial value could never increase because
>>>        * of a circular dependency where slots only increase their limits when
>>>        * running xacts increases oldestRunningXid and running xacts only
>>>        * increases if slots do.
>>> +      *
>>> +      * We can include the catalog_xmin limit here; there's no similar
>>> +      * circularity, and we need it to log xl_running_xacts records for
>>> +      * standbys.
>>>        */
>>
>> Those comments seem to need some more heavyhanded reconciliation.
>
> OK. To me it seems clear that the first refers to xmin, the second to
> catalog_xmin. But after all I wrote it, and the important thing is
> what it says to people who are not me. Will adjust.

Changed to

     * We can safely report the catalog_xmin limit for replication slots here
     * because it's only used to advance oldestCatalogXmin. Slots'
     * catalog_xmin advance does not depend on it so there's no circularity.



>
>>>   *
>>>   * Return the current slot xmin limits. That's useful to be able to remove
>>>   * data that's older than those limits.
>>> + *
>>> + * For logical replication slots' catalog_xmin, we return both the effective
>>
>> This seems to need some light editing.
>
> catalog_xmin => catalog_xmins I guess.

Amended.

>>>  /*
>>>   * Record an enhanced snapshot of running transactions into WAL.
>>>   *
>>> + * We also record the value of procArray->replication_slot_catalog_xmin
>>> + * obtained from GetRunningTransactionData here, so standbys know we're about
>>> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
>>> + * removing dead catalog tuples below that threshold.
>>
>> I think needs some rephrasing. We're not necessarily about to remove
>> catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.
>
> Agreed

 * We also record the value of procArray->replication_slot_catalog_xmin
 * obtained from GetRunningTransactionData here. We intend to advance
 * ShmemVariableCache->oldestCatalogXmin to it once standbys have been informed
 * of the new value, which will permit removal of previously-protected dead
 * catalog tuples. The standby needs to know about that before any WAL
 * records containing such tuple removals could possibly arrive.


>>> +static void
>>> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
>>> +{
>>> +     LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>>> +     if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
>>> +             || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) !=
TransactionIdIsValid(pendingOldestCatalogXmin)))
>>> +             ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
>>> +     LWLockRelease(ProcArrayLock);
>>> +}
>>
>> Doing TransactionIdPrecedes before ensuring
>> ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
>> me as a good idea.  Generally, the expression as it stands is hard to
>> understand.
>
> OK.
>
> I found other formulations to be long and hard to read. Expressing it
> as "if validity has changed or value has increased" made more sense.
> Agree order should change.

Re-ordered, otherwise left the same.

Could add a comment like

"we must set oldestCatalogXmin if its validity has changed or it is advancing"

but seems rather redundant to the code.


>>> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
>>> index d25a2dd..a4ecfb7 100644
>>> --- a/src/include/access/transam.h
>>> +++ b/src/include/access/transam.h
>>> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>>>                                                                                * aborted */
>>>
>>>       /*
>>> +      * This field is protected by ProcArrayLock except
>>> +      * during recovery, when it's set unlocked.
>>> +      *
>>> +      * oldestCatalogXmin is the oldest xid it is
>>> +      * guaranteed to be safe to create a historic
>>> +      * snapshot for. See also
>>> +      * procArray->replication_slot_catalog_xmin
>>> +      */
>>> +     TransactionId oldestCatalogXmin;
>>
>> Maybe it'd be better to rephrase that do something like
>> "oldestCatalogXmin guarantees that no valid catalog tuples >= than it
>> are removed. That property is used for logical decoding.". or similar?
>
> Fine by me.
>
> I'll adjust this per discussion and per a comment Simon made
> separately. Whether we use it right away or not it's worth having it
> updated while it's still freshly in mind.

OK, updated catalog_xmin logging patch attached.

Important fix included: when faking up a RunningTransactions snapshot
in StartupXLOG for replay of shutdown checkpoints, copy the
checkpoint's oldestCatalogXmin so we apply it instead of clobbering
the replica's value. It's kind of roundabout to set this once when we
apply the checkpoint and again via ProcArrayApplyRecoveryInfo, but
it's necessary if we're using xl_running_xacts to carry
oldestCatalogXmin info.

Found another issue too. We log our intention to increase
oldestCatalogXmin in LogStandbySnapshot when we write
xl_running_xacts. We then release ProcArrayLock to re-acquire it
LW_EXCLUSIVE so we can increment oldestCatalogXmin in shmem. But a
checkpoint runs and copies the old oldestCatalogXmin value after we
wrote xlog but before we updated in shmem. On the standby, redo will
apply the new value then clobber it with the old one.

To fix this, take CheckpointLock in LogStandbySnapshot (if not called
during a checkpoint) so we can't have the xl_running_xacts with the
new oldestCatalogXmin land up in WAL before a checkpoint with an older
value. Also take oldestCatalogXmin's value after we've forced
LogStandbySnapshot in a checkpoint.

Extended tests a bit to cover redo on standbys.

Personally I'm not a huge fan of how integrating this with logging
standby snapshots has turned out. It seemed to make sense initially,
but I think the way it works out is more convoluted than necessary for
little benefit. I'll prep an updated version of the
xl_advance_catalog_xmin patch with the same fixes for side by side
comparison.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

log-catalog-xmin-advances-v4.patch

Re: Logical decoding on standby

From

Craig Ringer

Date:

03 April 2017, 10:27:15

On 3 April 2017 at 13:46, Craig Ringer <craig@2ndquadrant.com> wrote:

> OK, updated catalog_xmin logging patch attached.

Ahem, that should be v5.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

log-catalog-xmin-advances-v5.patch

Re: Logical decoding on standby

From

Craig Ringer

Date:

03 April 2017, 10:41:35

On 3 April 2017 at 15:27, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 3 April 2017 at 13:46, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> OK, updated catalog_xmin logging patch attached.
>
> Ahem, that should be v5.

... and here's v6, which returns to the separate
xl_xact_catalog_xmin_advance approach.

pgintented.

This is what I favour proceeding with.

Now updating/amending recovery conflict patch.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

log-catalog-xmin-advances-v6.patch

Re: Logical decoding on standby

From

Craig Ringer

Date:

04 April 2017, 17:32:40

Hi all

Here's the final set of three patches on top of what's already committed.

The first is catalog_xmin logging, which is unchanged from the prior post.

The 2nd is support for conflict with recovery, with changes that
should address Andres's concerns there.

The 3rd actually enables decoding on standby. Unlike the prior
version, no attempt is made to check the walsender configuration for
slot use, etc. The ugly code to try to mitigate races is also removed.
Instead, if wal_level is logical the catalog_xmin sent by
hot_standby_feedback is now the same as the xmin if there's no local
slot holding it down. So we're always sending a catalog_xmin in
feedback and we should always expect to have a valid local
oldestCatalogXmin once hot_standby_feedback kicks in. This makes the
race in slot creation no worse than the existing race between
hot_standby_feedback establishment and the first queries run on a
downstream, albeit with more annoying consequences. Apps can still
ensure a slot created on standby is guaranteed safe and conflict-free
by having a slot on the master first.

I'm much happier with this. I'm still fixing some issues in the tests
for 03 and tidying them up, but 03 should allow 01 and 02 to be
reviewed in their proper context now.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Logical decoding on standby

From

Craig Ringer

Date:

04 April 2017, 17:33:14

On 4 April 2017 at 22:32, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> Here's the final set of three patches on top of what's already committed.
>
> The first is catalog_xmin logging, which is unchanged from the prior post.
>
> The 2nd is support for conflict with recovery, with changes that
> should address Andres's concerns there.
>
> The 3rd actually enables decoding on standby. Unlike the prior
> version, no attempt is made to check the walsender configuration for
> slot use, etc. The ugly code to try to mitigate races is also removed.
> Instead, if wal_level is logical the catalog_xmin sent by
> hot_standby_feedback is now the same as the xmin if there's no local
> slot holding it down. So we're always sending a catalog_xmin in
> feedback and we should always expect to have a valid local
> oldestCatalogXmin once hot_standby_feedback kicks in. This makes the
> race in slot creation no worse than the existing race between
> hot_standby_feedback establishment and the first queries run on a
> downstream, albeit with more annoying consequences. Apps can still
> ensure a slot created on standby is guaranteed safe and conflict-free
> by having a slot on the master first.
>
> I'm much happier with this. I'm still fixing some issues in the tests
> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> reviewed in their proper context now.

Dammit. Attached.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From

Andres Freund

Date:

04 April 2017, 23:19:15

On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
> I'm much happier with this. I'm still fixing some issues in the tests
> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> reviewed in their proper context now.

To me this very clearly is too late for v10, and now should be moved to
the next CF.

- Andres

Re: Logical decoding on standby

From

Craig Ringer

Date:

05 April 2017, 12:18:24

On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
>> I'm much happier with this. I'm still fixing some issues in the tests
>> for 03 and tidying them up, but 03 should allow 01 and 02 to be
>> reviewed in their proper context now.
>
> To me this very clearly is too late for v10, and now should be moved to
> the next CF.

I tend to agree that it's late in the piece. It's still worth cleaning
it up into a state ready for early pg11 though.

I've just fixed an issue where hot_standby_feedback on a physical slot
could cause oldestCatalogXmin to go backwards. When the slot's
catalog_xmin was 0 and is being set for the first time the standby's
supplied catalog_xmin is trusted. To fix it, in
PhysicalReplicationSlotNewXmin when setting catalog_xmin from 0, clamp
the value to the master's GetOldestSafeDecodingTransactionId().

Tests are cleaned up and fixed.

This series adds full support for logical decoding on a standby.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From

Andres Freund

Date:

05 April 2017, 17:32:18

On 2017-04-05 17:18:24 +0800, Craig Ringer wrote:
> On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
> >> I'm much happier with this. I'm still fixing some issues in the tests
> >> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> >> reviewed in their proper context now.
> >
> > To me this very clearly is too late for v10, and now should be moved to
> > the next CF.
> 
> I tend to agree that it's late in the piece. It's still worth cleaning
> it up into a state ready for early pg11 though.

Totally agreed.

- Andres

Re: Logical decoding on standby

From

Robert Haas

Date:

05 April 2017, 18:25:50

On Wed, Apr 5, 2017 at 10:32 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-05 17:18:24 +0800, Craig Ringer wrote:
>> On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:
>> > On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
>> >> I'm much happier with this. I'm still fixing some issues in the tests
>> >> for 03 and tidying them up, but 03 should allow 01 and 02 to be
>> >> reviewed in their proper context now.
>> >
>> > To me this very clearly is too late for v10, and now should be moved to
>> > the next CF.
>>
>> I tend to agree that it's late in the piece. It's still worth cleaning
>> it up into a state ready for early pg11 though.
>
> Totally agreed.

Based on this exchange, marked as "Moved to next CF".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company