Thread: Logical decoding on standby

Logical decoding on standby

From
Craig Ringer
Date:
Hi all

I've prepared a working initial, somewhat raw implementation for
logical decoding on physical standbys. Since it's a series of 20
smallish patches at the moment I won't attach it. You can find the
current version at time of writing here:

https://github.com/postgres/postgres/compare/c5f365f3ab...2ndQuadrant:dev/logical-decoding-on-standby-pg10-v1?expand=1

i.e. the tag dev/logical-decoding-on-standby-pg10-v1 in the github
repo 2ndQuadrant/postgres.

and whatever I'm working on (subject to rebase, breakage, etc) lives
in the branch dev/logical-decoding-on-standby-pg10 .

Quickstart
===

Compile and install like usual; make sure to install test_decoding
too. To see the functionality in action, configure with
--enable-tap-tests and:
   make -C src/test/recovery check

To try manually, initdb a master, set pg_hba.conf to 'trust' on all
replication connections, append to postgresql.conf:

   wal_level = 'logical'
   max_replication_slots = 4
   max_wal_senders = 4
   hot_standby_feedback = on

then start the master. Now
   psql -d 'master_connstr' -c "SELECT
pg_create_physical_replication_slot('standby1');"

and
   pg_basebackup -d 'master_connstr' -X stream -R --slot=standby1

and start the replica.

You can now use pg_recvlogical to create a slot on the replica and
decode changes from it, e.g.
   pg_recvlogical -d 'replica_connstr' -S test -P test_decoding --create-slot
   pg_recvlogical -d 'replica_connstr' -S 'test' -f - --start

and you'll (hopefully) see subsequent changes you make on the master.
If not, tell me.


Patch series contents
===

This patch series incorporates the following changes:

* Timeline following for logical slots, so they can start decoding on
the correct timeline and follow timeline switches (with tests);
originally [3]

* Add --endpos to pg_recvlogical, with tests; originally [3]

* Splitting of xmin and catalog_xmin on hot standby feedback, so
logical slots on a replica only hold down catalog_xmin on the master,
not the xmin for user tables (with tests). Minimises upstream bloat;
originally [4]

* Suppress export of snapshot when starting logical decoding on
replica. Since we cannot allocate an xid, we cannot export snapshots
on standby. Decoding clients can still do their initial setup via a
slot on the master then switch over, do it via physical copy, etc.

* Require hot_standby_feedback to be enabled when starting logical
decoding on a standby.

* Drop any replication slots from a database when redo'ing database
drop, so we don't leave dangling slots on the replica (with tests).

* Make the walsender respect SIGUSR1 and exit via
RecoveryConflictInterrupt() when it gets
PROCSIG_RECOVERY_CONFLICT_DATABASE (with tests); see [6]

* PostgresNode.pm enhancements for the tests

* New test coverage for logical decoding on standby

Remaining issues
===

* The method used to make the walsender respect conflict with recovery
interrupts may not be entirely safe, see walsender
procsignal_sigusr1_handler thread [5];

* In ResolveRecoveryConflictWithSnapshot(...) we probably terminate
walsenders running inside an output plugin with a virtual xact whose
xmin is below the upstream's global xmin, even though their
catalog_xmin is fine. I haven't been able to test this. We need to
only terminate them when the conflict affects relations accessible in
logical decoding, which likely needs the upstream to send more info in
WAL;

* logical decoding timeline following needs tests for cascading
physical replication where an intermediate node is promoted per
timeline following thread [3];

* walsender may need to maintain ThisTimeLineID in more places per
decoding timeline following v10 thread [3];

* it may be desirable to refactor the walsender to deliver cleaner
logical decoding timeline following per the decoding timeline
following v10 thread[3]

also:

* Nothing stops the user from disabling hot_standby_feedback on the
standby or dropping and re-creating the physical slot on the master,
causing needed catalog tuples to get vacuumed away. Since it's not
going to be safe to check slot shmem state from the
hot_standby_feedback verify hook, and we let hot_standby_feedback
change at runtime, this is going to be hard to fix comprehensively, so
we need to cope with what happens when feedback fails, but:

* We don't yet detect when upstream's catalog_xmin increases past our
needed catalog_xmin and needed catalog tuples are vacuumed away by the
upstream. So we don't invalidate the slot or terminate any active
decoding sessions using the slot. Active decoding sessions often won't
have a vtxid to use with ResolveRecoveryConflictWithVirtualXIDs(),
transaction cancel is not going to be sufficient, and anyway it'll
cancel too aggressively since it doesn't know it's safe to apply
changes that affect only (non-user-catalog) heap tables without
conflict with decoding sessions.

... so this is definitely NOT ready for commit. It does, however, make
logical decoding work on standby.

Next steps
===

Since it doesn't look practical to ensure there's never been a gap in
hot standby feedback or detect such a gap directly, I'm currently
looking at ways to reliably detect when the upstream has removed
tuples we need and error out. That means we need a way to tell when
upstream's catalog_xmin has advanced, which we don't currently have
from xlogs. The checkpoint's oldestXid is insufficient, since the
advance could have happened since the last checkpoint.


Related threads
===

This series supersedes:
* Timeline following for logical slots [1]
https://www.postgresql.org/message-id/flat/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com
*  WIP: Failover Slots [2]
https://www.postgresql.org/message-id/CAMsr+YFqtf6ecDVmJSLpC_G8T6KoNpKZZ_XgksODwPN+f=evqg@mail.gmail.com


and incorporates the patches in:
* Logical decoding timeline following take II [3]
https://www.postgresql.org/message-id/flat/CAMsr+YEQB3DbxmCUTTTX4RZy8J2uGrmb5+_ar_joFZNXa81Fug@mail.gmail.com

* Send catalog_xmin separately in hot standby feedback [4]
https://www.postgresql.org/message-id/CAMsr+YFi-LV7S8ehnwUiZnb=1h_14PwQ25d-vyUNq-f5S5r=Zg@mail.gmail.com

*  Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() from
walsender?  [5]
https://www.postgresql.org/message-id/CAMsr+YFb3R-t5O0jPGvz9_nsAt2GwwZiLSnYu3=X6mR9RnrbEw@mail.gmail.com


Also relevant:
* Use procsignal_sigusr1_handler and RecoveryConflictInterrupt() in walsender [6]
https://www.postgresql.org/message-id/CAMsr+YFb3R-t5O0jPGvz9_nsAt2GwwZiLSnYu3=X6mR9RnrbEw@mail.gmail.com


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Andres Freund
Date:
Hi,

On 2016-11-21 16:17:58 +0800, Craig Ringer wrote:
> I've prepared a working initial, somewhat raw implementation for
> logical decoding on physical standbys.

Please attach. Otherwise in a year or two it'll be impossible to look
this up.

Andres



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 22 November 2016 at 03:14, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2016-11-21 16:17:58 +0800, Craig Ringer wrote:
>> I've prepared a working initial, somewhat raw implementation for
>> logical decoding on physical standbys.
>
> Please attach. Otherwise in a year or two it'll be impossible to look
> this up.

Fair enough. Attached. Easy to apply with "git am".

I'm currently looking at making detection of replay conflict with a
slot work by separating the current catalog_xmin into two effective
parts - the catalog_xmin currently needed by any known slots
(ProcArray->replication_slot_catalog_xmin, as now), and the oldest
actually valid catalog_xmin where we know we haven't removed anything
yet.

That'll be recorded in a new CheckPoint.oldestCatalogXid field and in
ShmemVariableCache ( i.e. VariableCacheData.oldestCatalogXid ).

Vacuum will be responsible for advancing
VariableCacheData.oldestCatalogXid by writing an expanded
xl_heap_cleanup_info record with a new latestRemovedCatalogXid field
and then advancing the value in the ShmemVariableCache. Vacuum will
only remove rows of catalog or user-catalog tables that are older than
VariableCacheData.oldestCatalogXid.
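
To illustrate, a rough sketch of the expanded record (only the
latestRemovedCatalogXid field is new; it's proposed here, not existing
code):

typedef struct xl_heap_cleanup_info
{
    RelFileNode   node;
    TransactionId latestRemovedXid;
    TransactionId latestRemovedCatalogXid;  /* proposed catalog horizon */
} xl_heap_cleanup_info;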

This allows recovery on a standby to tell, based on the last
checkpoint + any xl_heap_cleanup_info records used to maintain
ShmemVariableCache, whether the upstream has removed catalog or
user-catalog records it needs. We can signal walsenders with running
xacts to terminate if their xmin passes the threshold, and when they
start a new xact they can check to see if they're still valid and bail
out.

xl_heap_cleanup_info isn't emitted much, but if adding a field there
is a problem we can instead add an additional xlog buffer that's only
appended when wal_level = logical.

Doing things this way avoids:

* the need for the standby to be able to tell at redo time whether a
RelFileNode is for a catalog or user-catalog relation without access
to relcache; or
* the need to add info on whether a catalog or user-catalog is being
affected to any xlog record that can cause a snapshot conflict on
standby; or
* a completely reliably way to ensure hot_standby_feedback can never
cease to affect the master even if the user does something dumb

at the cost of sometimes somewhat delaying removal of catalog or
user-catalog tuples when wal_level >= hot_standby, a new CheckPoint
field, and a new field in xl_heap_cleanup_info .

The above is not incorporated in the attached patch series, see the
prior post for status of the attached patches.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Craig Ringer
Date:
On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:

> I'm currently looking at making detection of replay conflict with a
> slot work by separating the current catalog_xmin into two effective
> parts - the catalog_xmin currently needed by any known slots
> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
> actually valid catalog_xmin where we know we haven't removed anything
> yet.

OK, more detailed plan.

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
are already held down by ProcArray's catalog_xmin. But that doesn't
mean we haven't removed newer tuples from specific relations and
logged that in xl_heap_clean, etc, including catalogs or user
catalogs, it only means the clog still exists for those XIDs. We don't
emit a WAL record when we advance oldestXid in
SetTransactionIdLimit(), and doing so is useless because vacuum will
have already removed needed tuples from needed catalogs before calling
SetTransactionIdLimit() from vac_truncate_clog(). We know that if
oldestXid is n, the true valid catalog_xmin where no needed tuples
have been removed must be >= n. But we need to know the lower bound of
valid catalog_xmin, which oldestXid doesn't give us.

So right now a standby has no way to reliably know if the catalog_xmin
requirement for a given replication slot can be satisfied. A standby
can't tell based on a xl_heap_cleanup_info record, xl_heap_clean
record, etc whether the affected table is a catalog or not, and
shouldn't generate conflicts for non-catalogs since otherwise it'll be
constantly clobbering walsenders.

A 2-phase advance of the global catalog_xmin would mean that
GetOldestXmin() would return a value from ShmemVariableCache, not the
oldest catalog xmin from ProcArray like it does now. (auto)vacuum
would then be responsible for:

* Reading the oldest catalog_xmin from procarray
* If it has advanced vs what's present in ShmemVariableCache, writing
a new xlog record type recording an advance of oldest catalog xmin
* advancing ShmemVariableCache's oldest catalog xmin

and would do so before it called GetOldestXmin via
vacuum_set_xid_limits() to determine what it can remove.
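
As a rough sketch of that vacuum-side step (the xlog record function
and the oldestCatalogXmin field are tentative names from this proposal,
not existing code):

static void
MaybeAdvanceOldestCatalogXmin(void)
{
    TransactionId slot_xmin;
    TransactionId slot_catalog_xmin;
    TransactionId current = ShmemVariableCache->oldestCatalogXmin;

    ProcArrayGetReplicationSlotXmin(&slot_xmin, &slot_catalog_xmin);

    if (TransactionIdIsValid(slot_catalog_xmin) &&
        (!TransactionIdIsValid(current) ||
         TransactionIdPrecedes(current, slot_catalog_xmin)))
    {
        /* 1. Log the new horizon so standbys see it before any removal... */
        XactLogCatalogXminAdvance(slot_catalog_xmin);
        /* 2. ...then make it the effective limit for this vacuum run. */
        ShmemVariableCache->oldestCatalogXmin = slot_catalog_xmin;
    }
}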

GetOldestXmin would return the ProcArray's copy of the oldest
catalog_xmin when in recovery, so we report it via hot_standby_feedback
to the upstream, it's recorded on our physical slot, and in turn
causes vacuum to advance the master's effective oldest catalog_xmin
for vacuum.

On the standby we'd replay the catalog_xmin advance record, advance
the standby's ShmemVariableCache's oldest catalog xmin, and check to
see if any replication slots, active or not, have a catalog_xmin less
than the new threshold. If none do, there's no conflict and we're
fine. If any do, we wait
max_standby_streaming_delay/max_standby_archive_delay as appropriate,
then generate recovery conflicts against all backends that have an
active replication slot based on the replication slot state in shmem.
Those backends - walsender or normal decoding backend - would promptly
die. New decoding sessions will check the ShmemVariableCache and
refuse to start if their catalog_xmin is < the threshold. Since we
advance it before generating recovery conflicts there's no race with
clients trying to reconnect after their backend is killed with a
conflict.
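
On the redo side this amounts to something like the following sketch
(slot locking elided; CancelConflictingLogicalSlot is just a
hypothetical name for "generate a recovery conflict for backends using
this slot"):

static void
catalog_xmin_advance_redo(TransactionId new_threshold)
{
    int i;

    /* Publish the new horizon first, so reconnecting clients see it. */
    ShmemVariableCache->oldestCatalogXmin = new_threshold;

    /*
     * After the usual max_standby_*_delay grace period, any slot whose
     * catalog_xmin is behind the new horizon conflicts with recovery.
     */
    for (i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

        if (s->in_use &&
            TransactionIdIsValid(s->data.catalog_xmin) &&
            TransactionIdPrecedes(s->data.catalog_xmin, new_threshold))
            CancelConflictingLogicalSlot(s);
    }
}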

If we wanted to get fancy we could set the latches of walsender
backends at risk of conflicting and they could check
ShmemVariableCache's oldest valid catalog xmin, so they could send
immediate keepalives with reply_requested set and hopefully get flush
confirmation from the client and advance their catalog_xmin before we
terminate them as conflicting with recovery. But that can IMO be done
later/separately.

Going to prototype this.



Alternate approach:
---------------

The oldest xid in heap_xlog_cleanup_info is relation-specific, but the
standby has no way to know if it's a catalog relation or not during
redo and know whether to kill slots and decoding sessions based on its
latestRemovedXid. Same for xl_heap_clean and the other records that
can cause snapshot conflicts (xl_heap_visible, xl_heap_freeze_page,
xl_btree_delete, xl_btree_reuse_page, spgxlogVacuumRedirect).

Instead of adding a 2-phase advance of the global catalog_xmin, we can
instead add a rider to each of these records that identifies whether
it's a catalog table or not. This would only be emitted when wal_level
>= logical, but it *would* increase WAL sizes a bit when logical
decoding is enabled even if it's not going to be used on a standby.
The rider would be a simple:

typedef struct xl_rel_catalog_info
{
    bool rel_accessible_from_logical_decoding;
} xl_rel_catalog_info;

or similar. During redo we call a new
ResolveRecoveryConflictWithLogicalSlot function from each of those
records' redo routines that does what I outlined above.
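
In rough outline, each affected redo routine would do something like
this (again just a sketch; the rider and conflict function are from the
proposal above, not existing code):

static void
check_logical_decoding_conflict(xl_rel_catalog_info *rider,
                                TransactionId latestRemovedXid)
{
    if (InHotStandby && rider->rel_accessible_from_logical_decoding)
        ResolveRecoveryConflictWithLogicalSlot(latestRemovedXid);
}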

This way adds more info to more xlog records, and the upstream has to
use RelationIsAccessibleInLogicalDecoding() to set up the records when
writing the xlogs. In exchange, we don't have to add a new field to
CheckPoint or ShmemVariableCache or add a new xlog record type. It
seems the worse option to me.


(BTW, as comments on GetOldestSafeDecodingTransactionId() note, we
can't rely on KnownAssignedXidsGetOldestXmin() since it can be
incomplete at least on standby.)

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Robert Haas
Date:
On Tue, Nov 22, 2016 at 1:49 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:
>> I'm currently looking at making detection of replay conflict with a
>> slot work by separating the current catalog_xmin into two effective
>> parts - the catalog_xmin currently needed by any known slots
>> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
>> actually valid catalog_xmin where we know we haven't removed anything
>> yet.
>
> OK, more detailed plan.
>
> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
> are already held down by ProcArray's catalog_xmin. But that doesn't
> mean we haven't removed newer tuples from specific relations and
> logged that in xl_heap_clean, etc, including catalogs or user
> catalogs, it only means the clog still exists for those XIDs.

Really?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 23 November 2016 at 03:55, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 22, 2016 at 1:49 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> I'm currently looking at making detection of replay conflict with a
>>> slot work by separating the current catalog_xmin into two effective
>>> parts - the catalog_xmin currently needed by any known slots
>>> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
>>> actually valid catalog_xmin where we know we haven't removed anything
>>> yet.
>>
>> OK, more detailed plan.
>>
>> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
>> are already held down by ProcArray's catalog_xmin. But that doesn't
>> mean we haven't removed newer tuples from specific relations and
>> logged that in xl_heap_clean, etc, including catalogs or user
>> catalogs, it only means the clog still exists for those XIDs.
>
> Really?

(Note the double negative above).

Yes, necessarily so. You can't look up xids older than the clog
truncation threshold at oldestXid, per our discussion on txid_status()
and traceable commit. But the tuples from that xact aren't guaranteed
to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)
which calls GetOldestXmin(...); that in turn scans ProcArray to find
the oldest xid any running xact cares about. It might bump it down
further if there's a replication slot requirement or based on
vacuum_defer_cleanup_age, but it doesn't care in the slightest about
oldestXmin.

oldestXmin cannot advance until vacuum has removed all tuples for that
xid and advanced the database's datfrozenxid. But a given oldestXmin
says nothing about which tuples, catalog or otherwise, still exist and
are accessible.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Robert Haas
Date:
On Wed, Nov 23, 2016 at 8:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
>>> are already held down by ProcArray's catalog_xmin. But that doesn't
>>> mean we haven't removed newer tuples from specific relations and
>>> logged that in xl_heap_clean, etc, including catalogs or user
>>> catalogs, it only means the clog still exists for those XIDs.
>>
>> Really?
>
> (Note the double negative above).
>
> Yes, necessarily so. You can't look up xids older than the clog
> truncation threshold at oldestXid, per our discussion on txid_status()
> and traceable commit. But the tuples from that xact aren't guaranteed
> to exist in any given relation; vacuum uses vacuum_set_xid_limits(...)
> which calls GetOldestXmin(...); that in turn scans ProcArray to find
> the oldest xid any running xact cares about. It might bump it down
> further if there's a replication slot requirement or based on
> vacuum_defer_cleanup_age, but it doesn't care in the slightest about
> oldestXmin.
>
> oldestXmin cannot advance until vacuum has removed all tuples for that
> xid and advanced the database's datfrozenxid. But a given oldestXmin
> says nothing about which tuples, catalog or otherwise, still exist and
> are accessible.

Right.  Sorry, my mistake.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 26 Nov. 2016 23:40, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Wed, Nov 23, 2016 at 8:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> > [snip]
> > oldestXmin cannot advance until vacuum has removed all tuples for that
> > xid and advanced the database's datfrozenxid. But a given oldestXmin
> > says nothing about which tuples, catalog or otherwise, still exist and
> > are accessible.
>
> Right.  Sorry, my mistake.

Phew. Had me worried there.

Thanks for looking over it. Prototype looks promising so far.

Re: Logical decoding on standby

From
Petr Jelinek
Date:
Hi,

I did look at the code a bit. The first 6 patches seem reasonable.
I don't understand why some patches are separate tbh (like 7-10, or 11).

About the 0009:
> diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
> index 9985e3e..4fa3ad4 100644
> --- a/contrib/pg_visibility/pg_visibility.c
> +++ b/contrib/pg_visibility/pg_visibility.c
> @@ -538,7 +538,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
>      if (all_visible)
>      {
>          /* Don't pass rel; that will fail in recovery. */
> -        OldestXmin = GetOldestXmin(NULL, true);
> +        OldestXmin = GetOldestXmin(NULL, true, false);
>      }
>  
>      rel = relation_open(relid, AccessShareLock);
> @@ -660,7 +660,7 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
>                   * a buffer lock. And this shouldn't happen often, so it's
>                   * worth being careful so as to avoid false positives.
>                   */
> -                RecomputedOldestXmin = GetOldestXmin(NULL, true);
> +                RecomputedOldestXmin = GetOldestXmin(NULL, true, false);
>  
>                  if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
>                      record_corrupt_item(items, &tuple.t_self);
> diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
> index f524fc4..5b33c97 100644
> --- a/contrib/pgstattuple/pgstatapprox.c
> +++ b/contrib/pgstattuple/pgstatapprox.c
> @@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
>      TransactionId OldestXmin;
>      uint64        misc_count = 0;
>  
> -    OldestXmin = GetOldestXmin(rel, true);
> +    OldestXmin = GetOldestXmin(rel, true, false);
>      bstrategy = GetAccessStrategy(BAS_BULKREAD);
>  
>      nblocks = RelationGetNumberOfBlocks(rel);

This does not seem correct, you are sending false as pointer parameter.

0012:

I think there should be parameter saying if snapshot should be exported
or not and if user asks for it on standby it should fail.

0014 makes 0011 even more pointless.

Not going into deeper detail as this is still very WIP. I do agree with
the general design though.

This also replaces the previous timeline following and decoding
threads/CF entries so maybe those should be closed in CF?

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
>> --- a/contrib/pgstattuple/pgstatapprox.c
>> +++ b/contrib/pgstattuple/pgstatapprox.c
>> @@ -70,7 +70,7 @@ statapprox_heap(Relation rel, output_type *stat)
>>       TransactionId OldestXmin;
>>       uint64          misc_count = 0;
>>
>> -     OldestXmin = GetOldestXmin(rel, true);
>> +     OldestXmin = GetOldestXmin(rel, true, false);
>>       bstrategy = GetAccessStrategy(BAS_BULKREAD);
>>
>>       nblocks = RelationGetNumberOfBlocks(rel);
>
> This does not seem correct, you are sending false as pointer parameter.

Thanks. That's an oversight from the GetOldestXmin interface change
per your prior feedback. C doesn't care since null is 0 and false is
0, and I missed it when transforming the patch.

> 0012:
>
> I think there should be parameter saying if snapshot should be exported
> or not and if user asks for it on standby it should fail.

Sounds reasonable. That also means clients can suppress snapshot export
on the master, which as we recently learned can be desirable sometimes.

> 0014 makes 0011 even more pointless.

Yeah, as I said, it's a bit WIP still and needs some rebasing and rearrangement.

> This also replaces the previous timeline following and decoding
> threads/CF entries so maybe those should be closed in CF?

I wasn't sure what to do about that, since it's all a set of related
functionality. I think it's going to get more traction as a single
"logical decoding onstandby" feature though, since the other parts are
hard to test and use in isolation. So yeah, probably, I'll do so
unless someone objects.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 21 November 2016 at 16:17, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> I've prepared a working initial, somewhat raw implementation for
> logical decoding on physical standbys.

Hi all

I've attached a significantly revised patch, which now incorporates
safeguards to ensure that we prevent decoding if the master has not
retained needed catalogs and cancel decoding sessions that are holding
up apply because they need too-old catalogs.

The biggest change in this patch, and the main intrusive part, is that
procArray->replication_slot_catalog_xmin is no longer directly used by
vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
added, with a corresponding CheckPoint field. Vacuum notices if
procArray->replication_slot_catalog_xmin has advanced past
ShmemVariableCache->oldestCatalogXmin and writes a new xact rmgr
record with the new value before it copies it to oldestCatalogXmin.
This means that a standby can now reliably tell when catalogs are
about to be removed or become candidates for removal, so it can pause
redo until logical decoding sessions on the standby advance far enough
that their catalog_xmin passes that point. It also means that if our
hot_standby_feedback somehow fails to lock in the catalogs our slots
need on a standby, we can cancel sessions with a conflict with
recovery.

If wal_level is < logical this won't do anything, since
replication_slot_catalog_xmin and oldestCatalogXmin will both always
be 0.

Because oldestCatalogXmin advances eagerly as soon as vacuum sees the
new replication_slot_catalog_xmin, this won't impact catalog bloat.

Ideally this mechanism won't generally actually be needed, since
hot_standby_feedback stops the master from removing needed catalogs,
and we make an effort to ensure that the standby has
hot_standby_feedback enabled and is using a replication slot. We
cannot prevent the user from dropping and re-creating the physical
slot on the upstream, though, and it doesn't look simple to stop them
turning off hot_standby_feedback or turning off use of a physical slot
after creating logical slots, either. So we try to stop users shooting
themselves in the foot, but if they do it anyway we notice and cope
gracefully. Logging catalog_xmin also helps slots created on standbys
know where to start, and makes sure we can deal gracefully with a race
between hs_feedback and slot creation on a standby.

There can be a significant delay for slot creation on standby since we
have to wait until there's a new xl_running_xacts record logged. I'd
like to extend the hot_standby_feedback protocol a little to address
that and some other issues, but that's a separate step.

I haven't addressed Petr's point yet, that "there should be parameter
saying if snapshot should be exported
or not and if user asks for it on standby it should fail". Otherwise I
think it's looking fairly solid.

Due to the amount of churn I landed up flattening the patchset. It
probably makes sense to split it up, likely into the sequence of
changes listed in the commit message. I'll wait for a general opinion
on the validity of this approach first.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Logical decoding on standby

From
Petr Jelinek
Date:
On 07/12/16 07:05, Craig Ringer wrote:
> On 21 November 2016 at 16:17, Craig Ringer <craig@2ndquadrant.com> wrote:
>> Hi all
>>
>> I've prepared a working initial, somewhat raw implementation for
>> logical decoding on physical standbys.
> 
> Hi all
> 
> I've attached a significantly revised patch, which now incorporates
> safeguards to ensure that we prevent decoding if the master has not
> retained needed catalogs and cancel decoding sessions that are holding
> up apply because they need too-old catalogs
> 
> The biggest change in this patch, and the main intrusive part, is that
> procArray->replication_slot_catalog_xmin is no longer directly used by
> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
> added, with a corresponding CheckPoint field. Vacuum notices if
> procArray->replication_slot_catalog_xmin has advanced past
> ShmemVariableCache->oldestCatalogXmin and writes a new xact rmgr
> record with the new value before it copies it to oldestCatalogXmin.
> This means that a standby can now reliably tell when catalogs are
> about to be removed or become candidates for removal, so it can pause
> redo until logical decoding sessions on the standby advance far enough
> that their catalog_xmin passes that point. It also means that if our
> hot_standby_feedback somehow fails to lock in the catalogs our slots
> need on a standby, we can cancel sessions with a conflict with
> recovery.
> 
> If wal_level is < logical this won't do anything, since
> replication_slot_catalog_xmin and oldestCatalogXmin will both always
> be 0.
> 
> Because oldestCatalogXmin advances eagerly as soon as vacuum sees the
> new replication_slot_catalog_xmin, this won't impact catalog bloat.
> 
> Ideally this mechanism won't generally actually be needed, since
> hot_standby_feedback stops the master from removing needed catalogs,
> and we make an effort to ensure that the standby has
> hot_standby_feedback enabled and is using a replication slot. We
> cannot prevent the user from dropping and re-creating the physical
> slot on the upstream, though, and it doesn't look simple to stop them
> turning off hot_standby_feedback or turning off use of a physical slot
> after creating logical slots, either. So we try to stop users shooting
> themselves in the foot, but if they do it anyway we notice and cope
> gracefully. Logging catalog_xmin also helps slots created on standbys
> know where to start, and makes sure we can deal gracefully with a race
> between hs_feedback and slot creation on a standby.
> 

Hi,

If this mechanism would not be needed most of the time, wouldn't it be
better to not have it and just have a way to ask physical slot about
what's the current reserved catalog_xmin (in most cases the standby
should actually know what it is since it's sending the hs_feedback, but
first slot created on such standby may need to check). WRT preventing
hs_feedback going off, we can refuse to start with hs_feedback off when
there are logical slots detected. We can also refuse to connect to the
master without physical slot if there are logical slots detected. I
don't see problem with either of those.

You may ask what if user drops the slot and recreates or somehow
otherwise messes up catalog_xmin on master, well, considering that under
my proposal we'd first (on connect) check the slot for catalog_xmin we'd
know about it so we'd either mark the local slots broken/drop them or
plainly refuse to connect to the master same way as if it didn't have
required WAL anymore (not sure which behavior is better). Note that user
could mess up catalog_xmin even in your design the same way, so it's not
really a regression.

In general I don't think that it's necessary to WAL log anything for
this. It will not work without slot and therefore via archiving anyway
so writing to WAL does not seem to buy us anything. There are some
interesting side effects of cascading (ie having A->B->C replication and
creating logical slot on C) but those should not be insurmountable. Plus
it might even be okay to only allow creating logical slots on standbys
connected directly to master in v1.


That's about approach, but since there are prerequisite patches in the
patchset that don't really depend on the approach I will comment about
them as well.

0001 and 0002 add testing infrastructure and look fine to me, possibly
committable.

But in 0003 I don't understand following code:
> +    if (endpos != InvalidXLogRecPtr && !do_start_slot)
> +    {
> +        fprintf(stderr,
> +                _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
> +                progname);
> +        fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
> +                progname);
> +        exit(1);
> +    }

Why is --create-slot and --endpos not allowed together?

0004 again looks good but depends on 0003.

0005 is timeline following which is IMHO ready for committer, as is 0006
and 0008 and I still maintain the opinion that these should go in soon.

0007 is unfinished as you said in your mail (missing option to specify
behavior). And the last one 0009 is the implementation discussed above,
which I think needs rework. IMHO 0007 and 0009 should be ultimately merged.

I think parts of this could be committed separately and are ready for
committer IMHO, but there is no way in CF application to mark only part
of patch-set for committer to attract the attention.

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

>> The biggest change in this patch, and the main intrusive part, is that
>> procArray->replication_slot_catalog_xmin is no longer directly used by
>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
>> added, with a corresponding CheckPoint field.
>> [snip]
>
> If this mechanism would not be needed most of the time, wouldn't it be
> better to not have it and just have a way to ask physical slot about
> what's the current reserved catalog_xmin (in most cases the standby
> should actually know what it is since it's sending the hs_feedback, but
> first slot created on such standby may need to check).

Yes, and that was actually my originally preferred approach, though
the one above does offer the advantage that if something goes wrong we
can detect it and fail gracefully. Possibly not worth the complexity
though.

Your approach requires us to make very sure that hot_standby_feedback
does not get turned off by user or become ineffective once we're
replicating, though, since we won't have any way to detect when needed
tuples are removed. We'd probably just bail out with relcache/syscache
lookup errors, but I can't guarantee we wouldn't crash if we tried
logical decoding on WAL where needed catalogs have been removed.

I initially ran into trouble doing that, but now think I have a way forward.

> WRT preventing
> hs_feedback going off, we can refuse to start with hs_feedback off when
> there are logical slots detected.

Yes. There are some ordering issues there though. We load slots quite
late in startup and they don't get tracked in checkpoints. So we might
launch the walreceiver before we load slots and notice their needed
xmin/catalog_xmin. So we need to prevent sending of
hot_standby_feedback until slots are loaded, or load slots earlier in
startup. The former sounds less intrusive and safer - probably just
add an "initialized" flag to ReplicationSlotCtlData and suppress
hs_feedback until it becomes true.
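
Something like this minimal sketch, assuming such a flag (the
"initialized" field is proposed, not existing code):

static bool
hot_standby_feedback_ready(void)
{
    /* don't report an xmin before we know what the local slots need */
    return hot_standby_feedback && ReplicationSlotCtl->initialized;
}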

We'd also have to suppress the validation callback action on the
hot_standby_feedback GUC until we know replication slot state is
initialised, and perform the check during slot startup instead. The
hot_standby_feedback GUC validation check might get called before
shmem is even set up so we have to guard against attempts to access a
shmem segment that may not even exist yet.

The general idea is workable though. Refuse to start if logical slots
exist and hot_standby_feedback is off or walreceiver isn't using a
physical slot. Refuse to allow hot_standby_feedback to change

> We can also refuse to connect to the
> master without physical slot if there are logical slots detected. I
> don't see problem with either of those.

Agreed. We must also be able to reliably enforce that the walreceiver
is using a replication slot to connect to the master and refuse to
start if it is not. The user could change recovery.conf and restart
the walreceiver while we're running, after we perform that check, so
walreceiver must also refuse to start if logical replication slots
exist but it has no primary slot name configured.

> You may ask what if user drops the slot and recreates or somehow
> otherwise messes up catalog_xmin on master, well, considering that under
> my proposal we'd first (on connect) check the slot for catalog_xmin we'd
> know about it so we'd either mark the local slots broken/drop them or
> plainly refuse to connect to the master same way as if it didn't have
> required WAL anymore (not sure which behavior is better). Note that user
> could mess up catalog_xmin even in your design the same way, so it's not
> really a regression.

Agreed. Checking catalog_xmin of the slot when we connect is
sufficient to guard against that, assuming we can trust that the
catalog_xmin is actually in effect on the master. Consider cascading
setups, where we set our catalog_xmin but it might not be "locked in"
until the middle cascaded server relays it to the master.

I have a proposed solution to that which I'll outline in a separate
patch+post; it ties in to some work on addressing the race between hot
standby feedback taking effect and queries starting on the hot
standby.  It boils down to "add a hot_standby_feedback reply protocol
message".

> Plus
> it might even be okay to only allow creating logical slots on standbys
> connected directly to master in v1.

True. I didn't consider that.

We haven't had much luck in the past with such limitations, but
personally I'd consider it a perfectly reasonable one.

> But in 0003 I don't understand following code:
>> +     if (endpos != InvalidXLogRecPtr && !do_start_slot)
>> +     {
>> +             fprintf(stderr,
>> +                             _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>> +                             progname);
>> +             fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> +                             progname);
>> +             exit(1);
>> +     }
>
> Why is --create-slot and --endpos not allowed together?

What would --create-slot with --endpos do?

Ah. I had not realised that it is legal to do

pg_recvlogical -S test --create-slot --start -f - -d 'test'

i.e. "create a slot then in the same invocation begin decoding from it".

I misread and thought that --create-slot and --start were mutually exclusive.

I will address that.

> 0005 is timeline following which is IMHO ready for committer, as is 0006
> and 0008 and I still maintain the opinion that these should go in soon.

I wonder if I should re-order 0005 and 0006 so we can commit the
hot_standby test improvements before logical decoding timeline
following.

> I think parts of this could be committed separately and are ready for
> committer IMHO, but there is no way in CF application to mark only part
> of patch-set for committer to attract the attention.

Yeah. I raised that before and nobody was really sure what to do about
it. It's confusing to post patches on the same thread on separate CF
entries. It's also confusing to post patches on a nest of
inter-related threads to allow each thread to be tracked by a separate
CF entry.

At the moment I'm aiming to progressively get the underlying
infrastructure/test stuff in so we can focus on the core feature.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

> But in 0003 I don't understand following code:
>> +     if (endpos != InvalidXLogRecPtr && !do_start_slot)
>> +     {
>> +             fprintf(stderr,
>> +                             _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>> +                             progname);
>> +             fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> +                             progname);
>> +             exit(1);
>> +     }
>
> Why is --create-slot and --endpos not allowed together?

Actually, the test is fine, the error is just misleading due to my
misunderstanding.

The fix is simply to change the error message to
                               _("%s: --endpos may only be specified
with --start\n"),

so I won't post a separate followup patch.

Okano Naoki tried to bring this to my attention earlier, but I didn't
understand, as I hadn't yet realised you could use --create-slot
--start; they aren't mutually exclusive.

(We test to ensure --start --drop-slot isn't specified earlier so no
test for do_drop_slot is required).

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Petr Jelinek
Date:
On 21/12/16 04:31, Craig Ringer wrote:
> On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> 
>> But in 0003 I don't understand following code:
>>> +     if (endpos != InvalidXLogRecPtr && !do_start_slot)
>>> +     {
>>> +             fprintf(stderr,
>>> +                             _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>>> +                             progname);
>>> +             fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>>> +                             progname);
>>> +             exit(1);
>>> +     }
>>
>> Why is --create-slot and --endpos not allowed together?
> 
> Actually, the test is fine, the error is just misleading due to my
> misunderstanding.
> 
> The fix is simply to change the error message to
> 
>                                 _("%s: --endpos may only be specified
> with --start\n"),
> 
> so I won't post a separate followup patch.
> 

Ah, okay, makes sense. The --create-slot + --endpos should definitely be
an allowed combination, especially now that we can extend this to
optionally use a temporary slot.

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Petr Jelinek
Date:
On 21/12/16 04:06, Craig Ringer wrote:
> On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> 
>>> The biggest change in this patch, and the main intrusive part, is that
>>> procArray->replication_slot_catalog_xmin is no longer directly used by
>>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
>>> added, with a corresponding CheckPoint field.
>>> [snip]
>>
>> If this mechanism would not be needed most of the time, wouldn't it be
>> better to not have it and just have a way to ask physical slot about
>> what's the current reserved catalog_xmin (in most cases the standby
>> should actually know what it is since it's sending the hs_feedback, but
>> first slot created on such standby may need to check).
> 
> Yes, and that was actually my originally preferred approach, though
> the one above does offer the advantage that if something goes wrong we
> can detect it and fail gracefully. Possibly not worth the complexity
> though.
> 
> Your approach requires us to make very sure that hot_standby_feedback
> does not get turned off by user or become ineffective once we're
> replicating, though, since we won't have any way to detect when needed
> tuples are removed. We'd probably just bail out with relcache/syscache
> lookup errors, but I can't guarantee we wouldn't crash if we tried
> logical decoding on WAL where needed catalogs have been removed.
> 
> I initially ran into trouble doing that, but now think I have a way forward.
> 
>> WRT preventing
>> hs_feedback going off, we can refuse to start with hs_feedback off when
>> there are logical slots detected.
> 
> Yes. There are some ordering issues there though. We load slots quite
> late in startup and they don't get tracked in checkpoints. So we might
> launch the walreceiver before we load slots and notice their needed
> xmin/catalog_xmin. So we need to prevent sending of
> hot_standby_feedback until slots are loaded, or load slots earlier in
> startup. The former sounds less intrusive and safer - probably just
> add an "initialized" flag to ReplicationSlotCtlData and suppress
> hs_feedback until it becomes true.
> 
> We'd also have to suppress the validation callback action on the
> hot_standby_feedback GUC until we know replication slot state is
> initialised, and perform the check during slot startup instead. The
> hot_standby_feedback GUC validation check might get called before
> shmem is even set up so we have to guard against attempts to access a
> shmem segment that may not even exist yet.
> 
> The general idea is workable though. Refuse to start if logical slots
> exist and hot_standby_feedback is off or walreceiver isn't using a
> physical slot. Refuse to allow hot_standby_feedback to change
> 

These are all problems associated with replication slots being in shmem,
if I understand correctly. I wonder, could we put just a bool someplace
which is available early that says if there are any logical slots
defined? We don't actually need all the slot info, just to know if there
are any.

> 
>> You may ask what if user drops the slot and recreates or somehow
>> otherwise messes up catalog_xmin on master, well, considering that under
>> my proposal we'd first (on connect) check the slot for catalog_xmin we'd
>> know about it so we'd either mark the local slots broken/drop them or
>> plainly refuse to connect to the master same way as if it didn't have
>> required WAL anymore (not sure which behavior is better). Note that user
>> could mess up catalog_xmin even in your design the same way, so it's not
>> really a regression.
> 
> Agreed. Checking catalog_xmin of the slot when we connect is
> sufficient to guard against that, assuming we can trust that the
> catalog_xmin is actually in effect on the master. Consider cascading
> setups, where we set our catalog_xmin but it might not be "locked in"
> until the middle cascaded server relays it to the master.
> 
> I have a proposed solution to that which I'll outline in a separate
> patch+post; it ties in to some work on addressing the race between hot
> standby feedback taking effect and queries starting on the hot
> standby.  It boils down to "add a hot_standby_feedback reply protocol
> message".
> 
>> Plus
>> it might even be okay to only allow creating logical slots on standbys
>> connected directly to master in v1.
> 
> True. I didn't consider that.
> 
> We haven't had much luck in the past with such limitations, but
> personally I'd consider it a perfectly reasonable one.
> 

I think it's infinitely better with that limitation than the status quo,
especially for the failover scenario (you usually won't fail over to a
replica down the cascade as it's always more behind). Not to mention that
with every level of cascading you automatically get more lag, which means
more bloat, so it might not even be all that desirable to go that route
immediately in v1 when we don't have a way to control that bloat/maximum
slot lag.

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Robert Haas
Date:
On Tue, Dec 20, 2016 at 10:06 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 20 December 2016 at 15:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>>> The biggest change in this patch, and the main intrusive part, is that
>>> procArray->replication_slot_catalog_xmin is no longer directly used by
>>> vacuum. Instead, a new ShmemVariableCache->oldestCatalogXmin field is
>>> added, with a corresponding CheckPoint field.
>>> [snip]
>>
>> If this mechanism would not be needed most of the time, wouldn't it be
>> better to not have it and just have a way to ask physical slot about
>> what's the current reserved catalog_xmin (in most cases the standby
>> should actually know what it is since it's sending the hs_feedback, but
>> first slot created on such standby may need to check).
>
> Yes, and that was actually my originally preferred approach, though
> the one above does offer the advantage that if something goes wrong we
> can detect it and fail gracefully. Possibly not worth the complexity
> though.
>
> Your approach requires us to make very sure that hot_standby_feedback
> does not get turned off by user or become ineffective once we're
> replicating, though, since we won't have any way to detect when needed
> tuples are removed. We'd probably just bail out with relcache/syscache
> lookup errors, but I can't guarantee we wouldn't crash if we tried
> logical decoding on WAL where needed catalogs have been removed.

I dunno, Craig, I think your approach sounds more robust.  It's not
very nice to introduce a bunch of random prohibitions on what works
with what, and it doesn't sound like it's altogether watertight
anyway.  Incorporating an occasional, small record into the WAL stream
to mark the advancement of the reserved catalog_xmin seems like a
cleaner and safer solution.  We certainly do NOT want to find out
about corruption only because of random relcache/syscache lookup
failures, let alone crashes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Logical decoding on standby

From
Michael Paquier
Date:
On Tue, Dec 20, 2016 at 4:03 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> That's about approach, but since there are prerequisite patches in the
> patchset that don't really depend on the approach I will comment about
> them as well.
>
> 0001 and 0002 add testing infrastructure and look fine to me, possibly
> committable.
>
> But in 0003 I don't understand following code:
>> +     if (endpos != InvalidXLogRecPtr && !do_start_slot)
>> +     {
>> +             fprintf(stderr,
>> +                             _("%s: cannot use --create-slot or --drop-slot together with --endpos\n"),
>> +                             progname);
>> +             fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> +                             progname);
>> +             exit(1);
>> +     }
>
> Why is --create-slot and --endpos not allowed together?
>
> 0004 again looks good but depends on 0003.
>
> 0005 is timeline following which is IMHO ready for committer, as is 0006
> and 0008 and I still maintain the opinion that these should go in soon.
>
> 0007 is unfinished as you said in your mail (missing option to specify
> behavior). And the last one 0009 is the implementation discussed above,
> which I think needs rework. IMHO 0007 and 0009 should be ultimately merged.
>
> I think parts of this could be committed separately and are ready for
> committer IMHO, but there is no way in CF application to mark only part
> of patch-set for committer to attract the attention.

Craig has pinged me about looking at 0001, 0002, 0004 and 0006 as
those involve the TAP infrastructure.

So, for 0001:
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -93,6 +93,7 @@ use RecursiveCopy;
 use Socket;
 use Test::More;
 use TestLib ();
+use pg_lsn qw(parse_lsn);
 use Scalar::Util qw(blessed);
This depends on 0002, so the order should be reversed.

+sub lsn
+{
+   my $self = shift;
+   return $self->safe_psql('postgres', 'select case when
pg_is_in_recovery() then pg_last_xlog_replay_location() else
pg_current_xlog_insert_location() end as lsn;');
+}
The design of the test should be in charge of choosing which value it
wants to get, and the routine should just blindly do the work. More
flexibility is more useful to design tests. So it would be nice to
have one routine able to switch at will between 'flush', 'insert',
'write', 'receive' and 'replay' modes to get values from the
corresponding xlog functions.

-       die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
+       die "error running SQL: '$$stderr'\nwhile running
'@psql_params' with sql '$sql'"         if $ret == 3;
That's separate from this patch, and definitely useful.

+   if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
+       die "valid modes are restart, confirmed_flush";
+   }
+   if (!defined($target_lsn)) {
+       $target_lsn = $self->lsn;
+   }
That's not really the style followed by the perl scripts present in
the code regarding the use of the brackets. Do we really need to care
about the object type checks by the way?

Regarding wait_for_catchup, there are two ways to do things. Either
query the standby like in the way 004_timeline_switch.pl does it or
the way this routine does. The way of this routine looks more
straight-forward IMO, and other tests should be refactored to use it.
In short I would make the target LSN a mandatory argument, and have
the caller send a standby's application_name instead of a PostgresNode
object, the current way to enforce the value of $standby_name being
really confusing.

+   my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1,
'replay' => 1 );
What's actually the point of 'sent'?

+   my @fields = ('plugin', 'slot_type', 'datoid', 'database',
'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
+   my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
@fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
'$slot_name'");
+   $result = undef if $result eq '';
+   # hash slice, see http://stackoverflow.com/a/16755894/398670 .
Couldn't this portion be made more generic? Other queries could
benefit from that by having a routine that accepts as argument an
array of column names for example.

Now looking at 0002....
One whitespace:
$ git diff HEAD~1 --check
src/test/perl/pg_lsn.pm:139: trailing whitespace.
+=cut

pg_lsn sounds like a fine name, though we are more used to camel case
for module names. And routines are written as lower case separated by
underscores.

+++ b/src/test/perl/t/002_pg_lsn.pl
@@ -0,0 +1,68 @@
+use strict;
+use warnings;
+use Test::More tests => 42;
+use Scalar::Util qw(blessed);
Most of the tests added don't have a description. This makes things
harder to debug when tracking an issue.

It may be good to begin using this module within the other tests in
this patch as well. Now do we actually need it? Most of the existing
tests I recall rely on the backend's operators for the pg_lsn data
type, so this is actually duplicating an existing facility. And all the
values are just passed as-is.

+++ b/src/test/perl/t/001_load.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+use Test::More tests => 5;
I can guess the meaning of this test, but having a comment on top of it
to explain the purpose of the test is good practice.

Looking at 0004...
+Disallows pg_recvlogial from internally retrying on error by passing --no-loop.
s/pg_recvlogial/pg_recvlogical

+sub pg_recvlogical_upto
+{
This looks like a good idea for your tests.

+my $endpos = $node_master->safe_psql('postgres', "SELECT location
FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY
location DESC LIMIT 1;");
+diag "waiting to replay $endpos";
On the same wave as the pg_recvlogical wrapper, you may want to
consider some kind of wrapper at SQL level calling the slot functions.

And finally 0006.
+$node_standby_1->append_conf('recovery.conf', "primary_slot_name =
$slotname_1\n");
+$node_standby_1->append_conf('postgresql.conf',
"wal_receiver_status_interval = 1\n");
+$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");
No need to call this routine multiple times.
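i.e. one call per file is enough, something like (sketch):

    $node_standby_1->append_conf('postgresql.conf', qq(
    wal_receiver_status_interval = 1
    max_replication_slots = 4
    ));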

Increasing the test coverage is definitely worth it.
-- 
Michael



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 22 December 2016 at 13:43, Michael Paquier <michael.paquier@gmail.com> wrote:

> So, for 0001:
> --- a/src/test/perl/PostgresNode.pm
> +++ b/src/test/perl/PostgresNode.pm
> @@ -93,6 +93,7 @@ use RecursiveCopy;
>  use Socket;
>  use Test::More;
>  use TestLib ();
> +use pg_lsn qw(parse_lsn);
>  use Scalar::Util qw(blessed);
> This depends on 0002, so the order should be reversed.

Will do. That was silly.

I think I should probably also move the standby tests earlier, then
add a patch to update them when the results change.

> +sub lsn
> +{
> +   my $self = shift;
> +   return $self->safe_psql('postgres', 'select case when
> pg_is_in_recovery() then pg_last_xlog_replay_location() else
> pg_current_xlog_insert_location() end as lsn;');
> +}
> The design of the test should be in charge of choosing which value it
> wants to get, and the routine should just blindly do the work. More
> flexibility is more useful to design tests. So it would be nice to
> have one routine able to switch at will between 'flush', 'insert',
> 'write', 'receive' and 'replay' modes to get values from the
> corresponding xlog functions.

Fair enough. I can amend that.

> -       die "error running SQL: '$$stderr'\nwhile running '@psql_params'"
> +       die "error running SQL: '$$stderr'\nwhile running
> '@psql_params' with sql '$sql'"
>           if $ret == 3;
> That's separate from this patch, and definitely useful.

Yeah. Slipped through. I don't think it really merits a separate patch
though tbh.

> +   if (!($mode eq 'restart' || $mode eq 'confirmed_flush')) {
> +       die "valid modes are restart, confirmed_flush";
> +   }
> +   if (!defined($target_lsn)) {
> +       $target_lsn = $self->lsn;
> +   }
> That's not really the style followed by the perl scripts present in
> the code regarding the use of the brackets. Do we really need to care
> about the object type checks by the way?

Brackets: will look / fix.

Type checks (not in quoted snippet above): that's a convenience to let
you pass a PostgresNode instance or a string name. Maybe there's a
more idiomatic Perl-y way to write it. My Perl is pretty dire.

> Regarding wait_for_catchup, there are two ways to do things. Either
> query the standby like in the way 004_timeline_switch.pl does it or
> the way this routine does. The way of this routine looks more
> straight-forward IMO, and other tests should be refactored to use it.
> In short I would make the target LSN a mandatory argument, and have
> the caller send a standby's application_name instead of a PostgresNode
> object, the current way to enforce the value of $standby_name being
> really confusing.

Hm, ok. I'll take a look. Making LSN mandatory so you have to pass
$self->lsn is ok with me.

> +   my %valid_modes = ( 'sent' => 1, 'write' => 1, 'flush' => 1,
> 'replay' => 1 );
> What's actually the point of 'sent'?

Pretty useless, but we expose it in Pg, so we might as well in the tests.

> +   my @fields = ('plugin', 'slot_type', 'datoid', 'database',
> 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
> +   my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
> @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
> '$slot_name'");
> +   $result = undef if $result eq '';
> +   # hash slice, see http://stackoverflow.com/a/16755894/398670 .
> Couldn't this portion be made more generic? Other queries could
> benefit from that by having a routine that accepts as argument an
> array of column names for example.

Yeah, probably. I'm not sure where it should live though - TestLib.pm ?

Not sure if there's an idiomatic way to pass a string (in this case a
query) in Perl with a placeholder for interpolation of values (in this
case columns). In Python you'd pass it with pre-defined
%(placeholders)s format markers.

> Now looking at 0002....
> One whitespace:
> $ git diff HEAD~1 --check
> src/test/perl/pg_lsn.pm:139: trailing whitespace.
> +=cut

Will fix.

> pg_lsn sounds like a fine name, now we are more used to camel case for
> module names. And routines are written as lower case separated by an
> underscore.

Unsure what the intent of this is.

> +++ b/src/test/perl/t/002_pg_lsn.pl
> @@ -0,0 +1,68 @@
> +use strict;
> +use warnings;
> +use Test::More tests => 42;
> +use Scalar::Util qw(blessed);
> Most of the tests added don't have a description. This makes things
> harder to debug when tracking an issue.
>
> It may be good to begin using this module within the other tests in
> this patch as well. Now do we actually need it? Most of the existing
> tests I recall rely on the backend's operators for the pg_lsn data
> type, so this is actually duplicating an existing facility. And all the
> values are just passed as-is.

I added it mainly for ordered tests of whether some expected lsn had
passed/increased. But maybe it makes sense to just call into the
server and let it evaluate such tests.
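e.g. something like this (a sketch; $observed and $expected stand for LSN
strings captured earlier), relying only on the backend's pg_lsn type:

    my $passed = $node->safe_psql('postgres',
        "SELECT '$observed'::pg_lsn >= '$expected'::pg_lsn");
    ok($passed eq 't', 'standby replayed past the expected LSN');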

> +++ b/src/test/perl/t/001_load.pl
> @@ -0,0 +1,9 @@
> +use strict;
> +use warnings;
> +use Test::More tests => 5;
> I can guess the meaning of this test, having a comment on top of it to
> explain the purpose of the test is good practice though.

Will.

> Looking at 0004...
> +Disallows pg_recvlogial from internally retrying on error by passing --no-loop.
> s/pg_recvlogial/pg_recvlogical

Thanks.

> +sub pg_recvlogical_upto
> +{
> This looks like a good idea for your tests.

Yeah, and likely others too as we start doing more with logical
replication in future.

> +my $endpos = $node_master->safe_psql('postgres', "SELECT location
> FROM pg_logical_slot_peek_changes('test_slot', NULL, NULL) ORDER BY
> location DESC LIMIT 1;");
> +diag "waiting to replay $endpos";
> On the same wave as the pg_recvlogical wrapper, you may want to
> consider some kind of wrapper at SQL level calling the slot functions.

I'd really rather beg off that until needed later. The SQL functions
are simple to invoke from PostgresNode::psql in the mean time; not so
much so with pg_recvlogical.
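e.g. a trivial wrapper like this would do if one ever proves useful
(hypothetical sketch):

    sub slot_peek_endpos
    {
        my ($self, $slot) = @_;
        return $self->safe_psql('postgres',
            "SELECT location FROM pg_logical_slot_peek_changes('$slot', NULL, NULL) "
          . "ORDER BY location DESC LIMIT 1;");
    }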

> And finally 0006.
> +$node_standby_1->append_conf('recovery.conf', "primary_slot_name =
> $slotname_1\n");
> +$node_standby_1->append_conf('postgresql.conf',
> "wal_receiver_status_interval = 1\n");
> +$node_standby_1->append_conf('postgresql.conf', "max_replication_slots = 4\n");
> No need to call this routine multiple times.
>
> Increasing the test coverage is definitely worth it.

Thanks.

I'll follow up with amendments. I've also implemented Petr's
suggestion to allow explicit omission of a snapshot on slot creation.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 22 December 2016 at 14:21, Craig Ringer <craig@2ndquadrant.com> wrote:

changes-in-0001-v2.diff shows the changes to PostgresNode.pm per
Michael's comments, and applies on top of 0001.

I also attach a patch to add a new CREATE_REPLICATION_SLOT option per
Petr's suggestion, so you can request a slot be created
WITHOUT_SNAPSHOT. This replaces the patch series's behaviour of
silently suppressing snapshot export when a slot was created on a
replica. It'll conflict (easily resolved) if applied on top of the
current series.

I have more to do before re-posting the full series, so waiting on
author at this point. The PostgresNode changes likely break later
tests, I'm just posting them so there's some progress here and so I
don't forget over the next few days' distraction.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Andrew Dunstan
Date:

On 12/22/2016 01:21 AM, Craig Ringer wrote:
>> +   my @fields = ('plugin', 'slot_type', 'datoid', 'database',
>> 'active', 'active_pid', 'xmin', 'catalog_xmin', 'restart_lsn');
>> +   my $result = $self->safe_psql('postgres', 'SELECT ' . join(', ',
>> @fields) . " FROM pg_catalog.pg_replication_slots WHERE slot_name =
>> '$slot_name'");
>> +   $result = undef if $result eq '';
>> +   # hash slice, see http://stackoverflow.com/a/16755894/398670 .
>> Couldn't this portion be made more generic? Other queries could
>> benefit from that by having a routine that accepts as argument an
>> array of column names for example.
> Yeah, probably. I'm not sure where it should live though - TestLib.pm ?
>
> Not sure if there's an idiomatic way to pass a string (in this case a
> query) in Perl with a placeholder for interpolation of values (in this
> case columns). In Python you'd pass it with pre-defined
> %(placeholders)s format markers.
>
>



For direct interpolation of an expression, there is this slightly 
baroque gadget:
   my $str = "here it is @{[ arbitrary expression here ]}";

For indirect interpolation using placeholders, there is
   my $str = sprintf("format string",...);

which works much like C except that the string is returned as a result 
instead of being the first argument.

And as we always say, TIMTOWTDI.
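For instance, applying either form to the column-list case discussed
upthread (hypothetical sketch):

    my @fields = ('plugin', 'slot_type', 'datoid');
    # direct interpolation with the @{[ ... ]} gadget
    my $q1 = "SELECT @{[ join(', ', @fields) ]} FROM pg_catalog.pg_replication_slots";
    # or indirect, with sprintf and a placeholder
    my $q2 = sprintf('SELECT %s FROM pg_catalog.pg_replication_slots',
        join(', ', @fields));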


cheers

andrew (japh)



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 23 December 2016 at 18:11, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 22 December 2016 at 14:21, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> changes-in-0001-v2.diff shows the changes to PostgresNode.pm per
> Michael's comments, and applies on top of 0001.
>
> I also attach a patch to add a new CREATE_REPLICATION_SLOT option per
> Petr's suggestion, so you can request a slot be created
> WITHOUT_SNAPSHOT. This replaces the patch series's behaviour of
> silently suppressing snapshot export when a slot was created on a
> replica. It'll conflict (easily resolved) if applied on top of the
> current series.

OK, patch series updated.

0001 incorporates the changes requested by Michael Paquier. Simon
expressed his intention to commit this after updates, in the separate
thread for

The pg_lsn patch is gone; I worked around it using the server to work with LSNs.

0002 (endpos) is unchanged.

0003 is new, some minimal tests for pg_recvlogical. It can be squashed
with 0002 (pg_recvlogical --endpos) if desired.

0004 (pg_recvlogical wrapper) is unchanged.

0005 (new streaming rep tests) is updated for the changes in 0001,
otherwise unchanged. Simon said he wanted to commit this soon.

0006 (timeline following) is unchanged except for updates to be
compatible with 0001.

0007 is the optional snapshot export requested by Petr.

0008 is unchanged.

0009 is unchanged except for updates vs 0001 and use of the
WITHOUT_SNAPSHOT option added in 0007.





-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 4 January 2017 at 12:08, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> 0001 incorporates the changes requested by Michael Paquier. Simon
> expressed his intention to commit this after updates, in the separate
> thread [...]

...

> 0005 (new streaming rep tests) is updated for the changes in 0001,
> otherwise unchanged. Simon said he wanted to commit this soon.
>
> 0006 (timeline following) is unchanged except for updates to be
> compatible with 0001.
>
> 0007 is the optional snapshot export requested by Petr.
>
> 0008 is unchanged.
>
> 0009 is unchanged except for updates vs 0001 and use of the
> WITHOUT_SNAPSHOT option added in 0007.

Oh, note that it's possible to commit 0001 then 0005, skipping over
2..4. I should probably have ordered them that way.

That's particularly relevant to you Simon as you expressed a wish to
commit the new streaming rep tests.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 4 January 2017 at 12:15, Craig Ringer <craig@2ndquadrant.com> wrote:

> That's particularly relevant to you Simon as you expressed a wish to
> commit the new streaming rep tests.

Patches 0001 and 0005 in this series also posted as
https://www.postgresql.org/message-id/CAMsr+YHxTMrY1woH_m4bEF3f5+kxX_T=sDuyXf4d2-+e-56iFg@mail.gmail.com
, since they're really prerequisites, not part of decoding on standby
as such. I'll post a new series with them removed once committed.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 4 January 2017 at 16:19, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 4 January 2017 at 12:15, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> That's particularly relevant to you Simon as you expressed a wish to
>> commit the new streaming rep tests.

Simon committed 1, 2, 3 and 5:

* Extra PostgresNode methods
* pg_recvlogical --endpos
* Tests for pg_recvlogical
* Expand streaming replication tests to cover hot standby

so here's a rebased series on top of master. No other changes.

The first patch to add a pg_recvlogical wrapper to PostgresNode is
really only needed to test the rest of the patches, so it can be
committed together with them.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:

> so here's a rebased series on top of master. No other changes.

Now with actual patches.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Michael Paquier
Date:
On Thu, Jan 5, 2017 at 10:21 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> so here's a rebased series on top of master. No other changes.
>
> Now with actual patches.

Looking at the PostgresNode code in 0001...
+=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos,
timeout_secs, ...)
+
This format is incorrect. I think that you also need to fix the
brackets for the do{} and the eval{} blocks.

+    push @cmd, '--endpos', $endpos if ($endpos);
endpos should be made a mandatory argument. If it is not defined, the
test calling this routine would get stuck.
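i.e. every caller would supply it explicitly, roughly like this (a
sketch only; the trailing plugin options are assumed from the pod
signature above):

    my $out = $node->pg_recvlogical_upto('postgres', 'test_slot', $endpos, 30,
        'include-xids' => 0, 'skip-empty-xacts' => 1);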
-- 
Michael



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 5 January 2017 at 13:12, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Jan 5, 2017 at 10:21 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>>> so here's a rebased series on top of master. No other changes.
>>
>> Now with actual patches.
>
> Looking at the PostgresNode code in 0001...
> +=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos,
> timeout_secs, ...)
> +
> This format is incorrect. I think that you also need to fix the
> brackets for the do{} and the eval{} blocks.
>
> +    push @cmd, '--endpos', $endpos if ($endpos);
> endpos should be made a mandatory argument. If it is not defined, the
> test calling this routine would get stuck.
> --
> Michael

Acknowledged and agreed. I'll fix both in the next revision.  I'm
currently working on hot standby replies, but will return to this
ASAP.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Michael Paquier
Date:
On Fri, Jan 6, 2017 at 1:07 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 5 January 2017 at 13:12, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Thu, Jan 5, 2017 at 10:21 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:
>>>
>>>> so here's a rebased series on top of master. No other changes.
>>>
>>> Now with actual patches.
>>
>> Looking at the PostgresNode code in 0001...
>> +=pod $node->pg_recvlogical_upto(self, dbname, slot_name, endpos,
>> timeout_secs, ...)
>> +
>> This format is incorrect. I think that you also need to fix the
>> brackets for the do{} and the eval{} blocks.
>>
>> +    push @cmd, '--endpos', $endpos if ($endpos);
>> endpos should be made a mandatory argument. If it is not defined, the
>> test calling this routine would get stuck.
>
> Acknowledged and agreed. I'll fix both in the next revision.  I'm
> currently working on hot standby replies, but will return to this
> ASAP.

By the way, be sure to also fix the =pod blocks for the new
routines; perldoc needs both =pod and =item to be used.
-- 
Michael



Re: [HACKERS] Logical decoding on standby

From
Thom Brown
Date:
On 5 January 2017 at 01:21, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> so here's a rebased series on top of master. No other changes.
>
> Now with actual patches.

Patch 5 no longer applies:

patching file src/include/pgstat.h
Hunk #1 FAILED at 745.
1 out of 1 hunk FAILED -- saving rejects to file src/include/pgstat.h.rej

--- src/include/pgstat.h
+++ src/include/pgstat.h
@@ -745,7 +745,8 @@ typedef enum
        WAIT_EVENT_SYSLOGGER_MAIN,
        WAIT_EVENT_WAL_RECEIVER_MAIN,
        WAIT_EVENT_WAL_SENDER_MAIN,
-       WAIT_EVENT_WAL_WRITER_MAIN
+       WAIT_EVENT_WAL_WRITER_MAIN,
+       WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
 } WaitEventActivity;

 /* ----------

Could you rebase?

Thanks

Thom



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
Rebased series attached, on top of current master (which includes
logical replication).

I'm inclined to think I should split out a few of the changes from
0005, roughly along the lines of the bullet points in its commit
message. Anyone feel strongly about how granular this should be?

This patch series is a prerequisite for supporting logical
replication using a physical standby as a source, but does not itself
enable logical replication from a physical standby.



On 23 January 2017 at 23:03, Thom Brown <thom@linux.com> wrote:
> On 5 January 2017 at 01:21, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 5 January 2017 at 09:19, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>>> so here's a rebased series on top of master. No other changes.
>>
>> Now with actual patches.
>
> Patch 5 no longer applies:
>
> patching file src/include/pgstat.h
> Hunk #1 FAILED at 745.
> 1 out of 1 hunk FAILED -- saving rejects to file src/include/pgstat.h.rej
>
> --- src/include/pgstat.h
> +++ src/include/pgstat.h
> @@ -745,7 +745,8 @@ typedef enum
>         WAIT_EVENT_SYSLOGGER_MAIN,
>         WAIT_EVENT_WAL_RECEIVER_MAIN,
>         WAIT_EVENT_WAL_SENDER_MAIN,
> -       WAIT_EVENT_WAL_WRITER_MAIN
> +       WAIT_EVENT_WAL_WRITER_MAIN,
> +       WAIT_EVENT_STANDBY_LOGICAL_SLOT_CREATE
>  } WaitEventActivity;
>
>  /* ----------
>
> Could you rebase?
>
> Thanks
>
> Thom



-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Michael Paquier
Date:
On Tue, Jan 24, 2017 at 7:37 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Rebased series attached, on top of current master (which includes
> logical replication).
>
> I'm inclined to think I should split out a few of the changes from
> 0005, roughly along the lines of the bullet points in its commit
> message. Anyone feel strongly about how granular this should be?
>
> This patch series is a prerequisite for supporting logical
> replication using a physical standby as a source, but does not itself
> enable logical replication from a physical standby.

There are patches but no reviews yet so moved to CF 2017-03.
-- 
Michael



Re: [HACKERS] Logical decoding on standby

From
Simon Riggs
Date:
On 24 January 2017 at 06:37, Craig Ringer <craig@2ndquadrant.com> wrote:
> Rebased series attached, on top of current master (which includes
> logical replication).
>
> I'm inclined to think I should split out a few of the changes from
> 0005, roughly along the lines of the bullet points in its commit
> message. Anyone feel strongly about how granular this should be?
>
> This patch series is a prerequisite for supporting logical
> replication using a physical standby as a source, but does not itself
> enable logical replication from a physical standby.

Patch 4 committed. Few others need rebase.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

> Patch 4 committed. Few others need rebase.

Since this patch series and initial data copy for logical replication
both add a facility for suppressing initial snapshot export on a
logical slot, I've dropped patch 0003 (make snapshot export on logical
slot creation) in favour of Petr's similar patch.

I will duplicate it in this patch series for ease of application. (The
version here is slightly extended over Petr's so I'll re-post the
modified version on the logical replication initial data copy thread
too).

The main thing I want to direct attention to for Simon, as committer,
is the xlog'ing of VACUUM's xid threshold before we advance it and
start removing tuples. This is necessary for the standby to know
whether a given replication slot is safe to use and fail with conflict
with recovery if it is not, or if it becomes unsafe due to master
vacuum activity. Note that we can _not_ use the various vacuum records
for this because we don't know which are catalogs and which aren't;
we'd have to add a separate "is catalog" field to each vacuum xlog
record, which is undesirable. The downstream can't look up whether
it's a catalog or not because it doesn't have relcache/syscache access
during decoding.

This change might look a bit similar to the vac_truncate_clog change
in the txid_status patch, but it isn't really related. The txid_status
change is about knowing when we can safely look up xids in clog and
preventing a race with clog truncation. This change is about knowing
when we can know all catalog tuples for a given xid will still be in
the heap, not vacuumed away. Both are about making sure standbys know
more about the state of the system in a low-cost way, though.

WaitForMasterCatalogXminReservation(...) in logical.c is also worth
looking more closely at.



-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 13 March 2017 at 10:56, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
>
>> Patch 4 committed. Few others need rebase.
>
> Since this patch series and initial data copy for logical replication
> both add a facility for suppressing initial snapshot export on a
> logical slot, I've dropped patch 0003 (make snapshot export on logical
> slot creation optional) in favour of Petr's similar patch.
>
> I will duplicate it in this patch series for ease of application. (The
> version here is slightly extended over Petr's so I'll re-post the
> modified version on the logical replication initial data copy thread
> too).
>
> The main thing I want to direct attention to for Simon, as committer,
> is the xlog'ing of VACUUM's xid threshold before we advance it and
> start removing tuples. This is necessary for the standby to know
> whether a given replication slot is safe to use and fail with conflict
> with recovery if it is not, or if it becomes unsafe due to master
> vacuum activity. Note that we can _not_ use the various vacuum records
> for this because we don't know which are catalogs and which aren't;
> we'd have to add a separate "is catalog" field to each vacuum xlog
> record, which is undesirable. The downstream can't look up whether
> it's a catalog or not because it doesn't have relcache/syscache access
> during decoding.
>
> This change might look a bit similar to the vac_truncate_clog change
> in the txid_status patch, but it isn't really related. The txid_status
> change is about knowing when we can safely look up xids in clog and
> preventing a race with clog truncation. This change is about knowing
> when we can know all catalog tuples for a given xid will still be in
> the heap, not vacuumed away. Both are about making sure standbys know
> more about the state of the system in a low-cost way, though.
>
> WaitForMasterCatalogXminReservation(...) in logical.c is also worth
> looking more closely at.

I should also note that because the TAP tests currently take a long
time, I recommend skipping the tests for this patch by default and
running them only when actually touching logical decoding.

I'm looking at ways to make them faster, but they're inevitably going
to take a while until we can get hot standby feedback replies in
place, including cascading support. Which I have as WIP, but won't
make this release.

Changing the test import to
    use Test::More skip_all => "disabled by default, too slow";

will be sufficient.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Simon Riggs
Date:
On 13 March 2017 at 10:56, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 7 March 2017 at 21:08, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
>
>> Patch 4 committed. Few others need rebase.
>
> Since this patch series

Patch 1 fails since the feature has already been applied. If there's
another reason, let me know.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 19 March 2017 at 18:02, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

> Patch 1 fails since the feature has already been applied. If there's
> another reason, let me know.

Nope, that's fine.

Rebased attached.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Petr Jelinek
Date:
Hi,

I don't know how well I can review 0001 (the TAP infra patch), but it
looks okay to me.

I don't really have any complaints about 0002 either. I like that it's
more or less one self-contained function and there are no weird ifdefs
anymore like in the 9.6 version (btw your commit message talks about 9.5 but
it was 9.6). I also like the clever test :)

I am slightly worried about the impact of the readTimeLineHistory() call,
but I think it should be called so rarely that it should not matter.

That brings us to the big patch 0003.

I still don't like the "New in 10.0" comments in documentation, for one
it's 10, not 10.0 and mainly we don't generally write stuff like this to
documentation, that's what release notes are for.

There are large amounts of whitespace mess (empty lines containing only
whitespace, spaces at the ends of lines); nothing horrible, but it
should be cleaned up.

One thing I don't understand much is the wal_level change and turning
off catalogXmin tracking. I don't really see anywhere that the
catalogXmin would be reset in the control file, for example. There is a
TODO in UpdateOldestCatalogXmin() that seems related, but tbh I don't
follow what's happening there - the comment talks about not doing
anything, but the code inside the if block is only Asserts.

--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Simon Riggs
Date:
On 19 March 2017 at 21:12, Craig Ringer <craig@2ndquadrant.com> wrote:

> Rebased attached.

Patch 1 looks good to go. I'll correct a spelling mistake in the TAP
test when I commit it later today.

Patch 2 has a couple of points:

2.1 Why does call to ReplicationSlotAcquire() move earlier in
pg_logical_slot_get_changes_guts()?

2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really
documented well.
The setting
  sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
should be
  sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);

but that doesn't cause failure because in read_local_xlog_page() we
say that we are reading from history when
state->currTLI != ThisTimeLineID explicitly rather than use
sendTimeLineIsHistoric

So it looks like we could do with a few extra comments.
If you correct these I'll commit it tomorrow.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 20 March 2017 at 14:57, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:

> 2.1 Why does call to ReplicationSlotAcquire() move earlier in
> pg_logical_slot_get_changes_guts()?

That appears to be an oversight from an earlier version where it
looped over timelines in pg_logical_slot_get_changes_guts . Reverted.

> 2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really
> documented well.
> The setting
>   sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
> should be
>   sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);

Definitely wrong. Fixed.

> but that doesn't cause failure because in read_local_xlog_page() we
> say that we are reading from history when
> state->currTLI != ThisTimeLineID explicitly rather than use
> sendTimeLineIsHistoric

XLogRead(...), as called by logical_read_xlog_page, does test it. It's
part of the walsender-local log read callback. We don't hit
read_local_xlog_page at all when we're doing walsender based logical
decoding.

We have two parallel code paths for reading xlogs, one for walsender,
one for normal backends. The walsender one is glued together with a
bunch of globals that pass state "around" the xlogreader - we set it
up before calling into xlogreader, and then examine it when xlogreader
calls back into walsender.c with logical_read_xlog_page.

I really want to refactor that at some stage, getting rid of the use
of walsender globals for timeline state tracking and sharing more of
the xlog reading logic between walsender and normal backends. But
-ENOTIME, especially to do it as carefully as it must be done.

There are comments on read_local_xlog_page, logical_read_xlog_page
that mention this. Also XLogRead in
src/backend/access/transam/xlogutils.c  (which has the same name as
XLogRead in src/backend/replication/walsender.c). I have a draft for a
timeline following readme that would address some of this but don't
expect to be able to finish it off for this release cycle, and I'd
really rather clean it up instead.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 20 March 2017 at 17:03, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 20 March 2017 at 14:57, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
>
>> 2.1 Why does call to ReplicationSlotAcquire() move earlier in
>> pg_logical_slot_get_changes_guts()?
>
> That appears to be an oversight from an earlier version where it
> looped over timelines in pg_logical_slot_get_changes_guts . Reverted.
>
>> 2.2 sendTimeLineIsHistoric looks incorrect, and at least isn't really
>> documented well.
>> The setting
>>   sendTimeLineIsHistoric = state->currTLI == ThisTimeLineID;
>> should be
>>   sendTimeLineIsHistoric = (state->currTLI != ThisTimeLineID);
>
> Definitely wrong. Fixed.

Attached.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Andres Freund
Date:
Hi,

Have you checked how high the overhead of XLogReadDetermineTimeline is?
A non-local function call, especially into a different translation-unit
(no partial inlining), for every single page might end up being
noticeable.  That's fine in the cases it actually adds functionality,
but for a master streaming out data, that's not actually adding
anything.

Did you check whether you changes to read_local_xlog_page could cause
issues with twophase.c? Because that now also uses it.

Did you check whether ThisTimeLineID is actually always valid in the
processes logical decoding could run in?  IIRC it's not consistently
update during recovery in any process but the startup process.



On 2017-03-19 21:12:23 +0800, Craig Ringer wrote:
> From 2fa891a555ea4fb200d75b8c906c6b932699b463 Mon Sep 17 00:00:00 2001
> From: Craig Ringer <craig@2ndquadrant.com>
> Date: Thu, 1 Sep 2016 10:16:55 +0800
> Subject: [PATCH 2/3] Follow timeline switches in logical decoding

FWIW, the title doesn't really seem accurate to me.


> Logical slots cannot actually be created on a replica without use of
> the low-level C slot management APIs so this is mostly foundation work
> for subsequent changes to enable logical decoding on standbys.

Every time I read references to anything like this my blood starts to
boil.  I kind of regret not having plastered RecoveryInProgress() errors
all over this code.


> From 8854d44e2227b9d076b0a25a9c8b9df9270b2433 Mon Sep 17 00:00:00 2001
> From: Craig Ringer <craig@2ndquadrant.com>
> Date: Mon, 5 Sep 2016 15:30:53 +0800
> Subject: [PATCH 3/3] Logical decoding on standby
>
> * Make walsender aware of ProcSignal and recovery conflicts, make walsender
>   exit with recovery conflict on upstream drop database when it has an active
>   logical slot on that database.
> * Allow GetOldestXmin to omit catalog_xmin, be called already locked.

"be called already locked"?


> * Send catalog_xmin separately in hot_standby_feedback messages.
> * Store catalog_xmin separately on a physical slot if received in hot_standby_feedback

What does separate mean?


> * Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin,
>   requiring that xlog be emitted before vacuum can remove no longer needed catalogs, store
>   it in checkpoints, make vacuum and bgwriter advance it.

I can't parse that sentence.


> * Add a new recovery conflict type for conflict with catalog_xmin. Abort
>   in-progress logical decoding sessions with conflict with recovery where needed
>   catalog_xmin is too old

Are we retaining WAL for slots broken in that way?


> * Make extra efforts to reserve master's catalog_xmin during decoding startup
>   on standby.

What does that mean?


> * Remove checks preventing starting logical decoding on standby

To me that's too many different things in one commit.  A bunch of them
seem like it'd be good if they'd get independent buildfarm cycles too.




> diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
> index d7f65a5..36bbb98 100644
> --- a/src/backend/access/heap/rewriteheap.c
> +++ b/src/backend/access/heap/rewriteheap.c
> @@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
>      if (!state->rs_logical_rewrite)
>          return;
>
> -    ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
> +    /* Use the catalog_xmin being retained by vacuum */
> +    ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);

What does that comment mean? Vacuum isn't the only thing that prunes old
records.


> +/*
> + * Set the global oldest catalog_xmin used to determine when tuples
> + * may be removed from catalogs and user-catalogs accessible from logical
> + * decoding.
> + *
> + * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
> + * which ensures the update is properly written to xlog first.
> + */
> +void
> +SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
> +{
> +    Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));

Uh, that's long-ish.  And doesn't agree with the comment above
(s/startup process/process performing recovery/?).

This is a long enough list that I'd consider just dropping the assert.


> +    else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
> +    {
> +        xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
> +
> +        /*
> +         * Unless logical decoding is possible on this node, we don't care about
> +         * this record.
> +         */
> +        if (!XLogLogicalInfoActive() || max_replication_slots == 0)
> +            return;

Too many negatives for my taste, but whatever.


> +        /*
> +         * Apply the new catalog_xmin limit immediately. New decoding sessions
> +         * will refuse to start if their slot is past it, and old ones will
> +         * notice when we signal them with a recovery conflict. There's no
> +         * effect on the catalogs themselves yet, so it's safe for backends
> +         * with older catalog_xmins to still exist.
> +         *
> +         * We don't have to take ProcArrayLock since only the startup process
> +         * is allowed to change oldestCatalogXmin when we're in recovery.
> +         */
> +        SetOldestCatalogXmin(xlrec->new_catalog_xmin);

Which seems to rely on ResolveRecoveryConflictWithLogicalDecoding's
lwlock acquisition for barriers?


> +/*
> + * Record when we advance the catalog_xmin used for tuple removal
> + * so standbys find out before we remove catalog tuples they might
> + * need for logical decoding.
> + */
> +XLogRecPtr
> +XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
> +{
> +    XLogRecPtr ptr = InvalidXLogRecPtr;
> +
> +    if (XLogInsertAllowed())
> +    {
> +        xl_xact_catalog_xmin_advance xlrec;
> +
> +        xlrec.new_catalog_xmin = new_catalog_xmin;
> +
> +        XLogBeginInsert();
> +        XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
> +
> +        ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
> +    }

Huh, why is this test needed and ok?


> @@ -9449,6 +9456,16 @@ XLogReportParameters(void)
>              XLogFlush(recptr);
>          }
>
> +        /*
> +         * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
> +         * require oldestCatalogXmin in checkpoints and it no longer
> +         * makes sense, so update shmem and xlog the change. This will
> +         * get written out in the next checkpoint.
> +         */
> +        if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
> +            wal_level < WAL_LEVEL_LOGICAL)
> +            UpdateOldestCatalogXmin(true);

What if we crash before this happens?


> diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
> index ff633fa..2d16bf0 100644
> --- a/src/backend/commands/vacuum.c
> +++ b/src/backend/commands/vacuum.c
> @@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel,
>      MultiXactId safeMxactLimit;
>
>      /*
> +     * When logical decoding is enabled, we must write any advance of
> +     * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
> +     * This ensures that any standbys doing logical decoding can cancel
> +     * decoding sessions and invalidate slots if we remove tuples they
> +     * still need.
> +     */
> +    UpdateOldestCatalogXmin(false);

I'm on a first read-through of this, but it appears you don't do
anything similar in heap_page_prune()?  And we can't just start emitting
loads of additional records there, because it's called much more often...


>  /*
>   * Make sure the current settings & environment are capable of doing logical
>   * decoding.
> @@ -87,23 +95,53 @@ CheckLogicalDecodingRequirements(void)
>                  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                   errmsg("logical decoding requires a database connection")));
>
> -    /* ----
> -     * TODO: We got to change that someday soon...
> -     *
> -     * There's basically three things missing to allow this:
> -     * 1) We need to be able to correctly and quickly identify the timeline a
> -     *      LSN belongs to
> -     * 2) We need to force hot_standby_feedback to be enabled at all times so
> -     *      the primary cannot remove rows we need.
> -     * 3) support dropping replication slots referring to a database, in
> -     *      dbase_redo. There can't be any active ones due to HS recovery
> -     *      conflicts, so that should be relatively easy.
> -     * ----
> -     */
>      if (RecoveryInProgress())
> -        ereport(ERROR,
> -                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -               errmsg("logical decoding cannot be used while in recovery")));
> +    {
> +        bool walrcv_running, walrcv_has_slot;
> +
> +        SpinLockAcquire(&WalRcv->mutex);
> +        walrcv_running = WalRcv->pid != 0;
> +        walrcv_has_slot = WalRcv->slotname[0] != '\0';
> +        SpinLockRelease(&WalRcv->mutex);
> +
> +        /*
> +         * The walreceiver should be running when we try to create a slot. If
> +         * we're unlucky enough to catch the walreceiver just as it's
> +         * restarting after an error, well, the client can just retry. We don't
> +         * bother to sleep and re-check.
> +         */
> +        if (!walrcv_running)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                     errmsg("streaming replication is not active"),
> +                     errhint("Logical decoding on standby requires that streaming replication be configured and
> active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication errors in the
> logs.")));


That seems quite problematic. What if there's a momentaneous connection
failure?  This also has the issue that just because you checked that
walrcv_running at some point, doesn't guarantee anything by the time you
actually check.  Seems like life were easier if recovery.conf were
guc-ified already - checking for primary_conninfo/primary_slot_name etc
wouldn't have that issue (and can't be changed while running).

Usage of a slot doesn't actually guarantee much in cascaded setups, does
it?


> @@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
>       * xmin horizons by other backends, get the safe decoding xid, and inform
>       * the slot machinery about the new limit. Once that's done the
>       * ProcArrayLock can be released as the slot machinery now is
> -     * protecting against vacuum.
> +     * protecting against vacuum - if we're on the master. If we're running on
> +     * a replica, we have to wait until hot_standby_feedback locks in our
> +     * needed catalogs, per details on WaitForMasterCatalogXminReservation().
>       * ----
>       */
>      LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> @@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,
>
>      ReplicationSlotsComputeRequiredXmin(true);
>
> +    if (RecoveryInProgress())
> +        WaitForMasterCatalogXminReservation(slot);
> +
> +    Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
> +                                         slot->data.catalog_xmin));
> +
>      LWLockRelease(ProcArrayLock);

I think it's quite a bad idea to do a blocking operation like
WaitForMasterCatalogXminReservation while holding ProcArrayLock.


> +/*
> + * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
> + * if needed. Caller must hold exclusive ProcArrayLock, which this function will
> + * temporarily release while sleeping but will re-acquire.

Ah. I see. Hm :(.


> + * We're pretty much just hoping that, if someone else already has a
> + * catalog_xmin reservation affecting the master, it stays where we want it
> + * until our own hot_standby_feedback can pin it down.

Hm.

> + * When we're creating a slot on a standby we can't directly set the
> + * master's catalog_xmin; the catalog_xmin is set locally, then relayed
> + * over hot_standby_feedback. The master may remove the catalogs we
> + * asked to reserve between when we set a local catalog_xmin and when
> + * hs feedback makes that take effect on the master. We need a feedback
> + * reply mechanism here, where:
> + *
> + * - we tentatively reserve catalog_xmin locally

Will that already trigger recovery conflicts?


> + * - we wake the walreceiver by setting its latch
> + * - walreceiver sends hs_feedback
> + * - upstream walsender sends a new 'hs_feedback reply' message with
> + *   actual (xmin, catalog_xmin) reserved.
> + * - walreceiver sees reply and updates ShmemVariableCache or some other
> + *   handy bit of shmem with hs feedback reservations from reply

"or some other handy bit"?


> + * - we poll the reservations while we wait
> + * - we set our catalog_xmin to that value, which might be later if
> + *   we missed our requested reservation, or might be earlier if
> + *   someone else is holding down catalog_xmin on master. We got a hs
> + *   feedback reply so we know it's reserved.
> + *
> + * For cascading, the actual reservation will need to cascade up the
> + * chain by walsender setting its own walreceiver's latch in turn, etc.
> + *
> + * For now, we just set the local slot catalog_xmin and sleep until
> + * oldestCatalogXmin equals or passes our reservation. This is fine if we're
> + * the only decoding session, but it is vulnerable to races if slots on the
> + * master or other decoding sessions on other standbys connected to the same
> + * master exist. They might advance their reservation before our hs_feedback
> + * locks it down, allowing vacuum to remove tuples we need. So we might start
> + * decoding on our slot then error with a conflict with recovery when we see
> + * catalog_xmin advance.
> + */

I was about to list some of these issues.  That's a bit unsatisfying.


Pondering this for a bit, but I'm ~9h into a flight, so maybe not
tonight^Wthis morning^Wwhaaaa.


> +static void
> +WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
> +{


This comment seems to duplicate some of the function header
comment. Such duplication usually leads to either or both getting out of
date rather quickly.


Not commenting line-by-line on the code here, but I'm extremely doubtful
that this approach is stable enough, and that the effect of holding
ProcArrayLock exclusively over prolonged amounts of time is acceptable.

> +    ReplicationSlotsComputeRequiredXmin(true);
>
Why do we need this? The caller does it too, no?


> +    /* Tell the master what catalog_xmin we settled on */
> +    WalRcvForceReply();
> +
> +    /* Reset ps display if we changed it */
> +    if (new_status)
> +    {
> +        set_ps_display(new_status, false);
> +        pfree(new_status);
> +    }

We really shouldn't do stuff like this while holding ProcArrayLock.


> +/*
> + * Test to see if the active logical slot is usable.
> + */
> +static void
> +EnsureActiveLogicalSlotValid()
> +{

Missing (void).


> +/*
> + * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
> + * passed database oid. The caller should hold an exclusive lock on the database
> + * to ensure no replication slots on the database are in use.

Stuff like this really should be it's own commit.  It can trivially be
tested on its own, is useful on its own (just have DROP DATABASE do it),
...


> + * If we fail here we'll leave the in-memory state of replication slots
> + * inconsistent with its on-disk state, so we need to PANIC.

We worked quite hard to make it extremely unlikely for that to happen in
practice.  I also don't see why there should be any new PANICs in this
code.


> + * This routine isn't as efficient as it could be - but we don't drop databases
> + * often, especially databases with lots of slots.

That seems fine.


> +void
> +ReplicationSlotsDropDBSlots(Oid dboid)
> +{
> +    int            i;
> +
> +    if (max_replication_slots <= 0)
> +        return;
> +
> +    /*
> +     * We only need a shared lock here even though we activate slots,
> +     * because we have an exclusive lock on the database we're dropping
> +     * slots on and don't touch other databases' slots.
> +     */
> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

Hm? Acquiring a slot always only takes a shared lock, no?


I don't really see how "database is locked" guarantees enough for your
logic - it's already possible to drop slots from other databases, and
dropping a slot acquires it temporarily?


> +    for (i = 0; i < max_replication_slots; i++)
> +    {
> +        ReplicationSlot *s;
> +        NameData slotname;
> +        int active_pid;
> +
> +        s = &ReplicationSlotCtl->replication_slots[i];
> +
> +        /* cannot change while ReplicationSlotCtlLock is held */
> +        if (!s->in_use)
> +            continue;
> +
> +        /* only logical slots are database specific, skip */
> +        if (!SlotIsLogical(s))
> +            continue;
> +
> +        /* not our database, skip */
> +        if (s->data.database != dboid)
> +            continue;
> +
> +        /* Claim the slot, as if ReplicationSlotAcquire()ing */
> +        SpinLockAcquire(&s->mutex);
> +        strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
> +        NameStr(slotname)[NAMEDATALEN-1] = '\0';
> +        active_pid = s->active_pid;
> +        if (active_pid == 0)
> +        {
> +            MyReplicationSlot = s;
> +            s->active_pid = MyProcPid;
> +        }
> +        SpinLockRelease(&s->mutex);
> +
> +        /*
> +         * The caller should have an exclusive lock on the database so
> +         * we'll never have any in-use slots, but just in case...
> +         */
> +        if (active_pid)
> +            elog(PANIC, "replication slot %s is in use by pid %d",
> +                 NameStr(slotname), active_pid);

So, yea, this doesn't seem ok. Why don't we just ERROR out, instead of
PANICing? There seems to be absolutely no correctness reason for a PANIC
here?


> +        /*
> +         * To avoid largely duplicating ReplicationSlotDropAcquired() or
> +         * complicating it with already_locked flags for ProcArrayLock,
> +         * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
> +         * just release our ReplicationSlotControlLock to drop the slot.
> +         *
> +         * There's no race here: we acquired this slot, and no slot "behind"
> +         * our scan can be created or become active with our target dboid due
> +         * to our exclusive lock on the DB.
> +         */
> +        LWLockRelease(ReplicationSlotControlLock);
> +        ReplicationSlotDropAcquired();
> +        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

I don't see much problem with this, but I'd change the code so you
simply do a goto restart; if you released the slot.  Then there's a lot
less chance / complications around temporarily releasing
ReplicationSlotControlLock.


> +                         *
> +                         * If logical decoding information is enabled, we also
> +                         * send immediate hot standby feedback so as to reduce
> +                         * the delay before our needed catalogs are locked in.

"logical decoding information ... enabled" and "catalogs are locked in"
are a bit too imprecise descriptions for my taste.


> @@ -1175,8 +1181,8 @@ XLogWalRcvSendHSFeedback(bool immed)
>  {
>      TimestampTz now;
>      TransactionId nextXid;
> -    uint32        nextEpoch;
> -    TransactionId xmin;
> +    uint32        xmin_epoch, catalog_xmin_epoch;
> +    TransactionId xmin, catalog_xmin;
>      static TimestampTz sendTime = 0;
>      /* initially true so we always send at least one feedback message */
>      static bool master_has_standby_xmin = true;
> @@ -1221,29 +1227,57 @@ XLogWalRcvSendHSFeedback(bool immed)
>       * everything else has been checked.
>       */
>      if (hot_standby_feedback)
> -        xmin = GetOldestXmin(NULL, false);
> +    {
> +        /*
> +         * Usually GetOldestXmin() would include the catalog_xmin in its
> +         * calculations, but we don't want to hold upstream back from vacuuming
> +         * normal user table tuples just because they're within the
> +         * catalog_xmin horizon of logical replication slots on this standby.
> +         * Instead we report the catalog_xmin to the upstream separately.
> +         */

I again don't think it's good to refer to vacuum as it's not the only
thing that can remove tuple versions.


> +        xmin = GetOldestXmin(NULL,
> +                             false, /* don't ignore vacuum */
> +                             true /* ignore catalog xmin */);
> +
> +        /*
> +         * The catalog_Xmin reported by GetOldestXmin is the effective
> +         * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
> +         * records from the master. Sending it back to the master would be
> +         * circular and prevent its catalog_xmin ever advancing once set.
> +         * We should only send the catalog_xmin we actually need for slots.
> +         */
> +        ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);

Given that you don't have catalog_xmin set by GetOldestXmin that comment
is a bit misleading.



> @@ -1427,19 +1436,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
>          NormalTransactionIdPrecedes(replication_slot_xmin, result))
>          result = replication_slot_xmin;
>
> +    if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
> +    {
> +        /*
> +         * After locks have been released and defer_cleanup_age has been applied,
> +         * check whether we need to back up further to make logical decoding
> +         * safe. We need to do so if we're computing the global limit (rel =
> +         * NULL) or if the passed relation is a catalog relation of some kind.
> +         */
> +        if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
> +            NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
> +            result = replication_slot_catalog_xmin;
> +    }

The nesting of these checks, and the comments about them, are a bit
weird.

> +/*
> + * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
> + * to reflect an advance in procArray->replication_slot_catalog_xmin or
> + * it becoming newly set or unset.
> + *
> + */
> +static bool
> +CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
> +{
> +    return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
> +            || (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
> +}

Your lines are really long - pgindent (which you really should run) will
mangle this.  I think it'd be better to rephrase this.


> +/*
> + * If necessary, copy the current catalog_xmin needed by repliation slots to

Typo: repliation

> + * the effective catalog_xmin used for dead tuple removal.
> + *
> + * When logical decoding is enabled we write a WAL record before advancing the
> + * effective value so that standbys find out if catalog tuples they still need
> + * get removed, and can properly cancel decoding sessions and invalidate slots.
> + *
> + * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
> + * and need to clear the shmem state, since we want to bypass the wal_level
> + * check and force xlog writing.
> + */
> +void
> +UpdateOldestCatalogXmin(bool force)

I'm a bit confused by this function and variable name.  What does

+    TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
+                                      * is guaranteed to still exist */

mean?  I complained about the overall justification in the commit
already, but looking at this commit alone, the justification for this
part of the change is quite hard to understand.


> +{
> +    TransactionId vacuum_catalog_xmin;
> +    TransactionId slots_catalog_xmin;
> +
> +    /*
> +     * If we're not recording logical decoding information, catalog_xmin
> +     * must be unset and we don't need to do any work here.

If we don't need to do any work, shouldn't we return early?


> +    if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
> +    {
> +        XactLogCatalogXminUpdate(slots_catalog_xmin);
> +
> +        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> +        /*
> +         * A concurrent updater could've changed these values so we need to re-check
> +         * under ProcArrayLock before updating.
> +         */
> +        vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
> +        slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);

why are there volatile reads here?

> +        if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
> +            SetOldestCatalogXmin(slots_catalog_xmin);

Why don't we check force here, but above?


> @@ -2167,14 +2250,20 @@ GetOldestSafeDecodingTransactionId(void)
>      oldestSafeXid = ShmemVariableCache->nextXid;
>
>      /*
> -     * If there's already a slot pegging the xmin horizon, we can start with
> -     * that value, it's guaranteed to be safe since it's computed by this
> -     * routine initially and has been enforced since.
> +     * If there's already an effectiveCatalogXmin held down by vacuum
> +     * it's definitely safe to start there, and it can't advance
> +     * while we hold ProcArrayLock.

What does "held down by vacuum" mean?


>  /*
> + * Notify a logical decoding session that it conflicts with a
> + * newly set catalog_xmin from the master.
> + */
> +void
> +CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
> +{
> +    ProcArrayStruct *arrayP = procArray;
> +    int            index;
> +
> +    /*
> +     * We have to scan ProcArray to find the process and set a pending recovery
> +     * conflict even though we know the pid. At least we can get the BackendId
> +     * and avoid a ProcSignal scan later.
> +     *
> +     * The pid might've gone away, in which case we got the desired
> +     * outcome anyway.
> +     */
> +    LWLockAcquire(ProcArrayLock, LW_SHARED);
> +
> +    for (index = 0; index < arrayP->numProcs; index++)
> +    {
> +        int            pgprocno = arrayP->pgprocnos[index];
> +        volatile PGPROC *proc = &allProcs[pgprocno];
> +
> +        if (proc->pid == session_pid)
> +        {
> +            VirtualTransactionId procvxid;
> +
> +            GET_VXID_FROM_PGPROC(procvxid, *proc);
> +
> +            proc->recoveryConflictPending = true;
> +
> +            /*
> +             * Kill the pid if it's still here. If not, that's what we
> +             * wanted so ignore any errors.
> +             */
> +            (void) SendProcSignal(session_pid,
> +                PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
> +
> +            break;
> +        }
> +    }
> +
> +    LWLockRelease(ProcArrayLock);

Doesn't seem ok to do this while holding ProcArrayLock.


> +/*
> + * Scan to see if any clients are using replication slots that are below the
> + * new catalog_xmin threshold and signal them to terminate with a recovery
> + * conflict.
> + *
> + * We already applied the new catalog_xmin record and updated the shmem
> + * catalog_xmin state, so new clients that try to use a replication slot
> + * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
> + * don't have to guard against them here.
> + *
> + * Replay can only continue safely once every slot that needs the catalogs
> + * we're going to free for removal is gone. So if any conflicting sessions
> + * exist, wait for any standby conflict grace period then signal them to exit.
> + *
> + * The master might clear its reserved catalog_xmin if all upstream slots are
> + * removed or clear their feedback reservations, sending us
> + * InvalidTransactionId. If we're concurrently trying to create a new slot and
> + * reserve catalogs the InvalidXid reservation report might come in while we
> + * have a slot waiting for hs_feedback confirmation of its reservation. That
> + * would cause the waiting process to get canceled with a conflict with
> + * recovery here since its tentative reservation conflicts with the master's
> + * report of 'nothing reserved'. To allow it to continue to seek a startpoint
> + * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
> + * still looking for where to start. We'll sometimes notice a conflict but the
> + * slot will advance its catalog_xmin to a more recent nextXid and cease to
> + * conflict when we re-check. (The alternative is to track slots being created
> + * differently to slots actively decoding in shmem, which seems unnecessary. Or
> + * to separate the 'tentative catalog_xmin reservation' of a slot from its
> + * actual needed catalog_xmin.)
> + *
> + * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
> + * walsender-based logical decoding sessions won't have any virtualxid for much
> + * of their life and the end of their virtualxids doesn't mean the end of a
> + * potential conflict. It would also cancel too aggressively, since it cares
> + * about the backend's xmin and logical decoding only needs the catalog_xmin.
> + */

The use of "we" seems confusing here, because it's not the same process.

Generally I think your comments need to be edited a bit for brevity and
preciseness.


> +void
> +ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
> +{
> +    int i;
> +
> +    if (!InHotStandby)
> +        /* nobody can be actively using logical slots */
> +        return;
> +
> +    /* Already applied new limit, can't have replayed later one yet */
> +    Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
> +
> +    /*
> +     * Find the first conflicting active slot and wait for it to be free,
> +     * signalling it if necessary, then repeat until there are no more
> +     * conflicts.
> +     */
> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +    for (i = 0; i < max_replication_slots; i++)
> +    {

I'm pretty strongly against any code outside of slot.c doing this.


> @@ -2789,12 +2797,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>          Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
>
>          /*
> -         * All conflicts apart from database cause dynamic errors where the
> -         * command or transaction can be retried at a later point with some
> -         * potential for success. No need to reset this, since non-retryable
> -         * conflict errors are currently FATAL.
> +         * All conflicts apart from database and catalog_xmin cause dynamic
> +         * errors where the command or transaction can be retried at a later
> +         * point with some potential for success. No need to reset this, since
> +         * non-retryable conflict errors are currently FATAL.
>           */
> -        if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
> +        if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
> +            reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
>              RecoveryConflictRetryable = false;
>      }

Hm. Why is this a non-retryable error?


Ok, landing soon.  Gotta finish here.


0002 should be doable as a whole this release, I have severe doubts that
0003 as a whole has a chance for 10 - the code is in quite a raw shape,
there's a significant number of open ends.  I'd suggest breaking off bits
that are independently useful, and work on getting those committed.

- Andres



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 19 March 2017 at 22:12, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

> I am slightly worried about impact of the readTimeLineHistory() call but
> I think it should be called so little that it should not matter.

Pretty much my thinking too.

> That brings us to the big patch 0003.
>
> I still don't like the "New in 10.0" comments in documentation, for one
> it's 10, not 10.0 and mainly we don't generally write stuff like this to
> documentation, that's what release notes are for.

OK. Personally I think it's worthwhile for protocol docs, which are
more dev-focused. But I agree it's not consistent with the rest of the
docs, so removed.

(Frankly I wish we did this consistently throughout the Pg docs, too,
and it'd be much more user-friendly if we did, but that's just not
going to happen.)

> There is large amounts of whitespace mess (empty lines with only
> whitespace, spaces at the end of the lines), nothing horrible, but
> should be cleaned up.

Fixed.

> One thing I don't understand much is the wal_level change and turning
> off catalogXmin tracking. I don't really see anywhere that the
> catalogXmin would be reset in control file for example. There is TODO in
> UpdateOldestCatalogXmin() that seems related but tbh I don't follow
> what's happening there - comment says about not doing anything, but the
> code inside the if block are only Asserts.

UpdateOldestCatalogXmin(...) with force=true forces a
XactLogCatalogXminUpdate(...) call to write the new
procArray->replication_slot_catalog_xmin .

We call it with force=true from XLogReportParameters(...) when
wal_level has been lowered; see XLogReportParameters. This will write
out a xl_xact_catalog_xmin_advance with
procArray->replication_slot_catalog_xmin's value then update
ShmemVariableCache->oldestCatalogXmin in shmem.
ShmemVariableCache->oldestCatalogXmin will get written out in the next
checkpoint, which gets incorporated in the control file.

There is a problem though - StartupReplicationSlots() and
RestoreSlotFromDisk() don't care if catalog_xmin is set on a slot but
wal_level is < logical and will happily restore a logical slot, or a
physical slot with a catalog_xmin. So we can't actually assume that
procArray->replication_slot_catalog_xmin will be 0 if we're not
writing new logical WAL. This isn't a big deal, it just means we can't
short-circuit UpdateOldestCatalogXmin() calls if
!XLogLogicalInfoActive(). It also means the XLogReportParameters()
stuff can be removed since we don't care about wal_level for tracking
oldestCatalogXmin.

Fixed in updated patch.

I'm now reading over Andres's review.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> Have you checked how high the overhead of XLogReadDetermineTimeline is?
> A non-local function call, especially into a different translation-unit
> (no partial inlining), for every single page might end up being
> noticeable.  That's fine in the cases it actually adds functionality,
> but for a master streaming out data, that's not actually adding
> anything.

I don't anticipate any significant effect given the large amount of
indirection via decoding, reorder buffer, snapshot builder, output
plugin, etc that we already do and how much memory allocation gets
done ... but it's worth checking. I could always move the fast path
into a macro or inline function if it does turn out to make a
detectable difference.

One of the things I want to get to is refactoring all the xlog page
reading stuff into a single place, shared between walsender and normal
backends, to get rid of this confusing mess we currently have. The
only necessary difference is how we wait for new WAL, the rest should
be as common as possible allowing for xlogreader's needs. I
particularly want to get rid of the two identically named static
XLogRead functions. But all that probably means making timeline.c
FRONTEND friendly and it's way too intrusive to contemplate at this
stage.

> Did you check whether you changes to read_local_xlog_page could cause
> issues with twophase.c? Because that now also uses it.

Thanks, that's a helpful point. The commit in question is 978b2f65. I
didn't notice that it introduced XLogReader use in twophase.c, though
I should've realised given the discussion about fetching recent 2pc
info from xlog. I don't see any potential for issues at first glance,
but I'll go over it in more detail. The main concern is validity of
ThisTimeLineID, but since it doesn't run in recovery I don't see much
of a problem there. That also means it can afford to use the current
timeline-oblivious read_local_xlog_page safely.

TAP tests for 2pc were added by 3082098. I'll check to make sure they
have appropriate coverage for this.

> Did you check whether ThisTimeLineID is actually always valid in the
> processes logical decoding could run in?  IIRC it's not consistently
> update during recovery in any process but the startup process.

I share your concerns that it may not be well enough maintained.
Thank you for the reminder; that's been on my TODO and got lost when I
had to task-hop to other priorities.

I have some TAP tests to validate promotion that need finishing off.
My main concern is around live promotions, both promotion of standby
to master, and promotion of upstream master when streaming from a
cascaded replica.

[Will cover review of 0003 separately, next]

> 0002 should be doable as a whole this release, I have severe doubts that
> 0003 as a whole has a chance for 10 - the code is in quite a raw shape,
> there's a significant number of open ends.  I'd suggest breaking off bits
> that are independently useful, and work on getting those committed.

That would be my preference too.

I do not actually feel strongly about the need for logical decoding on
standby, and would in many ways prefer to defer it until we have
two-way hot standby feedback and the ability to have the master
confirm the actual catalog_xmin locked in to eliminate the current
race and ugly workaround for it. I'd rather have solid timeline
following in place now and bare-minimum failover capability.

I'm confident that the ability for xlogreader to follow timeline
switches will also be independently useful.

The parts I think are important for Pg10 are:

* Teach xlogreader to follow timeline switches
* Ability to create logical slots on replicas
* Ability to advance (via feedback or via SQL function) - no need to
actually decode and call output plugins at all.
* Ability to drop logical slots on replicas

That would be enough to provide minimal standby promotion without hideous hacks.

Unfortunately, slot creation on standby is probably the ugliest part
of the patch. It can be considerably simplified by imposing the rule
that the application must ensure catalog_xmin on the master is already
held down (with a replication slot) before creating a slot on the
standby, and it's the application's job to send feedback to the master
before any standbys it's keeping slots on. If the app fails to do so,
the slot on the downstream will become unusable and attempts to decode
changes from it will fail with conflict with recovery.

That'd get rid of a lot of the code including some of the ugliest
bits, since we'd no longer make any special effort with catalog_xmin
during slot creation. We're already pushing complexity onto apps for
this, after concluding that the transparent failover slots approach
wasn't the way forward, so I'm OK with that. Let the apps that want
logical decoding to support physical replica promotion pay most of the
cost.

I'd then like to revisit full decoding on standby later, once we have
2-way hot standby feedback, where the upstream can reply with
confirmation xmin is locked in, including cascading handling.

Getting there would mostly involve trimming this patch down, which is
nice. It would be necessary to add a SQL function and/or walsender
command to send feedback on a slot we're not currently replaying
changes from, but I see that as independently valuable and have wanted
it for a number of things already. We'd still have to decode (so we
found the right restart_lsn), but we'd suppress output plugin calls
entirely.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote:

>> Subject: [PATCH 2/3] Follow timeline switches in logical decoding
>
> FWIW, the title doesn't really seem accurate to me.

Yeah, it's not really at the logical decoding layer at all.

"Teach xlogreader to follow timeline switches" ?

>> Logical slots cannot actually be created on a replica without use of
>> the low-level C slot management APIs so this is mostly foundation work
>> for subsequent changes to enable logical decoding on standbys.
>
> Everytime I read references to anything like this my blood starts to
> boil.  I kind of regret not having plastered RecoveryInProgress() errors
> all over this code.

In fairness, I've been trying for multiple releases to get a "right"
way in. I have no intention of using such hacks, and only ever did so
for testing xlogreader timeline following without full logical
decoding on standby being available.

>> From 8854d44e2227b9d076b0a25a9c8b9df9270b2433 Mon Sep 17 00:00:00 2001
>> From: Craig Ringer <craig@2ndquadrant.com>
>> Date: Mon, 5 Sep 2016 15:30:53 +0800
>> Subject: [PATCH 3/3] Logical decoding on standby
>>
>> * Make walsender aware of ProcSignal and recovery conflicts, make walsender
>>   exit with recovery conflict on upstream drop database when it has an active
>>   logical slot on that database.
>> * Allow GetOldestXmin to omit catalog_xmin, be called already locked.
>
> "be called already locked"?

To be called with ProcArrayLock already held. But that's actually
outdated due to changes Petr requested earlier, thanks for noticing.

>> * Send catalog_xmin separately in hot_standby_feedback messages.
>> * Store catalog_xmin separately on a physical slot if received in hot_standby_feedback
>
> What does separate mean?

Currently, hot standby feedback sends effectively the
min(catalog_xmin, xmin) to the upstream, which in turn records that
either in the PGPROC entry or, if there's a slot in use, in the xmin
field on the slot.

So catalog_xmin on the standby gets promoted to xmin on the master's
physical slot. Lots of unnecessary bloat results.

This splits it up, so we can send catalog_xmin separately on the wire,
and store it on the physical replication slot as catalog_xmin, not
xmin.
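
On the wire that just means the feedback ('h') message grows two fields.
A sketch of the walreceiver side, consistent with the
XLogWalRcvSendHSFeedback() hunk quoted elsewhere in this thread (field
order illustrative, not a spec):

    resetStringInfo(&reply_message);
    pq_sendbyte(&reply_message, 'h');
    pq_sendint64(&reply_message, GetCurrentTimestamp());
    pq_sendint(&reply_message, xmin, 4);
    pq_sendint(&reply_message, xmin_epoch, 4);
    pq_sendint(&reply_message, catalog_xmin, 4);
    pq_sendint(&reply_message, catalog_xmin_epoch, 4);
    walrcv_send(wrconn, reply_message.data, reply_message.len);

The walsender side then stores catalog_xmin into the physical slot's
catalog_xmin field instead of folding it into xmin.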

>> * Separate the catalog_xmin used by vacuum from ProcArray's replication_slot_catalog_xmin,
>>   requiring that xlog be emitted before vacuum can remove no longer needed catalogs, store
>>   it in checkpoints, make vacuum and bgwriter advance it.
>
> I can't parse that sentence.

We now write an xlog record before allowing the catalog_xmin in
ProcArray replication_slot_catalog_xmin to advance and allow catalog
tuples to be removed. This is achieved by making vacuum use a separate
field in ShmemVariableCache, oldestCatalogXmin. When vacuum looks up
the new xmin from GetOldestXmin, it copies
ProcArray.replication_slot_catalog_xmin to
ShmemVariableCache.oldestCatalogXmin, writing an xlog record to ensure
we remember the new value and ensure standbys know about it.

This provides a guarantee to standbys that all catalog tuples >=
ShmemVariableCache.oldestCatalogXmin are protected from vacuum and
lets them discover when that threshold advances.

The reason we cannot use the xid field in existing vacuum xlog records
is that the downstream has no way to know if the xact affected
catalogs and therefore whether it should advance its idea of
catalog_xmin or not. It can't get a Relation for the affected
relfilenode because it can't use the relcache during redo. We'd have
to add a flag to every vacuum record indicating whether it affected
catalogs, which is not fun, and vacuum might not always even know. And
the standby would still need a way to keep track of the oldest valid
catalog_xmin across restart without the ability to write it to
checkpoints.

It's a lot simpler and cheaper to have the master do it.
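
The WAL overhead is tiny, too; the record is essentially just the new
threshold (sketch of the struct, consistent with the
XactLogCatalogXminUpdate() hunk quoted further down):

    typedef struct xl_xact_catalog_xmin_advance
    {
        TransactionId new_catalog_xmin;
    } xl_xact_catalog_xmin_advance;

    #define SizeOfXactCatalogXminAdvance \
        (offsetof(xl_xact_catalog_xmin_advance, new_catalog_xmin) + sizeof(TransactionId))

and it only gets emitted when the effective catalog_xmin actually needs
to move, not on every vacuum cycle.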

>> * Add a new recovery conflict type for conflict with catalog_xmin. Abort
>>   in-progress logical decoding sessions with conflict with recovery where needed
>>   catalog_xmin is too old
>
> Are we retaining WAL for slots broken in that way?

Yes, until the slot is dropped.

If I added a persistent flag on the slot to indicate that the slot is
invalid, then we could ignore it for purposes of WAL retention. It
seemed unnecessary at this stage.

>> * Make extra efforts to reserve master's catalog_xmin during decoding startup
>>   on standby.
>
> What does that mean?

WaitForMasterCatalogXminReservation(...)

I don't like it. At all. I'd rather have hot standby feedback replies
so we can know for sure when the master has locked in our feedback.
It's my most disliked part of this patch.

>> * Remove checks preventing starting logical decoding on standby
>
> To me that's too many different things in one commit.  A bunch of them
> seem like it'd be good if they'd get independent buildfarm cycles too.

I agree with you. I had them all separate before and was told that
there were too many patches. I also had fixes that spanned multiple
patches and were difficult to split up effectively.

I'd like to split it roughly along the lines of the bulleted items,
but I don't want to do it only to have someone else tell me to just
squash it again and waste all the work (again). I'll take the risk I
guess.

>> diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
>> index d7f65a5..36bbb98 100644
>> --- a/src/backend/access/heap/rewriteheap.c
>> +++ b/src/backend/access/heap/rewriteheap.c
>> @@ -812,7 +812,8 @@ logical_begin_heap_rewrite(RewriteState state)
>>       if (!state->rs_logical_rewrite)
>>               return;
>>
>> -     ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin);
>> +     /* Use the catalog_xmin being retained by vacuum */
>> +     ProcArrayGetReplicationSlotXmin(NULL, &logical_xmin, NULL);
>
> What does that comment mean? Vacuum isn't the only thing that prunes old
> records.

I mean to refer to ShmemVariableCache.oldestCatalogXmin, the effective
catalog xmin used for record removal, not
ProcArray.replication_slot_catalog_xmin, the pending catalog_xmin for
local slots.

i.e. use the catalog_xmin that we've recorded in WAL and promised to standbys.

I agree the comment is unclear. Not sure how to improve it without
making it overly long though.
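
Maybe something along these lines (suggestion only):

    /*
     * Use the effective catalog_xmin we have WAL-logged and promised to
     * standbys (oldestCatalogXmin), not the pending value in
     * replication_slot_catalog_xmin.
     */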

>> +/*
>> + * Set the global oldest catalog_xmin used to determine when tuples
>> + * may be removed from catalogs and user-catalogs accessible from logical
>> + * decoding.
>> + *
>> + * Only to be called from the startup process or by UpdateOldestCatalogXmin(),
>> + * which ensures the update is properly written to xlog first.
>> + */
>> +void
>> +SetOldestCatalogXmin(TransactionId oldestCatalogXmin)
>> +{
>> +     Assert(InRecovery || !IsUnderPostmaster || AmStartupProcess() || LWLockHeldByMe(ProcArrayLock));
>
> Uh, that's long-ish.  And doesn't agree with the comment above
> (s/startup process/process performing recovery/?).
>
> This is a long enough list that I'd consider just dropping the assert.

Fair enough.

>> +     else if (info == XLOG_XACT_CATALOG_XMIN_ADV)
>> +     {
>> +             xl_xact_catalog_xmin_advance *xlrec = (xl_xact_catalog_xmin_advance *) XLogRecGetData(record);
>> +
>> +             /*
>> +              * Unless logical decoding is possible on this node, we don't care about
>> +              * this record.
>> +              */
>> +             if (!XLogLogicalInfoActive() || max_replication_slots == 0)
>> +                     return;
>
> Too many negatives for my taste, but whatever.

Also removed in latest version, since it turns out not to be accurate.

I had made the incorrect assumption that our global catalog_xmin was
necessarily 0 when wal_level < logical. But this is not the case, per
the new TAP tests in latest patch. We can have logical slots from when
wal_level was logical still existing with a valid catalog_xmin after
we restart into wal_level = replica.

>> +             /*
>> +              * Apply the new catalog_xmin limit immediately. New decoding sessions
>> +              * will refuse to start if their slot is past it, and old ones will
>> +              * notice when we signal them with a recovery conflict. There's no
>> +              * effect on the catalogs themselves yet, so it's safe for backends
>> +              * with older catalog_xmins to still exist.
>> +              *
>> +              * We don't have to take ProcArrayLock since only the startup process
>> +              * is allowed to change oldestCatalogXmin when we're in recovery.
>> +              */
>> +             SetOldestCatalogXmin(xlrec->new_catalog_xmin);
>
> Which seems to rely on ResolveRecoveryConflictWithLogicalDecoding's
> lwlock acquisition for barriers?

I don't yet have a really solid grasp of memory ordering and barrier
issues in multiprocessing. As I understand it, processes created after
this point aren't going to see the old value, they'll fork() with a
current snapshot of memory, so either they'll see the new value or
they'll be captured by our
ResolveRecoveryConflictWithLogicalDecoding() run (assuming they don't
exit first).

New decoding sessions for existing backends would be an issue. They
call EnsureActiveLogicalSlotValid() which performs a volatile read on
ShmemVariableCache->oldestCatalogXmin . But that isn't sufficient, is
it? We need a write barrier in SetOldestCatalogXmin and a read barrier
in EnsureActiveLogicalSlotValid.

I'll fix that. Thanks very much.


>> +/*
>> + * Record when we advance the catalog_xmin used for tuple removal
>> + * so standbys find out before we remove catalog tuples they might
>> + * need for logical decoding.
>> + */
>> +XLogRecPtr
>> +XactLogCatalogXminUpdate(TransactionId new_catalog_xmin)
>> +{
>> +     XLogRecPtr ptr = InvalidXLogRecPtr;
>> +
>> +     if (XLogInsertAllowed())
>> +     {
>> +             xl_xact_catalog_xmin_advance xlrec;
>> +
>> +             xlrec.new_catalog_xmin = new_catalog_xmin;
>> +
>> +             XLogBeginInsert();
>> +             XLogRegisterData((char *) &xlrec, SizeOfXactCatalogXminAdvance);
>> +
>> +             ptr = XLogInsert(RM_XACT_ID, XLOG_XACT_CATALOG_XMIN_ADV);
>> +     }
>
> Huh, why is this test needed and ok?

Good point. It isn't anymore.

I previously had catalog_xmin advances on replicas running a similar
path and skipping xlog. But that was fragile. So now
UpdateOldestCatalogXmin() is only called from the master, per the
assertion at the start, so it's unnecessary to test for
XLogInsertAllowed( ) here.

Fixed.


>> @@ -9449,6 +9456,16 @@ XLogReportParameters(void)
>>                       XLogFlush(recptr);
>>               }
>>
>> +             /*
>> +              * If wal_level was lowered from WAL_LEVEL_LOGICAL we no longer
>> +              * require oldestCatalogXmin in checkpoints and it no longer
>> +              * makes sense, so update shmem and xlog the change. This will
>> +              * get written out in the next checkpoint.
>> +              */
>> +             if (ControlFile->wal_level >= WAL_LEVEL_LOGICAL &&
>> +                     wal_level < WAL_LEVEL_LOGICAL)
>> +                     UpdateOldestCatalogXmin(true);
>
> What if we crash before this happens?

We run XLogReportParameters before we set ControlFile->state =
DB_IN_PRODUCTION, so we'd re-run recovery and call it again next time
through.

But as it turns out the above is neither necessary nor correct anyway,
it relies on the invalid assumption that catalog_xmin is 0 when
wal_level is < logical. Per above, not the case, so we can't
short-circuit catalog_xmin logging tests when wal_level = replica.

>> diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
>> index ff633fa..2d16bf0 100644
>> --- a/src/backend/commands/vacuum.c
>> +++ b/src/backend/commands/vacuum.c
>> @@ -518,6 +518,15 @@ vacuum_set_xid_limits(Relation rel,
>>       MultiXactId safeMxactLimit;
>>
>>       /*
>> +      * When logical decoding is enabled, we must write any advance of
>> +      * catalog_xmin to xlog before we allow VACUUM to remove those tuples.
>> +      * This ensures that any standbys doing logical decoding can cancel
>> +      * decoding sessions and invalidate slots if we remove tuples they
>> +      * still need.
>> +      */
>> +     UpdateOldestCatalogXmin(false);
>
> I'm on a first read-through through this, but it appears you don't do
> anything similar in heap_page_prune()?  And we can't just start emitting
> loads of additional records there, because it's called much more often...

vacuum_set_xid_limits sets OldestXmin in lazy_vacuum_rel, which is the
OldestXmin passed to heap_page_prune.

I could Assert that heap_page_prune's OldestXmin PrecedesOrEquals
ShmemVariableCache->oldestCatalogXmin I guess. It seemed unnecessary.
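
Something like this, if we want the cross-check (sketch; it only makes
sense for relations visible to logical decoding, hence the guard):

    if (RelationIsAccessibleInLogicalDecoding(relation) &&
        TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin))
        Assert(TransactionIdPrecedesOrEquals(OldestXmin,
                                             ShmemVariableCache->oldestCatalogXmin));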


>> +             /*
>> +              * The walreceiver should be running when we try to create a slot. If
>> +              * we're unlucky enough to catch the walreceiver just as it's
>> +              * restarting after an error, well, the client can just retry. We don't
>> +              * bother to sleep and re-check.
>> +              */
>> +             if (!walrcv_running)
>> +                     ereport(ERROR,
>> +                                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +                                      errmsg("streaming replication is not active"),
>> +                                      errhint("Logical decoding on standby requires that streaming replication be
configuredand active. Ensure that primary_conninfo is correct in recovery.conf and check for streaming replication
errorsin the logs."))); 
>
>
> That seems quite problematic. What if there's a momentaneous connection
> failure?  This also has the issue that just because you checked that
> walrcv_running at some point, doesn't guarantee anything by the time you
> actually check.  Seems like life were easier if recovery.conf were
> guc-ified already - checking for primary_conninfo/primary_slot_name etc
> wouldn't have that issue (and can't be changed while running).

Yes, I very much wish walreceiver were already GUC-ified. I'd rather
test primary_conninfo and primary_slot_name .

> Usage of a slot doesn't actually guarantee much in cascased setups, does
> it?

It doesn't entirely eliminate the potential for a race with catalog
removal, but neither does hot_standby_feedback on a non-cascading
setup. Right now we tolerate that race and the risk that the slot may
become invalid. The application can prevent that by making sure it has
a slot on the master and the standby has caught up past the master's
lsn at the time of that slot's creation before it creates a slot on
the standby.

That's part of the reason for the hoop-jumping around catalog_xmin
advance: to make sure we know, for certain, whether it's safe to decode
from a slot given that we haven't been able to directly enforce our
xmin on the master.

To get rid of that race without application intervention, we need the
ability for a feedback message to flow up the cascade, and a reply
that specifically matches that feedback message (or at least that
individual downstream node) to flow down the cascade.

I'm working on just that, but there's no way it'll be ready for pg10
obviously, and it has some difficult issues. It's actually intended to
help prevent conflict with standby cancels shortly after hot_standby
starts up, but it'll help with slot creation too.

Even with all that, we'll still need some kind of xlog'd catalog_xmin
knowledge, because users can do silly things like drop the slot
connecting standby to master and re-create it, causing the standby's
needed catalog_xmin on the master to become un-pinned. We don't want
to risk messily crashing if that happens.

>> @@ -266,7 +306,9 @@ CreateInitDecodingContext(char *plugin,
>>        * xmin horizons by other backends, get the safe decoding xid, and inform
>>        * the slot machinery about the new limit. Once that's done the
>>        * ProcArrayLock can be released as the slot machinery now is
>> -      * protecting against vacuum.
>> +      * protecting against vacuum - if we're on the master. If we're running on
>> +      * a replica, we have to wait until hot_standby_feedback locks in our
>> +      * needed catalogs, per details on WaitForMasterCatalogXminReservation().
>>        * ----
>>        */
>>       LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>> @@ -276,6 +318,12 @@ CreateInitDecodingContext(char *plugin,
>>
>>       ReplicationSlotsComputeRequiredXmin(true);
>>
>> +     if (RecoveryInProgress())
>> +             WaitForMasterCatalogXminReservation(slot);
>> +
>> +     Assert(TransactionIdPrecedesOrEquals(ShmemVariableCache->oldestCatalogXmin,
>> +                                                                              slot->data.catalog_xmin));
>> +
>>       LWLockRelease(ProcArrayLock);
>
> I think it's quite a bad idea to do a blocking operation like
> WaitForMasterCatalogXminReservation while holding ProcArrayLock.
>
>
>> +/*
>> + * Wait until the master's catalog_xmin is set, advancing our catalog_xmin
>> + * if needed. Caller must hold exclusive ProcArrayLock, which this function will
>> + * temporarily release while sleeping but will re-acquire.
>
> Ah. I see. Hm :(.

Exactly.

I'm increasingly inclined to rip that out and make preventing races
with master catalog removal the application's problem. Create a slot
on the master first, or accept that you may have to retry.

>
>> + * When we're creating a slot on a standby we can't directly set the
>> + * master's catalog_xmin; the catalog_xmin is set locally, then relayed
>> + * over hot_standby_feedback. The master may remove the catalogs we
>> + * asked to reserve between when we set a local catalog_xmin and when
>> + * hs feedback makes that take effect on the master. We need a feedback
>> + * reply mechanism here, where:
>> + *
>> + * - we tentatively reserve catalog_xmin locally
>
> Will that already trigger recovery conflicts?

If we already have local active slots, we'll be using their existing
catalog_xmin and there's no issue.

If we don't already have local slots the only conflict potential is
this backend. It could potentially cause a conflict if we replayed a
greatly advanced catalog_xmin from the master before we got the chance
to advance our local one accordingly.

>> + * - we wake the walreceiver by setting its latch
>> + * - walreceiver sends hs_feedback
>> + * - upstream walsender sends a new 'hs_feedback reply' message with
>> + *   actual (xmin, catalog_xmin) reserved.
>> + * - walreceiver sees reply and updates ShmemVariableCache or some other
>> + *   handy bit of shmem with hs feedback reservations from reply
>
> "or some other handy bit"?

Ha.  Will fix.

>> + * - we poll the reservations while we wait
>> + * - we set our catalog_xmin to that value, which might be later if
>> + *   we missed our requested reservation, or might be earlier if
>> + *   someone else is holding down catalog_xmin on master. We got a hs
>> + *   feedback reply so we know it's reserved.
>> + *
>> + * For cascading, the actual reservation will need to cascade up the
>> + * chain by walsender setting its own walreceiver's latch in turn, etc.
>> + *
>> + * For now, we just set the local slot catalog_xmin and sleep until
>> + * oldestCatalogXmin equals or passes our reservation. This is fine if we're
>> + * the only decoding session, but it is vulnerable to races if slots on the
>> + * master or other decoding sessions on other standbys connected to the same
>> + * master exist. They might advance their reservation before our hs_feedback
>> + * locks it down, allowing vacuum to remove tuples we need. So we might start
>> + * decoding on our slot then error with a conflict with recovery when we see
>> + * catalog_xmin advance.
>> + */
>
> I was about to list some of these issues.  That's a bit unsatisfying.

I concur. I just don't have a better answer.

I think I'd like to rip it out and make it the application's problem
until we can do it right.

>
>
> Pondering this for a bit, but I'm ~9h into a flight, so maybe not
> tonight^Wthis morning^Wwhaaaa.
>
>
>> +static void
>> +WaitForMasterCatalogXminReservation(ReplicationSlot *slot)
>> +{
>
>
> This comment seems to duplicate some of the function header
> comment. Such duplication usually leads to either or both getting out of
> date rather quickly.
>
>
> Not commenting line-by-line on the code here, but I'm extremely doubtful
> that this approach is stable enough, and that the effect of holding
> ProcArrayLock exclusively over prolonged amounts of time is acceptable.
>
>> +     ReplicationSlotsComputeRequiredXmin(true);
>>
> Why do we need this? The caller does it too, no?

Because we force a walsender update immediately and want a current value.

It's cheap enough for running in slot creation.

>> +     /* Tell the master what catalog_xmin we settled on */
>> +     WalRcvForceReply();
>> +
>> +     /* Reset ps display if we changed it */
>> +     if (new_status)
>> +     {
>> +             set_ps_display(new_status, false);
>> +             pfree(new_status);
>> +     }
>
> We really shouldn't do stuff like this while holding ProcArrayLock.

Yeah, good point.

>> +/*
>> + * Test to see if the active logical slot is usable.
>> + */
>> +static void
>> +EnsureActiveLogicalSlotValid()
>> +{
>
> Missing (void).

Augh, C++ still has its tentacles in my brain.

>> +/*
>> + * ReplicationSlotsDropDBSlots -- Drop all db-specific slots relating to the
>> + * passed database oid. The caller should hold an exclusive lock on the database
>> + * to ensure no replication slots on the database are in use.
>
> Stuff like this really should be it's own commit.  It can trivially be
> tested on its own, is useful on its own (just have DROP DATABASE do it),

Agreed, will do.

>> + * If we fail here we'll leave the in-memory state of replication slots
>> + * inconsistent with its on-disk state, so we need to PANIC.
>
> We worked quite hard to make it extremely unlikely for that to happen in
> practice.  I also don't see why there should be any new PANICs in this
> code.

I didn't figure out a sensible way not to. I'll revisit that.

>> +void
>> +ReplicationSlotsDropDBSlots(Oid dboid)
>> +{
>> +     int                     i;
>> +
>> +     if (max_replication_slots <= 0)
>> +             return;
>> +
>> +     /*
>> +      * We only need a shared lock here even though we activate slots,
>> +      * because we have an exclusive lock on the database we're dropping
>> +      * slots on and don't touch other databases' slots.
>> +      */
>> +     LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
>
> Hm? Acquiring a slot always only takes a shared lock, no?
>
> I don't really see how "database is locked" guarantees enough for your
> logic - it's already possible to drop slots from other databases, and
> dropping a slot acquires it temporarily?

You can drop slots from other DBs.

Ugh. Right. That's a frustrating oversight. I'll have to revisit that logic.

>
>> +     for (i = 0; i < max_replication_slots; i++)
>> +     {
>> +             ReplicationSlot *s;
>> +             NameData slotname;
>> +             int active_pid;
>> +
>> +             s = &ReplicationSlotCtl->replication_slots[i];
>> +
>> +             /* cannot change while ReplicationSlotCtlLock is held */
>> +             if (!s->in_use)
>> +                     continue;
>> +
>> +             /* only logical slots are database specific, skip */
>> +             if (!SlotIsLogical(s))
>> +                     continue;
>> +
>> +             /* not our database, skip */
>> +             if (s->data.database != dboid)
>> +                     continue;
>> +
>> +             /* Claim the slot, as if ReplicationSlotAcquire()ing */
>> +             SpinLockAcquire(&s->mutex);
>> +             strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
>> +             NameStr(slotname)[NAMEDATALEN-1] = '\0';
>> +             active_pid = s->active_pid;
>> +             if (active_pid == 0)
>> +             {
>> +                     MyReplicationSlot = s;
>> +                     s->active_pid = MyProcPid;
>> +             }
>> +             SpinLockRelease(&s->mutex);
>> +
>> +             /*
>> +              * The caller should have an exclusive lock on the database so
>> +              * we'll never have any in-use slots, but just in case...
>> +              */
>> +             if (active_pid)
>> +                     elog(PANIC, "replication slot %s is in use by pid %d",
>> +                              NameStr(slotname), active_pid);
>
> So, yea, this doesn't seem ok. Why don't we just ERROR out, instead of
> PANICing? There seems to be absolutely no correctness reason for a PANIC
> here?

We've gone to claim the slot but found it still marked active by another
backend, so something broke. But you're right, PANIC is an over-reaction.

>> +             /*
>> +              * To avoid largely duplicating ReplicationSlotDropAcquired() or
>> +              * complicating it with already_locked flags for ProcArrayLock,
>> +              * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
>> +              * just release our ReplicationSlotControlLock to drop the slot.
>> +              *
>> +              * There's no race here: we acquired this slot, and no slot "behind"
>> +              * our scan can be created or become active with our target dboid due
>> +              * to our exclusive lock on the DB.
>> +              */
>> +             LWLockRelease(ReplicationSlotControlLock);
>> +             ReplicationSlotDropAcquired();
>> +             LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
>
> I don't see much problem with this, but I'd change the code so you
> simply do a goto restart; if you released the slot.  Then there's a lot
> less chance / complications around temporarily releasing
> ReplicationSlotControlLock.

Good idea.
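
Something like this, then (sketch of the reworked loop body, not the
final code):

    restart:
        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
        for (i = 0; i < max_replication_slots; i++)
        {
            ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
            bool        claimed = false;

            if (!s->in_use || !SlotIsLogical(s) || s->data.database != dboid)
                continue;

            /* claim the slot, as ReplicationSlotAcquire() would */
            SpinLockAcquire(&s->mutex);
            if (s->active_pid == 0)
            {
                MyReplicationSlot = s;
                s->active_pid = MyProcPid;
                claimed = true;
            }
            SpinLockRelease(&s->mutex);

            if (!claimed)
                continue;       /* still claimed by someone else; handling TBD per review */

            /* drop it outside the control lock, then rescan from the top */
            LWLockRelease(ReplicationSlotControlLock);
            ReplicationSlotDropAcquired();
            goto restart;
        }
        LWLockRelease(ReplicationSlotControlLock);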

>> +                                              *
>> +                                              * If logical decoding information is enabled, we also
>> +                                              * send immediate hot standby feedback so as to reduce
>> +                                              * the delay before our needed catalogs are locked in.
>
> "logical decoding information ... enabled"

XLogLogicalInfoActive()

> and "catalogs are locked in"

Yeah, fair.

> are a bit too imprecise descriptions for my taste.

Will adjust.


>
>
>> +             xmin = GetOldestXmin(NULL,
>> +                                                      false, /* don't ignore vacuum */
>> +                                                      true /* ignore catalog xmin */);
>> +
>> +             /*
>> +              * The catalog_Xmin reported by GetOldestXmin is the effective
>> +              * catalog_xmin used by vacuum, as set by xl_xact_catalog_xmin_advance
>> +              * records from the master. Sending it back to the master would be
>> +              * circular and prevent its catalog_xmin ever advancing once set.
>> +              * We should only send the catalog_xmin we actually need for slots.
>> +              */
>> +             ProcArrayGetReplicationSlotXmin(NULL, NULL, &catalog_xmin);
>
> Given that you don't have catalog_xmin set by GetOldestXmin that comment
> is a bit misleading.

It is. Too many revisions with too much time between them. Fixing.

>> @@ -1427,19 +1436,93 @@ GetOldestXmin(Relation rel, bool ignoreVacuum)
>>               NormalTransactionIdPrecedes(replication_slot_xmin, result))
>>               result = replication_slot_xmin;
>>
>> +     if (!ignoreCatalogXmin && (rel == NULL || RelationIsAccessibleInLogicalDecoding(rel)))
>> +     {
>> +             /*
>> +              * After locks have been released and defer_cleanup_age has been applied,
>> +              * check whether we need to back up further to make logical decoding
>> +              * safe. We need to do so if we're computing the global limit (rel =
>> +              * NULL) or if the passed relation is a catalog relation of some kind.
>> +              */
>> +             if (TransactionIdIsValid(replication_slot_catalog_xmin) &&
>> +                     NormalTransactionIdPrecedes(replication_slot_catalog_xmin, result))
>> +                     result = replication_slot_catalog_xmin;
>> +     }
>
> The nesting of these checks, and the comments about them, is a bit
> weird.

Agree. I didn't find it readable as a single test, and it wasn't clear
how to comment on just the inner part without splitting it up. But it's
easy enough to merge; I just found the merged form less readable and
harder to understand.

>> +/*
>> + * Return true if ShmemVariableCache->oldestCatalogXmin needs to be updated
>> + * to reflect an advance in procArray->replication_slot_catalog_xmin or
>> + * it becoming newly set or unset.
>> + *
>> + */
>> +static bool
>> +CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin, TransactionId slots_catalog_xmin)
>> +{
>> +     return (TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin)
>> +                     || (TransactionIdIsValid(vacuum_catalog_xmin) != TransactionIdIsValid(slots_catalog_xmin)));
>> +}
>
> Your lines are really long - pgindent (which you really should run) will
> mangle this.  I think it'd be better to rephrase this.

Thanks. Will.

IIRC pgindent created a LOT of unrelated noise at the time I was
working on it, but I'll recheck.
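
As for the rephrasing, perhaps something like (sketch only, same logic):

    static bool
    CatalogXminNeedsUpdate(TransactionId vacuum_catalog_xmin,
                           TransactionId slots_catalog_xmin)
    {
        /* newly set, or newly cleared? */
        if (TransactionIdIsValid(vacuum_catalog_xmin) !=
            TransactionIdIsValid(slots_catalog_xmin))
            return true;

        /* otherwise, only if the slots' requirement has advanced */
        return TransactionIdPrecedes(vacuum_catalog_xmin, slots_catalog_xmin);
    }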

>
>
>> +/*
>> + * If necessary, copy the current catalog_xmin needed by repliation slots to
>
> Typo: repliation

Thanks.

>> + * the effective catalog_xmin used for dead tuple removal.
>> + *
>> + * When logical decoding is enabled we write a WAL record before advancing the
>> + * effective value so that standbys find out if catalog tuples they still need
>> + * get removed, and can properly cancel decoding sessions and invalidate slots.
>> + *
>> + * The 'force' option is used when we're turning WAL_LEVEL_LOGICAL off
>> + * and need to clear the shmem state, since we want to bypass the wal_level
>> + * check and force xlog writing.
>> + */
>> +void
>> +UpdateOldestCatalogXmin(bool force)
>
> I'm a bit confused by this function and variable name.  What does
>
> +       TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
> +                                                                         * is guaranteed to still exist */
>
> mean?  I complained about the overall justification in the commit
> already, but looking at this commit alone, the justification for this
> part of the change is quite hard to understand.

Standbys have no way to know what catalog row versions are guaranteed to exist.

They know, from vacuum xlog records, when we remove row versions,
index entries, etc associated with a transaction. But the standby has
no way to know if the affected relation is a catalog or not, it only
knows the relfilenode. So it can't maintain a local notion of
"effective global catalog_xmin on the master as of the last xlog
record I replayed".

I could add is_catalog flags to all the vacuum xlog records via a
secondary struct that's only added when wal_level = logical, but that
seems pretty awful and likely to be very noisy. It also wouldn't help
the standby know, at startup, what the current catalog_xmin of the
master is since it won't be in a checkpoint or the control file.

>
>
>> +{
>> +     TransactionId vacuum_catalog_xmin;
>> +     TransactionId slots_catalog_xmin;
>> +
>> +     /*
>> +      * If we're not recording logical decoding information, catalog_xmin
>> +      * must be unset and we don't need to do any work here.
>
> If we don't need to do any work, shouldn't we return early?

Yes.

>> +     if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin) || force)
>> +     {
>> +             XactLogCatalogXminUpdate(slots_catalog_xmin);
>> +
>> +             LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>> +             /*
>> +              * A concurrent updater could've changed these values so we need to re-check
>> +              * under ProcArrayLock before updating.
>> +              */
>> +             vacuum_catalog_xmin = *((volatile TransactionId*)&ShmemVariableCache->oldestCatalogXmin);
>> +             slots_catalog_xmin = *((volatile TransactionId*)&procArray->replication_slot_catalog_xmin);
>
> why are there volatile reads here?

Because I didn't understand volatile well enough. It's not a memory
barrier and provides no guarantee that we're seeing recent values.

It should probably just take ProcArrayLock.
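
Since that re-check already runs with ProcArrayLock held exclusively,
plain reads should be sufficient there (sketch):

    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    vacuum_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
    slots_catalog_xmin = procArray->replication_slot_catalog_xmin;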

>> +             if (CatalogXminNeedsUpdate(vacuum_catalog_xmin, slots_catalog_xmin))
>> +                     SetOldestCatalogXmin(slots_catalog_xmin);
>
> Why don't we check force here, but above?

Good point.

I've removed force anyway, in the latest revision. Same reason as
given above re the StartupXLOG parameters check stuff.

>> @@ -2167,14 +2250,20 @@ GetOldestSafeDecodingTransactionId(void)
>>       oldestSafeXid = ShmemVariableCache->nextXid;
>>
>>       /*
>> -      * If there's already a slot pegging the xmin horizon, we can start with
>> -      * that value, it's guaranteed to be safe since it's computed by this
>> -      * routine initially and has been enforced since.
>> +      * If there's already an effectiveCatalogXmin held down by vacuum
>> +      * it's definitely safe to start there, and it can't advance
>> +      * while we hold ProcArrayLock.
>
> What does "held down by vacuum" mean?

Brain fart. Held down by an existing slot. Comment also needs rewording.

>>  /*
>> + * Notify a logical decoding session that it conflicts with a
>> + * newly set catalog_xmin from the master.
>> + */
>> +void
>> +CancelLogicalDecodingSessionWithRecoveryConflict(pid_t session_pid)
>> +{
>> +     ProcArrayStruct *arrayP = procArray;
>> +     int                     index;
>> +
>> +     /*
>> +      * We have to scan ProcArray to find the process and set a pending recovery
>> +      * conflict even though we know the pid. At least we can get the BackendId
>> +      * and avoid a ProcSignal scan later.
>> +      *
>> +      * The pid might've gone away, in which case we got the desired
>> +      * outcome anyway.
>> +      */
>> +     LWLockAcquire(ProcArrayLock, LW_SHARED);
>> +
>> +     for (index = 0; index < arrayP->numProcs; index++)
>> +     {
>> +             int                     pgprocno = arrayP->pgprocnos[index];
>> +             volatile PGPROC *proc = &allProcs[pgprocno];
>> +
>> +             if (proc->pid == session_pid)
>> +             {
>> +                     VirtualTransactionId procvxid;
>> +
>> +                     GET_VXID_FROM_PGPROC(procvxid, *proc);
>> +
>> +                     proc->recoveryConflictPending = true;
>> +
>> +                     /*
>> +                      * Kill the pid if it's still here. If not, that's what we
>> +                      * wanted so ignore any errors.
>> +                      */
>> +                     (void) SendProcSignal(session_pid,
>> +                             PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN, procvxid.backendId);
>> +
>> +                     break;
>> +             }
>> +     }
>> +
>> +     LWLockRelease(ProcArrayLock);
>
> Doesn't seem ok to do this while holding ProcArrayLock.

Fair enough. And I guess it's safe enough to take and release it,
since new processes that start won't be at risk of cancellation so we
don't care about whether or not we scan them.
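
Roughly (sketch): remember what we need while holding the lock, release
it, then signal:

    bool        found = false;
    VirtualTransactionId procvxid;

    LWLockAcquire(ProcArrayLock, LW_SHARED);
    for (index = 0; index < arrayP->numProcs; index++)
    {
        PGPROC     *proc = &allProcs[arrayP->pgprocnos[index]];

        if (proc->pid == session_pid)
        {
            GET_VXID_FROM_PGPROC(procvxid, *proc);
            proc->recoveryConflictPending = true;
            found = true;
            break;
        }
    }
    LWLockRelease(ProcArrayLock);

    if (found)
        (void) SendProcSignal(session_pid,
                              PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN,
                              procvxid.backendId);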



>> +/*
>> + * Scan to see if any clients are using replication slots that are below the
>> + * new catalog_xmin threshold and signal them to terminate with a recovery
>> + * conflict.
>> + *
>> + * We already applied the new catalog_xmin record and updated the shmem
>> + * catalog_xmin state, so new clients that try to use a replication slot
>> + * whose on-disk catalog_xmin is below the new threshold will ERROR, and we
>> + * don't have to guard against them here.
>> + *
>> + * Replay can only continue safely once every slot that needs the catalogs
>> + * we're going to free for removal is gone. So if any conflicting sessions
>> + * exist, wait for any standby conflict grace period then signal them to exit.
>> + *
>> + * The master might clear its reserved catalog_xmin if all upstream slots are
>> + * removed or clear their feedback reservations, sending us
>> + * InvalidTransactionId. If we're concurrently trying to create a new slot and
>> + * reserve catalogs the InvalidXid reservation report might come in while we
>> + * have a slot waiting for hs_feedback confirmation of its reservation. That
>> + * would cause the waiting process to get canceled with a conflict with
>> + * recovery here since its tentative reservation conflicts with the master's
>> + * report of 'nothing reserved'. To allow it to continue to seek a startpoint
>> + * we ignore slots whose catalog_xmin is >= nextXid, indicating that they're
>> + * still looking for where to start. We'll sometimes notice a conflict but the
>> + * slot will advance its catalog_xmin to a more recent nextXid and cease to
>> + * conflict when we re-check. (The alternative is to track slots being created
>> + * differently to slots actively decoding in shmem, which seems unnecessary. Or
>> + * to separate the 'tentative catalog_xmin reservation' of a slot from its
>> + * actual needed catalog_xmin.)
>> + *
>> + * We can't use ResolveRecoveryConflictWithVirtualXIDs() here because
>> + * walsender-based logical decoding sessions won't have any virtualxid for much
>> + * of their life and the end of their virtualxids doesn't mean the end of a
>> + * potential conflict. It would also cancel too aggressively, since it cares
>> + * about the backend's xmin and logical decoding only needs the catalog_xmin.
>> + */
>
> The use of "we" seems confusing here, because it's not the same process.
>
> Generally I think your comments need to be edited a bit for brevity and
> preciseness.

Will work on it.

Me, verbose? Really?

>> +void
>> +ResolveRecoveryConflictWithLogicalDecoding(TransactionId new_catalog_xmin)
>> +{
>> +     int i;
>> +
>> +     if (!InHotStandby)
>> +             /* nobody can be actively using logical slots */
>> +             return;
>> +
>> +     /* Already applied new limit, can't have replayed later one yet */
>> +     Assert(ShmemVariableCache->oldestCatalogXmin == new_catalog_xmin);
>> +
>> +     /*
>> +      * Find the first conflicting active slot and wait for it to be free,
>> +      * signalling it if necessary, then repeat until there are no more
>> +      * conflicts.
>> +      */
>> +     LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
>> +     for (i = 0; i < max_replication_slots; i++)
>> +     {
>
> I'm pretty strongly against any code outside of slot.c doing this.

IIRC I originally tried to do that as part of slot.c but found that it
resulted in other ugliness relating to access to other structures. But
I can't remember what anymore, so I'll revisit it.
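
If it works out, the standby.c side would then just call a slot.c entry
point, something like (hypothetical name and signature, for illustration
only):

    /* in slot.c: scan slots and cancel sessions below the new threshold */
    extern void ReplicationSlotsCancelConflicting(TransactionId new_catalog_xmin);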

>
>
>> @@ -2789,12 +2797,13 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
>>               Assert(RecoveryConflictPending && (QueryCancelPending || ProcDiePending));
>>
>>               /*
>> -              * All conflicts apart from database cause dynamic errors where the
>> -              * command or transaction can be retried at a later point with some
>> -              * potential for success. No need to reset this, since non-retryable
>> -              * conflict errors are currently FATAL.
>> +              * All conflicts apart from database and catalog_xmin cause dynamic
>> +              * errors where the command or transaction can be retried at a later
>> +              * point with some potential for success. No need to reset this, since
>> +              * non-retryable conflict errors are currently FATAL.
>>                */
>> -             if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE)
>> +             if (reason == PROCSIG_RECOVERY_CONFLICT_DATABASE ||
>> +                     reason == PROCSIG_RECOVERY_CONFLICT_CATALOG_XMIN)
>>                       RecoveryConflictRetryable = false;
>>       }
>
> Hm. Why is this a non-retryable error?

The global catalog_xmin isn't going to go backwards, so if the slot
needs a given catalog_xmin and we want to discard it....

... then we should give it a while to catch up. Right. It should be retryable.


> Ok, landing soon.  Gotta finish here.

Greatly appreciated, I know it's not the nicest to review.
>
> 0002 should be doable as a whole this release, I have severe doubts that
> 0003 as a whole has a chance for 10 - the code is in quite a raw shape,
> there's a significant number of open ends.  I'd suggest breaking of bits
> that are independently useful, and work on getting those committed.

I'll be doing that, yes.

I really want some way to create slots on replicas, advance them to
follow the master's position, and have them able to be used after
promotion to master.

I don't think actually live decoding on replica is ready yet, though
I'd find the ability to shift decoding workloads to replicas rather
nice when it is ready.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Simon Riggs
Date:
On 21 March 2017 at 02:21, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote:
>
>>> Subject: [PATCH 2/3] Follow timeline switches in logical decoding
>>
>> FWIW, the title doesn't really seem accurate to me.
>
> Yeah, it's not really at the logical decoding layer at all.
>
> "Teach xlogreader to follow timeline switches" ?

Happy with that. I think Craig has addressed Andres' issues with this
patch, so I will apply later today as planned using that name.

The larger Logical Decoding on Standby patch will not be applied yet, and not
without further changes, per review.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 21 March 2017 at 09:05, Craig Ringer <craig@2ndquadrant.com> wrote:

> Thanks, that's a helpful point. The commit in question is 978b2f65. I
> didn't notice that it introduced XLogReader use in twophase.c, though
> I should've realised given the discussion about fetching recent 2pc
> info from xlog. I don't see any potential for issues at first glance,
> but I'll go over it in more detail. The main concern is validity of
> ThisTimeLineID, but since it doesn't run in recovery I don't see much
> of a problem there. That also means it can afford to use the current
> timeline-oblivious read_local_xlog_page safely.
>
> TAP tests for 2pc were added by 3082098. I'll check to make sure they
> have appropriate coverage for this.

The TAP tests pass fine, and I can't see any likely issues either.

XLogReader for 2PC doesn't happen on standby, and RecoveryInProgress()
will update ThisTimeLineID on promotion.

>> Did you check whether ThisTimeLineID is actually always valid in the
>> processes logical decoding could run in?  IIRC it's not consistently
>> update during recovery in any process but the startup process.
>
> I share your concerns that it may not be well enough maintained.
> Thank you for the reminder, that's been on my TODO and got lost when I
> had to task-hop to other priorities.

The main place we maintain ThisTimeLineID (outside StartupXLOG of
course) is in walsender's GetStandbyFlushRecPtr, which calls
GetWalRcvWriteRecPtr. That's not used in walsender's logical decoding
or in the SQL interface.

I've changed the order of operations in read_local_xlog_page to ensure
that RecoveryInProgress() updates ThisTimeLineID if we're promoted,
and made it update ThisTimeLineID from GetXLogReplayRecPtr otherwise.

pg_logical_slot_get_changes_guts was fine already.

Because xlog read callbacks must not attempt to read pages past the
flush limit (master) or replay limit (standby), it doesn't matter if
ThisTimeLineID is completely up to date, only that it's valid as-of
that LSN.
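
For concreteness, the read-limit / timeline choice described above ends up
looking roughly like this (sketch only, not the patch's exact code;
RecoveryInProgress, GetFlushRecPtr and GetXLogReplayRecPtr are the existing
backend calls):

static XLogRecPtr
determine_read_upto(void)
{
    XLogRecPtr  read_upto;

    if (!RecoveryInProgress())
    {
        /*
         * Not (or no longer) in recovery.  RecoveryInProgress() updates
         * ThisTimeLineID as a side effect when it first notices promotion,
         * so call it before trusting the timeline.
         */
        read_upto = GetFlushRecPtr();
    }
    else
    {
        /*
         * Still in recovery: never read past what has been replayed, and
         * take the timeline of the replay position rather than a possibly
         * stale ThisTimeLineID.
         */
        read_upto = GetXLogReplayRecPtr(&ThisTimeLineID);
    }

    return read_upto;
}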

I did identify one problem. The startup process renames the last
segment in a timeline to .partial when it processes a timeline switch.
See xlog.c:7597. So if we have the following order of operations:

* Update ThisTimeLineID to 2 at latest redo ptr
* XLogReadDetermineTimeline chooses timeline 2 to read from
* startup process replays timeline switch to TL 3 and renames last
segment in old timeline to .partial
* XLogRead() tries to open segment with TL 2

we'll fail. I don't think it matters much though. We're not actually
supporting streaming decoding from standby this release by the looks,
and even if we did the effect would be limited to an ERROR and a
reconnect. It doesn't look like there's really any sort of lock or
other synchronisation we can rely on to prevent this, and we should
probably just live with it. If we have already opened the segment
we'll just keep reading from it without noticing the rename; if we
haven't and are switching to it just as it's renamed we'll ERROR when
we try to open it.

I had cascading and promotion tests in progress for decoding on
standby, but doubt there's much point finishing them off now that it's
not likely that decoding on standby can be added for this CF.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
Hi all

Updated timeline following patch attached.

There's a change in read_local_xlog_page to ensure we maintain
ThisTimeLineID properly, otherwise it's just comment changes.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 22 March 2017 at 10:51, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> Updated timeline following patch attached.
>
> There's a change in read_local_xlog_page to ensure we maintain
> ThisTimeLineID properly, otherwise it's just comment changes.

OK, so we're looking OK with the TL following.

I'm splitting up the rest of the decoding on standby patch set with
the goal of getting minimal functionality for creating and managing
slots on standbys in, so we can maintain slots on standbys and use
them when the standby is promoted to master.

The first, to send catalog_xmin separately to the global xmin on
hot_standby_feedback and store it in the upstream physical slot's
catalog_xmin, is attached.

These are extracted directly from the logical decoding on standby
patch, with comments by Petr and Andres made re the relevant code
addressed.

I will next be working on a bare-minimum facility for creating and
advancing logical slots on a replica without support for buffering
changes, creating historic snapshots or invoking output plugin. The
slots will become usable after the replica is promoted. They'll track
their own restart_lsn, etc, and will keep track of xids so they can
manage their catalog_xmin, so there'll be no need for dodgy slot
syncing from the master, but they'll avoid most of the complex and
messy bits. The application will be expected to make sure a slot on
the master exists and is advanced before the corresponding slot on the
replica to protect required catalogs.

Then if there's agreement that it's the right way forward I can add
the catalog_xmin xlog'ing stuff as the next patch.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Andres Freund
Date:
Hi,

On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:
> > 0002 should be doable as a whole this release, I have severe doubts that
> > 0003 as a whole has a chance for 10 - the code is in quite a raw shape,
> > there's a significant number of open ends.  I'd suggest breaking of bits
> > that are independently useful, and work on getting those committed.
> 
> That would be my preference too.


> The parts I think are important for Pg10 are:

> * Ability to create logical slots on replicas

Doesn't this also imply recovery conflicts on DROP DATABASE?  Besides,
allowing to drop all slots using a database upon DROP DATABASE, is a
useful thing on its own.

But I have to admit, I've *severe* doubts about getting the whole
infrastructure for slot creation on replica into 10.  The work is far
from ready, and we're mere days away from freeze.


> * Ability to advance (via feedback or via SQL function) - no need to
> actually decode and call output plugins at all

That pretty much requires decoding, otherwise you really don't know how
much WAL you have to retain.


> * Ability to drop logical slots on replicas

That shouldn't actually require any changes, no?

Greetings,

Andres Freund



Re: [HACKERS] Logical decoding on standby

From
Simon Riggs
Date:
On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:

> But I have to admit, I've *severe* doubts about getting the whole
> infrastructure for slot creation on replica into 10.  The work is far
> from ready, and we're mere days away from freeze.

If Craig has to guess what would be acceptable, then it's not long enough.

It would be better if you could outline a specific approach so he can
code it. Coding takes about a day for most things, since Craig knows
the code and what we're trying to achieve.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Andres Freund
Date:
On 2017-03-22 14:58:29 +0000, Simon Riggs wrote:
> On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:
> 
> > But I have to admit, I've *severe* doubts about getting the whole
> > infrastructure for slot creation on replica into 10.  The work is far
> > from ready, and we're mere days away from freeze.
> 
> If Craig has to guess what would be acceptable, then it's not long enough.

I don't know what you're on about with that statement.  I've spent a
good chunk of time looking at the 0003 patch, even though it's large and
contains a lot of different things.  I suggested splitting things up. I
even suggested what to move earlier after Craig agreed with that
sentiment, in the mail you're replying to, because it seems
independently doable.


> It would be better if you could outline a specific approach so he can
> code it. Coding takes about a day for most things, since Craig knows
> the code and what we're trying to achieve.

I find that fairly unconvincing. What we have here is a patch that isn't
close to being ready, contains a lot of complicated pieces, a couple
days before freeze.  If we can pull useful pieces out: great.  But it's
too late for major new development.

Greetings,

Andres Freund



Re: [HACKERS] Logical decoding on standby

From
Simon Riggs
Date:
On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:

>> The parts I think are important for Pg10 are:
>
>> * Ability to create logical slots on replicas
>
> Doesn't this also imply recovery conflicts on DROP DATABASE?

Not needed until the slot is in use, which is a later patch.

> Besides,
> allowing to drop all slots using a database upon DROP DATABASE, is a
> useful thing on its own.

Sure but that's a separate feature unrelated to this patch and we're
running out of time.

>> * Ability to advance (via feedback or via SQL function) - no need to
>> actually decode and call output plugins at all
>
> That pretty much requires decoding, otherwise you really don't know how
> much WAL you have to retain.

Knowing how much WAL to retain is key.

Why would decoding tell you how much WAL to retain?

We tried to implement this automatically from the master, which was
rejected. So the only other way is manually. We need one or the other.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Simon Riggs
Date:
On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote:

> I'm splitting up the rest of the decoding on standby patch set with
> the goal of getting minimal functionality for creating and managing
> slots on standbys in, so we can maintain slots on standbys and use
> them when the standby is promoted to master.
>
> The first, to send catalog_xmin separately to the global xmin on
> hot_standby_feedback and store it in the upstream physical slot's
> catalog_xmin, is attached.
>
> These are extracted directly from the logical decoding on standby
> patch, with comments by Petr and Andres made re the relevant code
> addressed.

I've reduced your two patches back to one with a smaller blast radius.

I'll commit this tomorrow morning, barring objections.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Andres Freund
Date:
On 2017-03-22 15:59:42 +0000, Simon Riggs wrote:
> On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:
> 
> >> The parts I think are important for Pg10 are:
> >
> >> * Ability to create logical slots on replicas
> >
> > Doesn't this also imply recovery conflicts on DROP DATABASE?
> 
> Not needed until the slot is in use, which is a later patch.

Hm? We need to drop slots, if they can exist / be created, on a standby,
and they're on a dropped database.  Otherwise we'll reserve resources,
while anyone connecting to the slot will likely just receive errors
because the database doesn't exist anymore.  It's also one of the
patches that can quite easily be developed / reviewed, because there
really isn't anything complicated about it.  Most of the code is already
in Craig's patch, it just needs some adjustments.


> > Besides,
> > allowing to drop all slots using a database upon DROP DATABASE, is a
> > useful thing on its own.
> 
> Sure but that's a separate feature unrelated to this patch and we're
> running out of time.

Hm? The patch implemented it.


> >> * Ability to advance (via feedback or via SQL function) - no need to
> >> actually decode and call output plugins at all
> >
> > That pretty much requires decoding, otherwise you really don't know how
> > much WAL you have to retain.
> 
> Knowing how much WAL to retain is key.
> 
> Why would decoding tell you how much WAL to retain?

Because decoding already has the necessary logic?  (You need to retain
enough WAL to restart decoding for all in-progress transactions etc).


> We tried to implement this automatically from the master, which was
> rejected. So the only other way is manually. We need one or the other.

I think the current approach is roughly the right way - but that doesn't
make the patch ready.



Greetings,

Andres Freund



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 22 March 2017 at 21:06, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:
>> > 0002 should be doable as a whole this release, I have severe doubts that
>> > 0003 as a whole has a chance for 10 - the code is in quite a raw shape,
>> > there's a significant number of open ends.  I'd suggest breaking of bits
>> > that are independently useful, and work on getting those committed.
>>
>> That would be my preference too.
>
>
>> The parts I think are important for Pg10 are:
>
>> * Ability to create logical slots on replicas
>
> Doesn't this also imply recovery conflicts on DROP DATABASE?  Besides,
> allowing to drop all slots using a database upon DROP DATABASE, is a
> useful thing on its own.

Definitely beneficial, otherwise recovery will stop until you drop
slots, which isn't ideal.

>> * Ability to advance (via feedback or via SQL function) - no need to
>> actually decode and call output plugins at all
>
> That pretty much requires decoding, otherwise you really don't know how
> much WAL you have to retain.

Yes, and to update restart_lsn and catalog_xmin correctly.

I was thinking that by disallowing snapshot use and output plugin
invocation we'd avoid the need to support cancellation on recovery
conflicts, etc, simplifying things considerably.

>> * Ability to drop logical slots on replicas
>
> That shouldn't actually require any changes, no?

It doesn't, it works as-is. I have NFI why I wrote that.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Andres Freund
Date:
On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:
> On 22 March 2017 at 21:06, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> > On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:
> >> > 0002 should be doable as a whole this release, I have severe doubts that
> >> > 0003 as a whole has a chance for 10 - the code is in quite a raw shape,
> >> > there's a significant number of open ends.  I'd suggest breaking of bits
> >> > that are independently useful, and work on getting those committed.
> >>
> >> That would be my preference too.
> >
> >
> >> The parts I think are important for Pg10 are:
> >
> >> * Ability to create logical slots on replicas
> >
> > Doesn't this also imply recovery conflicts on DROP DATABASE?  Besides,
> > allowing to drop all slots using a database upon DROP DATABASE, is a
> > useful thing on its own.
> 
> Definitely beneficial, otherwise recovery will stop until you drop
> slots, which isn't ideal.

s/isn't ideal/not acceptable/ ;)


> >> * Ability to advance (via feedback or via SQL function) - no need to
> >> actually decode and call output plugins at all
> >
> > That pretty much requires decoding, otherwise you really don't know how
> > much WAL you have to retain.
> 
> Yes, and to update restart_lsn and catalog_xmin correctly.

> I was thinking that by disallowing snapshot use and output plugin
> invocation we'd avoid the need to support cancellation on recovery
> conflicts, etc, simplifying things considerably.

That seems like it'd end up being pretty hacky - the likelihood that
we'd run into snapbuild error cross-checks seems very high.

Greetings,

Andres Freund



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 23 March 2017 at 00:17, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-22 15:59:42 +0000, Simon Riggs wrote:
>> On 22 March 2017 at 13:06, Andres Freund <andres@anarazel.de> wrote:
>>
>> >> The parts I think are important for Pg10 are:
>> >
>> >> * Ability to create logical slots on replicas
>> >
>> > Doesn't this also imply recovery conflicts on DROP DATABASE?
>>
>> Not needed until the slot is in use, which is a later patch.
>
> Hm? We need to drop slots, if they can exist / be created, on a standby,
> and they're on a dropped database.  Otherwise we'll reserve resources,
> while anyone connecting to the slot will likely just receive errors
> because the database doesn't exist anymore.  It's also one of the
> patches that can quite easily be developed / reviewed, because there
> really isn't anything complicated about it.  Most of the code is already
> in Craig's patch, it just needs some adjustments.

Right, I'm not too concerned about doing that, and it's next on my
TODO as I clean up the split patch series.

>> >> * Ability to advance (via feedback or via SQL function) - no need to
>> >> actually decode and call output plugins at all
>> >
>> > That pretty much requires decoding, otherwise you really don't know how
>> > much WAL you have to retain.
>>
>> Knowing how much WAL to retain is key.
>>
>> Why would decoding tell you how much WAL to retain?
>
> Because decoding already has the necessary logic?  (You need to retain
> enough WAL to restart decoding for all in-progress transactions etc).

Indeed; after all, standby status updates from the decoding client
only contain the *flushed* LSN. The downstream doesn't know the
restartpoint LSN, it must be tracked by the upstream.

It's also necessary to maintain our catalog_xmin correctly on the
standby so we can send it via hot_standby_feedback to a physical
replication slot used on the master, ensuring the master doesn't
remove catalog tuples we may still need.
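
To make the split concrete, here's a rough sketch of the feedback side
(GetOldestXminForFeedback and SendHSFeedbackMessage are hypothetical
stand-ins; ProcArrayGetReplicationSlotXmin is the existing procarray call):

static void
send_hs_feedback_sketch(void)
{
    TransactionId xmin;          /* backends' xmin horizon on this standby */
    TransactionId slot_xmin;     /* data xmin still needed by local slots */
    TransactionId catalog_xmin;  /* catalog xmin still needed by local slots */

    xmin = GetOldestXminForFeedback();          /* hypothetical wrapper */
    ProcArrayGetReplicationSlotXmin(&slot_xmin, &catalog_xmin);

    /* Fold the slots' data xmin into the ordinary xmin, as before ... */
    if (TransactionIdIsValid(slot_xmin) &&
        TransactionIdPrecedes(slot_xmin, xmin))
        xmin = slot_xmin;

    /*
     * ... but report catalog_xmin as its own field.  The upstream walsender
     * can then store it in the physical slot's catalog_xmin instead of
     * holding the whole xmin horizon down for it.
     */
    SendHSFeedbackMessage(xmin, catalog_xmin);  /* hypothetical */
}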

> I don't know what you're on about with that statement.  I've spent a
> good chunk of time looking at the 0003 patch, even though it's large
> and contains a lot of different things.  I suggested splitting things up.
> I even suggested what to move earlier after Craig agreed with that
> sentiment, in the mail you're replying to, because it seems
> independently doable.

I really appreciate the review, as I'm all too aware of how time
consuming it can be.

From my PoV, the difficulty I'm in is that this patch series has
languished for most of the Pg 10 release cycle with no real input from
stakeholders in the logical decoding area, so while the review is
important, the fact that it's only happening now means that it pretty comprehensively
blocks the patch for Pg 10. I asked on list for input on structure
(if/how to split it up) literally months ago, for example.

I've been struggling to get some kind of support for logical decoding
on standbys for most of two release cycles, and there are people
climbing the walls wanting it. I'm trying to make sure it's done
right, but I can't do that alone, and it's hard to progress when I
don't know what will be expected until it's too late to do anything
about it.

I guess all we can do at this point is get the foundations in place
and carry on for Pg 11, where the presence of in-core logical
replication will offer a lever to actually push this in. In the mean
time I'll have to continue carrying the out-of-tree failover slots
patch for people who use logical decoding and want it to be reliable.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 23 March 2017 at 07:31, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:

>> I was thinking that by disallowing snapshot use and output plugin
>> invocation we'd avoid the need to support cancellation on recovery
>> conflicts, etc, simplifying things considerably.
>
> That seems like it'd end up being pretty hacky - the likelihood that
> we'd run into snapbuild error cross-checks seems very high.

TBH I'm not following this. But I haven't touched snapbuild much yet,
Petr's done much more with snapbuild than I have.

We're not going to have robust logical replication that's suitable for
HA and failover use on high load systems until 2020 or so, with Pg 12.
We'll need concurrent decoding and apply, which nobody's even started
on AFAIK, we'll need sequence replication, and more.

So I'd really, really like to get some kind of HA picture other than
"none" in for logical decoding based systems. If it's imperfect, it's
still something.

I wish we'd just proceeded with failover slots. They were blocked in
favour of decoding on standby, and HA is possible if we have decoding
on standby with some more work by the application. But now we'll have
neither. If we'd just done failover slots we could've had logical
replication able to follow failover in Pg 10.

What do _you_ see as the minimum acceptable way to achieve the ability
for a logical decoding client to follow failover of an upstream to a
physical standby? In the end, you're one of the main people whose view
carries weight in this area, and I don't want to develop yet another
approach only to have it knocked back once the work is done.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Andres Freund
Date:
On 2017-03-23 09:14:07 +0800, Craig Ringer wrote:
> On 23 March 2017 at 07:31, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:
> 
> >> I was thinking that by disallowing snapshot use and output plugin
> >> invocation we'd avoid the need to support cancellation on recovery
> >> conflicts, etc, simplifying things considerably.
> >
> > That seems like it'd end up being pretty hacky - the likelihood that
> > we'd run into snapbuild error cross-checks seems very high.
> 
> TBH I'm not following this. But I haven't touched snapbuild much yet,
> Petr's done much more with snapbuild than I have.

We can't just assume that snapbuild is going to work correctly when its
prerequisite - a pinned xmin horizon - isn't met.


> We're not going to have robust logical replication that's suitable for
> HA and failover use on high load systems until 2020 or so, with Pg 12.
> We'll need concurrent decoding and apply, which nobody's even started
> on AFAIK, we'll need sequence replication, and more.

These seem largely unrelated to the topic at hand (nor do I agree on all
of them).


> So I'd really, really like to get some kind of HA picture other than
> "none" in for logical decoding based systems. If it's imperfect, it's
> still something.

I still think decoding-on-standby is simply not the right approach as
the basic/first HA approach for logical rep.  It's a nice later-on
feature.  But that's an irrelevant aside.

I don't understand why you're making a "fundamental" argument here - I'm
not arguing against the goals of the patch at all. I want as much stuff
committed as we can in a good shape.


> What do _you_ see as the minimum acceptable way to achieve the ability
> for a logical decoding client to follow failover of an upstream to a
> physical standby? In the end, you're one of the main people whose view
> carries weight in this area, and I don't want to develop yet another

I think your approach here wasn't that bad? There's a lot of cleaning
up/shoring up needed, and we probably need a smarter feedback system.  I
don't think anybody here has objected to the fundamental approach?


Greetings,

Andres Freund



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote:

> We can't just assume that snapbuild is going to work correctly when its
> prerequisite - a pinned xmin horizon - isn't met.

Makes sense.

>> What do _you_ see as the minimum acceptable way to achieve the ability
>> for a logical decoding client to follow failover of an upstream to a
>> physical standby? In the end, you're one of the main people whose view
>> carries weight in this area, and I don't want to develop yet another
>
> I think your approach here wasn't that bad? There's a lot of cleaning
> up/shoring up needed, and we probably need a smarter feedback system.  I
> don't think anybody here has objected to the fundamental approach?

That's useful, thanks.

I'm not arguing that the patch as it stands is ready, and appreciate
the input re the general design.


> I still think decoding-on-standby is simply not the right approach as
> the basic/first HA approach for logical rep.  It's a nice later-on
> feature.  But that's an irrelevant aside.

I don't really agree that it's irrelevant.

Right now Pg has no HA capability for logical decoding clients. We've
now added logical replication, but it has no way to provide for
upstream node failure and ensure a consistent switch-over, whether to
a logical or physical replica. Since real world servers fail or need
maintenance, this is kind of a problem for practical production use.

Because of transaction serialization for commit-time order replay,
logical replication experiences saw-tooth replication lag, where large
or long xacts such as batch jobs effectively stall all later xacts
until they are fully replicated. We cannot currently start replicating
a big xact until it commits on the upstream, so that lag can easily be
~2x the runtime on the upstream.

So while you can do sync rep on a logical standby, it tends to result
in big delays on COMMITs relative to physical rep, even if apps are
careful to keep transactions small. When the app DR planning people
come and ask you what the max data loss window / max sync rep lag is,
you have to say ".... dunno? depends on what else was running on the
server at the time."

AFAICT, changing those things will require the ability to begin
streaming reorder buffers for big xacts before commit, which as the
logical decoding on 2PC thread shows is not exactly trivial. We'll
also need to be able to apply them concurrently with other xacts on
the other end. Those are both big and complex things IMO, and I'll be
surprised if we can do either in Pg11 given that AFAIK nobody has even
started work on either of them or has a detailed plan.

Presuming we get some kind of failover to logical replica upstreams
into Pg11, it'll have significant limitations relative to what we can
deliver to users by using physical replication. Especially when it
comes to bounded-time lag for HA, sync rep, etc. And I haven't seen a
design for it, though Petr and I have discussed some with regards to
pglogical.

That's why I think we need to do HA on the physical side first.
Because it's going to take a long time to get equivalent functionality
for logical rep based upstreams, and when it is we'll still have to
teach management tools and other non-logical-rep logical decoding
clients about the new way of doing things.  Whereas for physical HA
setups to support logical downstreams requires only relatively minor
changes and gets us all the physical HA features _now_.

That's why we pursued failover slots - as a simple, minimal solution
to allowing logical decoding clients to inter-operate with Pg in a
physical HA configuration. TBH, I still think we should just add them.
Sure, they don't help us achieve decoding on standby, but they're a
lot simpler and they help Pg's behaviour with slots match user
expectations for how the rest of Pg behaves, i.e. if it's on the
master it'll be on the replica too.  And as you've said, decoding on
standby is a nice-to-have, whereas I think some kind of HA support is
rather more important.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 23 March 2017 at 00:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
> On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> I'm splitting up the rest of the decoding on standby patch set with
>> the goal of getting minimal functionality for creating and managing
>> slots on standbys in, so we can maintain slots on standbys and use
>> them when the standby is promoted to master.
>>
>> The first, to send catalog_xmin separately to the global xmin on
>> hot_standby_feedback and store it in the upstream physical slot's
>> catalog_xmin, is attached.
>>
>> These are extracted directly from the logical decoding on standby
>> patch, with comments by Petr and Andres made re the relevant code
>> addressed.
>
> I've reduced your two patches back to one with a smaller blast radius.
>
> I'll commit this tomorrow morning, barring objections.

Thanks. I was tempted to refactor GetOldestXmin to use flags myself,
but thought it might be at higher risk of objections. Since Eiji Seki
has shown that there are other uses for excluding particular things
from GetOldestXmin, and that's committed now, it's nice to have the
impact of this patch reduced.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Andres Freund
Date:
On 2017-03-23 12:14:02 +0800, Craig Ringer wrote:
> On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote:
> > I still think decoding-on-standby is simply not the right approach as
> > the basic/first HA approach for logical rep.  It's a nice later-on
> > feature.  But that's an irrelevant aside.
> 
> I don't really agree that it's irrelevant.

I'm not sure we have enough time for either getting some parts of your
patch in, or for figuring out long term goals. But we definitely don't
have time for both.

- Andres



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 23 March 2017 at 12:41, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-23 12:14:02 +0800, Craig Ringer wrote:
>> On 23 March 2017 at 09:39, Andres Freund <andres@anarazel.de> wrote:
>> > I still think decoding-on-standby is simply not the right approach as
>> > the basic/first HA approach for logical rep.  It's a nice later-on
>> > feature.  But that's an irrelevant aside.
>>
>> I don't really agree that it's irrelevant.
>
> I'm not sure we have enough time for either getting some parts of your
> patch in, or for figuring out long term goals. But we definitely don't
> have time for both.

Fair.



-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 23 March 2017 at 00:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
> On 22 March 2017 at 08:53, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> I'm splitting up the rest of the decoding on standby patch set with
>> the goal of getting minimal functionality for creating and managing
>> slots on standbys in, so we can maintain slots on standbys and use
>> them when the standby is promoted to master.
>>
>> The first, to send catalog_xmin separately to the global xmin on
>> hot_standby_feedback and store it in the upstream physical slot's
>> catalog_xmin, is attached.
>>
>> These are extracted directly from the logical decoding on standby
>> patch, with comments by Petr and Andres made re the relevant code
>> addressed.
>
> I've reduced your two patches back to one with a smaller blast radius.
>
> I'll commit this tomorrow morning, barring objections.

This needs rebasing on top of

commit af4b1a0869bd3bb52e5f662e4491554b7f611489
Author: Simon Riggs <simon@2ndQuadrant.com>
Date:   Wed Mar 22 16:51:01 2017 +0000

    Refactor GetOldestXmin() to use flags

    Replace ignoreVacuum parameter with more flexible flags.

    Author: Eiji Seki
    Review: Haribabu Kommi


That patch landed up using PROCARRAY flags directly as flags to
GetOldestXmin, so it doesn't make much sense to add a flag like
PROCARRAY_REPLICATION_SLOTS . There won't be any corresponding PROC_
flag for PGXACT->vacuumFlags; replication slot xmin and catalog_xmin
are global state, not tracked in individual proc entries.

Rather than add some kind of "PROC_RESERVED" flag in proc.h that would
never be used and only exist to reserve a bit for use for
PROCARRAY_REPLICATION_SLOTS, which we'd use as a flag to GetOldestXmin, I
added a new argument to GetOldestXmin like the prior patch did.

If preferred I can instead add

proc.h:

#define PROC_RESERVED 0x20

procarray.h:

#define PROCARRAY_REPLICATION_SLOTS 0x20

and then test for (flags & PROCARRAY_REPLICATION_SLOTS)

but that's kind of ugly to say the least, I'd rather just add another argument.
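
For what it's worth, the flag variant would boil down to making the existing
replication-slot clamp in GetOldestXmin conditional, roughly like this
(sketch only, using the names above):

/* in GetOldestXmin(), after the proc array scan has computed "result" */
if (!(flags & PROCARRAY_REPLICATION_SLOTS) &&
    TransactionIdIsValid(procArray->replication_slot_xmin) &&
    NormalTransactionIdPrecedes(procArray->replication_slot_xmin, result))
    result = procArray->replication_slot_xmin;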

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 23 March 2017 at 16:07, Craig Ringer <craig@2ndquadrant.com> wrote:

> If preferred I can instead add
>
> proc.h:
>
> #define PROC_RESERVED 0x20
>
> procarray.h:
>
> #define PROCARRAY_REPLICATION_SLOTS 0x20
>
> and then test for (flags & PROCARRAY_REPLICATION_SLOTS)

Attached done that way.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: [HACKERS] Logical decoding on standby

From
Craig Ringer
Date:
On 23 March 2017 at 17:44, Craig Ringer <craig@2ndquadrant.com> wrote:

Minor update to catalog_xmin walsender patch to fix failure to
parenthesize definition of PROCARRAY_PROC_FLAGS_MASK .

This one's ready to go. Working on drop slots on DB drop now.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Attachment

Re: Logical decoding on standby

From
Simon Riggs
Date:
On 24 March 2017 at 06:23, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 23 March 2017 at 17:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> Minor update to catalog_xmin walsender patch to fix failure to
> parenthesize definition of PROCARRAY_PROC_FLAGS_MASK .
>
> This one's ready to go. Working on drop slots on DB drop now.

Committed. Next!

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 20 March 2017 at 17:33, Andres Freund <andres@anarazel.de> wrote:

> Have you checked how high the overhead of XLogReadDetermineTimeline is?
> A non-local function call, especially into a different translation-unit
> (no partial inlining), for every single page might end up being
> noticeable.  That's fine in the cases it actually adds functionality,
> but for a master streaming out data, that's not actually adding
> anything.

I haven't been able to measure any difference. But, since we require
the caller to ensure a reasonably up to date ThisTimeLineID, maybe
it's worth adding an inlineable function for the fast-path that tests
the "page cached" and "timeline is current and unchanged" conditions?


//xlogutils.h:
static inline void XLogReadDetermineTimeline(...)
{
    ... first test for page already read-in and valid ...
    ... second test for ThisTimeLineId ...
    XLogReadCheckTimeLineChange(...)
}

XLogReadCheckTimeLineChange(...)
{
    ... rest of function
}

(Yes, I know "inline" means little, but it's a hint for readers)

I'd rather avoid using a macro since it'd be pretty ugly, but it's
also an option if an inline func is undesirable.

#define XLOG_READ_DETERMINE_TIMELINE \
    do { \
        ... same as above ... \
    } while (0);


Can be done after CF if needed anyway, it's just fiddling some code
around. Figured I'd mention though.

>> +             /*
>> +              * To avoid largely duplicating ReplicationSlotDropAcquired() or
>> +              * complicating it with already_locked flags for ProcArrayLock,
>> +              * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
>> +              * just release our ReplicationSlotControlLock to drop the slot.
>> +              *
>> +              * There's no race here: we acquired this slot, and no slot "behind"
>> +              * our scan can be created or become active with our target dboid due
>> +              * to our exclusive lock on the DB.
>> +              */
>> +             LWLockRelease(ReplicationSlotControlLock);
>> +             ReplicationSlotDropAcquired();
>> +             LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
>
> I don't see much problem with this, but I'd change the code so you
> simply do a goto restart; if you released the slot.  Then there's a lot
> less chance / complications around temporarily releasing
> ReplicationSlotControlLock

I don't quite get this. I suspect I'm just not seeing the implications
as clearly as you do.

Do you mean we should restart the whole scan of the slot array if we
drop any slot? That'll be O(n log m) but since we don't expect to be
working on a big array or a lot of slots it's unlikely to matter.

The patch coming soon will assume we'll restart the whole scan, anyway.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
Hi

Here's the next patch in the split-up series, drop db-specific
(logical) replication slots on DROP DATABASE.

Current behaviour is to ERROR if logical slots exist on the DB,
whether in-use or not.

With this patch we can DROP a database if it has logical slots so long
as they are not active. I haven't added any sort of syntax for this,
it's just done unconditionally.

I don't see any sensible way to stop a slot becoming active after we
check for active slots and begin the actual database DROP, since
ReplicationSlotAcquire will happily acquire a db-specific slot for a
different DB and the only lock it takes is a shared lock on
ReplicationSlotControlLock, which we presumably don't want to hold
throughout DROP DATABASE.

So this patch makes ReplicationSlotAcquire check that the slot
database matches the current database and refuse to acquire the slot
if it does not. The only sensible reason to acquire a slot from a
different DB is to drop it, and then it's only a convenience at best.
Slot drop is the only user-visible behaviour change, since all other
activity on logical slots happened when the backend was already
connected to the slot's DB. Appropriate docs changes have been made
and tests added.
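
The check itself is just a couple of lines in ReplicationSlotAcquire(), along
these lines (sketch only; as noted in the follow-up below, this requirement
gets dropped again):

/* in ReplicationSlotAcquire(), once the named slot has been found */
if (SlotIsLogical(slot) && slot->data.database != MyDatabaseId)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("replication slot \"%s\" was created for a different database",
                    NameStr(slot->data.name))));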

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Craig Ringer
Date:
On 27 March 2017 at 14:08, Craig Ringer <craig@2ndquadrant.com> wrote:

> So this patch makes ReplicationSlotAcquire check that the slot
> database matches the current database and refuse to acquire the slot
> if it does not.

New patch attached that drops above requirement, so slots can still be
dropped from any DB.

This introduces a narrow race window where DROP DATABASE may ERROR if
somebody connects to a different database and runs a
pg_drop_replication_slot(...) for one of the slots being dropped by
DROP DATABASE after we check for active slots but before we've dropped
the slot. But it's hard to hit and it's pretty harmless; the worst
possible result is dropping one or more of the slots before we ERROR
out of the DROP. But you clearly didn't want them anyway, since you
were dropping the DB and dropping some slots at the same time.

I think this one's ready to go.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Simon Riggs
Date:
On 27 March 2017 at 09:03, Craig Ringer <craig@2ndquadrant.com> wrote:

> I think this one's ready to go.

Looks like something I could commit. Full review by me while offline
today, aiming to commit tomorrow barring issues raised.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 27 March 2017 at 16:20, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
> On 27 March 2017 at 09:03, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> I think this one's ready to go.
>
> Looks like something I could commit. Full review by me while offline
> today, aiming to commit tomorrow barring issues raised.

Great.

Meanwhile I'm going to be trying to work with Stas on 2PC logical
decoding, while firming up the next patches in this series to see if
we can progress a bit further.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Andres Freund
Date:
Hi,

On 2017-03-27 16:03:48 +0800, Craig Ringer wrote:
> On 27 March 2017 at 14:08, Craig Ringer <craig@2ndquadrant.com> wrote:
> 
> > So this patch makes ReplicationSlotAcquire check that the slot
> > database matches the current database and refuse to acquire the slot
> > if it does not.
> 
> New patch attached that drops above requirement, so slots can still be
> dropped from any DB.
> 
> This introduces a narrow race window where DROP DATABASE may ERROR if
> somebody connects to a different database and runs a
> pg_drop_replication_slot(...) for one of the slots being dropped by
> DROP DATABASE after we check for active slots but before we've dropped
> the slot. But it's hard to hit and it's pretty harmless; the worst
> possible result is dropping one or more of the slots before we ERROR
> out of the DROP. But you clearly didn't want them anyway, since you
> were dropping the DB and dropping some slots at the same time.
> 
> I think this one's ready to go.
> 
> -- 
>  Craig Ringer                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

> From 99d5313d3a265bcc57ca6845230b9ec49d188710 Mon Sep 17 00:00:00 2001
> From: Craig Ringer <craig@2ndquadrant.com>
> Date: Wed, 22 Mar 2017 13:21:09 +0800
> Subject: [PATCH] Make DROP DATABASE drop logical slots for the DB
> 
> Automatically drop all logical replication slots associated with a
> database when the database is dropped.
> ---
>  doc/src/sgml/func.sgml                             |  3 +-
>  doc/src/sgml/protocol.sgml                         |  2 +
>  src/backend/commands/dbcommands.c                  | 32 +++++---
>  src/backend/replication/slot.c                     | 88 ++++++++++++++++++++++
>  src/include/replication/slot.h                     |  1 +
>  src/test/recovery/t/006_logical_decoding.pl        | 40 +++++++++-
>  .../recovery/t/010_logical_decoding_timelines.pl   | 30 +++++++-
>  7 files changed, 182 insertions(+), 14 deletions(-)
> 
> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> index ba6f8dd..78508d7 100644
> --- a/doc/src/sgml/func.sgml
> +++ b/doc/src/sgml/func.sgml
> @@ -18876,7 +18876,8 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
>         <entry>
>          Drops the physical or logical replication slot
>          named <parameter>slot_name</parameter>. Same as replication protocol
> -        command <literal>DROP_REPLICATION_SLOT</>.
> +        command <literal>DROP_REPLICATION_SLOT</>. For logical slots, this must
> +        be called when connected to the same database the slot was created on.
>         </entry>
>        </row>
>  
> diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
> index b3a5026..5f97141 100644
> --- a/doc/src/sgml/protocol.sgml
> +++ b/doc/src/sgml/protocol.sgml
> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
>       <para>
>        Drops a replication slot, freeing any reserved server-side resources. If
>        the slot is currently in use by an active connection, this command fails.
> +      If the slot is a logical slot that was created in a database other than
> +      the database the walsender is connected to, this command fails.
>       </para>
>       <variablelist>
>        <varlistentry>

Shouldn't the docs in the drop database section mention this?


> +void
> +ReplicationSlotsDropDBSlots(Oid dboid)
> +{
> +    int            i;
> +
> +    if (max_replication_slots <= 0)
> +        return;
> +
> +restart:
> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +    for (i = 0; i < max_replication_slots; i++)
> +    {
> +        ReplicationSlot *s;
> +        NameData slotname;
> +        int active_pid;
> +
> +        s = &ReplicationSlotCtl->replication_slots[i];
> +
> +        /* cannot change while ReplicationSlotCtlLock is held */
> +        if (!s->in_use)
> +            continue;
> +
> +        /* only logical slots are database specific, skip */
> +        if (!SlotIsLogical(s))
> +            continue;
> +
> +        /* not our database, skip */
> +        if (s->data.database != dboid)
> +            continue;
> +
> +        /* Claim the slot, as if ReplicationSlotAcquire()ing. */
> +        SpinLockAcquire(&s->mutex);
> +        strncpy(NameStr(slotname), NameStr(s->data.name), NAMEDATALEN);
> +        NameStr(slotname)[NAMEDATALEN-1] = '\0';
> +        active_pid = s->active_pid;
> +        if (active_pid == 0)
> +        {
> +            MyReplicationSlot = s;
> +            s->active_pid = MyProcPid;
> +        }
> +        SpinLockRelease(&s->mutex);
> +
> +        /*
> +         * We might fail here if the slot was active. Even though we hold an
> +         * exclusive lock on the database object a logical slot for that DB can
> +         * still be active if it's being dropped by a backend connected to
> +         * another DB or is otherwise acquired.
> +         *
> +         * It's an unlikely race that'll only arise from concurrent user action,
> +         * so we'll just bail out.
> +         */
> +        if (active_pid)
> +            elog(ERROR, "replication slot %s is in use by pid %d",
> +                  NameStr(slotname), active_pid);
> +
> +        /*
> +         * To avoid largely duplicating ReplicationSlotDropAcquired() or
> +         * complicating it with already_locked flags for ProcArrayLock,
> +         * ReplicationSlotControlLock and ReplicationSlotAllocationLock, we
> +         * just release our ReplicationSlotControlLock to drop the slot.
> +         *
> +         * For safety we'll restart our scan from the beginning each
> +         * time we release the lock.
> +         */
> +        LWLockRelease(ReplicationSlotControlLock);
> +        ReplicationSlotDropAcquired();
> +        goto restart;
> +    }
> +    LWLockRelease(ReplicationSlotControlLock);
> +
> +    /* recompute limits once after all slots are dropped */
> +    ReplicationSlotsComputeRequiredXmin(false);
> +    ReplicationSlotsComputeRequiredLSN();

I was concerned for a second that we'd skip doing
ReplicationSlotsComputeRequired* if we ERROR out above - but
ReplicationSlotDropAcquired already calls these as necessary. I.e. they
should be dropped from here.


- Andres



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:

>> --- a/doc/src/sgml/protocol.sgml
>> +++ b/doc/src/sgml/protocol.sgml
>> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
>>       <para>
>>        Drops a replication slot, freeing any reserved server-side resources. If
>>        the slot is currently in use by an active connection, this command fails.
>> +      If the slot is a logical slot that was created in a database other than
>> +      the database the walsender is connected to, this command fails.
>>       </para>
>>       <variablelist>
>>        <varlistentry>
>
> Shouldn't the docs in the drop database section mention this?

DROP DATABASE doesn't really discuss all the resources it drops, but
I'm happy to add a mention of replication slot handling.

I just notice that I failed to remove the docs changes regarding
dropping slots becoming db-specific, so I'll post a follow-up for that
in a sec.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 29 March 2017 at 08:01, Craig Ringer <craig@2ndquadrant.com> wrote:

> I just notice that I failed to remove the docs changes regarding
> dropping slots becoming db-specific, so I'll post a follow-up for that
> in a sec.

Attached.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Craig Ringer
Date:
On 29 March 2017 at 08:11, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 29 March 2017 at 08:01, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> I just notice that I failed to remove the docs changes regarding
>> dropping slots becoming db-specific, so I'll post a follow-up for that
>> in a sec.
>
> Attached.

... and here's the next in the patch series. Both this and the
immediately prior minor patch fix-drop-slot-docs.patch are pending
now.


Notable changes in this patch since review:

* Split oldestCatalogXmin tracking into separate patch

* Critically, fix use of procArray->replication_slot_catalog_xmin in
GetSnapshotData's setting of RecentGlobalXmin and RecentGlobalDataXmin
so it instead uses ShmemVariableCache->oldestCatalogXmin . This
could've led to tuples newer than oldestCatalogXmin being removed.

* Memory barrier in UpdateOldestCatalogXmin and SetOldestCatalogXmin.
It still does a pre-check before deciding if it needs to take
ProcArrayLock, recheck, and advance, since we don't want to
unnecessarily contest ProcArrayLock.

* Remove unnecessary volatile usage (retained in
UpdateOldestCatalogXmin due to barrier)

* Remove unnecessary test for XLogInsertAllowed() in XactLogCatalogXminUpdate

* EnsureActiveLogicalSlotValid(void)  - add (void)

* pgidented changes in this diff; have left unrelated changes alone




Re:

> what does
>
> +       TransactionId oldestCatalogXmin; /* oldest xid where complete catalog state
> +                                                                         * is guaranteed to still exist */
>
> mean?  I complained about the overall justification in the commit
> already, but looking at this commit alone, the justification for this
> part of the change is quite hard to understand.

The patch now contains

    TransactionId oldestCatalogXmin; /* oldest xid it is guaranteed to be safe
                                      * to create a historic snapshot for; see
                                      * also
                                      * procArray->replication_slot_catalog_xmin
                                      * */

which I think is an improvement.

I've also sought to explain the purpose of this change better with

/*
 * If necessary, copy the current catalog_xmin needed by replication slots to
 * the effective catalog_xmin used for dead tuple removal and write a WAL
 * record recording the change.
 *
 * This allows standbys to know the oldest xid for which it is safe to create
 * a historic snapshot for logical decoding. VACUUM or other cleanup may have
 * removed catalog tuple versions needed to correctly decode transactions older
 * than this threshold. Standbys can use this information to cancel conflicting
 * decoding sessions and invalidate slots that need discarded information.
 *
 * (We can't use the transaction IDs in WAL records emitted by VACUUM etc for
 * this, since they don't identify the relation as a catalog or not.  Nor can a
 * standby look up the relcache to get the Relation for the affected
 * relfilenode to check if it is a catalog. The standby would also have no way
 * to know the oldest safe position at startup if it wasn't in the control
 * file.)
 */
void
UpdateOldestCatalogXmin(void)
{
...

Does that help?
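
In case the shape of the function helps, the body amounts to roughly the
following (a condensed sketch of the attached patch; it glosses over the
memory barrier and the exact form of the unlocked pre-check):

void
UpdateOldestCatalogXmin(void)
{
    TransactionId slots_catalog_xmin;

    /* cheap unlocked pre-check so we don't contend ProcArrayLock for nothing */
    if (ShmemVariableCache->oldestCatalogXmin ==
        procArray->replication_slot_catalog_xmin)
        return;

    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    slots_catalog_xmin = procArray->replication_slot_catalog_xmin;
    if (ShmemVariableCache->oldestCatalogXmin != slots_catalog_xmin)
    {
        /* WAL-log the new threshold, then make it the effective limit */
        XactLogCatalogXminUpdate(slots_catalog_xmin);
        ShmemVariableCache->oldestCatalogXmin = slots_catalog_xmin;
    }
    LWLockRelease(ProcArrayLock);
}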




(Sidenote for later: ResolveRecoveryConflictWithLogicalDecoding will
need a read barrier too, when the next patch adds it.)

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Craig Ringer
Date:
On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:

> * Split oldestCatalogXmin tracking into separate patch

Regarding this, Simon raised concerns about xlog volume here.

It's pretty negligible.

We only write a new record when a vacuum runs after catalog_xmin
advances on the slot with the currently-lowest catalog_xmin (or, if
vacuum doesn't run reasonably soon, when the bgworker next looks).

So at worst on a fairly slow moving system or one with a super high
vacuum rate we'll write one per commit. But in most cases we'll write
a lot fewer than that. When running t/006_logical_decoding.pl for
example:

$ ../../../src/bin/pg_waldump/pg_waldump tmp_check/data_master_daPa/pgdata/pg_wal/000000010000000000000001 | grep CATALOG
rmgr: Transaction len (rec/tot):      4/    30, tx:          0, lsn: 0/01648D50, prev 0/01648D18, desc: CATALOG_XMIN catalog_xmin 555
rmgr: Transaction len (rec/tot):      4/    30, tx:          0, lsn: 0/0164C840, prev 0/0164C378, desc: CATALOG_XMIN catalog_xmin 0
pg_waldump: FATAL:  error in WAL record at 0/16BBF10: invalid record length at 0/16BBF88: wanted 24, got 0


and of course, none at all unless you use logical decoding.



-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Simon Riggs
Date:
On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> * Split oldestCatalogXmin tracking into separate patch
>
> Regarding this, Simon raised concerns about xlog volume here.
>
> It's pretty negligible.
>
> We only write a new record when a vacuum runs after catalog_xmin
> advances on the slot with the currently-lowest catalog_xmin (or, if
> vacuum doesn't run reasonably soon, when the bgworker next looks).

I'd prefer to slow things down a little, not be so eager.

If we hold back update of the catalog_xmin until when we run
GetRunningTransactionData() we wouldn't need to produce any WAL
records at all AND we wouldn't need to have VACUUM do
UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra
task.

That would also make this patch about half the length it is.
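
Roughly, the shape of the idea is to let the running-xacts snapshot we
already log carry the limit; a sketch of the record (the new field's name
and placement are illustrative only, not a finished design):

typedef struct xl_running_xacts
{
    int         xcnt;            /* # of xact ids in xids[] */
    int         subxcnt;         /* # of subxact ids in xids[] */
    bool        subxid_overflow; /* snapshot overflowed, subxids missing */
    TransactionId nextXid;       /* xid from ShmemVariableCache->nextXid */
    TransactionId oldestRunningXid;   /* *not* oldestXmin */
    TransactionId latestCompletedXid; /* so we can set xmax */
    TransactionId oldestCatalogXmin;  /* illustrative new field: oldest xid
                                       * whose catalog tuples are guaranteed
                                       * to still exist */

    TransactionId xids[FLEXIBLE_ARRAY_MEMBER];
} xl_running_xacts;

The standby would then pick the value up when standby_redo() hands the
record to ProcArrayApplyRecoveryInfo(), with no dedicated record type.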

Let me know what you think.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 29 March 2017 at 23:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
> On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>>> * Split oldestCatalogXmin tracking into separate patch
>>
>> Regarding this, Simon raised concerns about xlog volume here.
>>
>> It's pretty negligible.
>>
>> We only write a new record when a vacuum runs after catalog_xmin
>> advances on the slot with the currently-lowest catalog_xmin (or, if
>> vacuum doesn't run reasonably soon, when the bgworker next looks).
>
> I'd prefer to slow things down a little, not be so eager.
>
> If we hold back update of the catalog_xmin until when we run
> GetRunningTransactionData() we wouldn't need to produce any WAL
> records at all AND we wouldn't need to have VACUUM do
> UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra
> task.
>
> That would also make this patch about half the length it is.
>
> Let me know what you think.

Good idea.

We can always add a heuristic later to make xl_running_xacts get
emitted more often at high transaction rates if it's necessary.

Patch coming soon.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 30 March 2017 at 11:34, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 29 March 2017 at 23:13, Simon Riggs <simon.riggs@2ndquadrant.com> wrote:
>> On 29 March 2017 at 10:17, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> On 29 March 2017 at 16:44, Craig Ringer <craig@2ndquadrant.com> wrote:
>>>
>>>> * Split oldestCatalogXmin tracking into separate patch
>>>
>>> Regarding this, Simon raised concerns about xlog volume here.
>>>
>>> It's pretty negligible.
>>>
>>> We only write a new record when a vacuum runs after catalog_xmin
>>> advances on the slot with the currently-lowest catalog_xmin (or, if
>>> vacuum doesn't run reasonably soon, when the bgworker next looks).
>>
>> I'd prefer to slow things down a little, not be so eager.
>>
>> If we hold back update of the catalog_xmin until when we run
>> GetRunningTransactionData() we wouldn't need to produce any WAL
>> records at all AND we wouldn't need to have VACUUM do
>> UpdateOldestCatalogXmin(). Bgwriter wouldn't need to perform an extra
>> task.
>>
>> That would also make this patch about half the length it is.
>>
>> Let me know what you think.
>
> Good idea.
>
> We can always add a heuristic later to make xl_running_xacts get
> emitted more often at high transaction rates if it's necessary.
>
> Patch coming soon.

Attached.

A bit fiddlier than expected, but I like the result more.

In the process I identified an issue with both the prior patch and
this one where we don't check slot validity for slots that existed on
standby prior to promotion of standby to master. We were just assuming
that being the master was good enough, since it controls
replication_slot_catalog_xmin, but that's not true for pre-existing
slots.

Fixed by forcing update of the persistent safe catalog xmin after the
first slot is created on the master - which is now done by doing an
immediate LogStandbySnapshot() after assigning the slot's
catalog_xmin.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Simon Riggs
Date:
On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote:

> Attached.

* Cleaned up in 3 places
* Added code for faked up RunningTransactions in xlog.c
* Ensure catalog_xmin doesn't go backwards

All else looks good. Comments before commit?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Logical decoding on standby

From
Andres Freund
Date:
On 2017-03-30 15:26:02 +0100, Simon Riggs wrote:
> On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote:
> 
> > Attached.
> 
> * Cleaned up in 3 places
> * Added code for faked up RunningTransactions in xlog.c
> * Ensure catalog_xmin doesn't go backwards
> 
> All else looks good. Comments before commit?

Can you give me till after lunch?

- Andres



Re: Logical decoding on standby

From
Simon Riggs
Date:
On 30 March 2017 at 15:27, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-30 15:26:02 +0100, Simon Riggs wrote:
>> On 30 March 2017 at 09:07, Craig Ringer <craig@2ndquadrant.com> wrote:
>>
>> > Attached.
>>
>> * Cleaned up in 3 places
>> * Added code for faked up RunningTransactions in xlog.c
>> * Ensure catalog_xmin doesn't go backwards
>>
>> All else looks good. Comments before commit?
>
> Can you give me till after lunch?

Sure, np

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Logical decoding on standby

From
Andres Freund
Date:
> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>          SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>  
>          /*
> +         * There can be no concurrent writers to oldestCatalogXmin during
> +         * recovery, so no need to take ProcArrayLock.
> +         */
> +        ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;

s/writers/writes/?

> @@ -9731,6 +9748,15 @@ xlog_redo(XLogReaderState *record)
>                                    checkPoint.oldestXid))
>              SetTransactionIdLimit(checkPoint.oldestXid,
>                                    checkPoint.oldestXidDB);
> +
> +        /*
> +         * There can be no concurrent writers to oldestCatalogXmin during
> +         * recovery, so no need to take ProcArrayLock.
> +         */
> +        if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
> +                                    checkPoint.oldestCatalogXmin)
> +            ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;

dito.



> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>  
>      ReplicationSlotsComputeRequiredXmin(true);
>  
> +    /*
> +     * If this is the first slot created on the master we won't have a
> +     * persistent record of the oldest safe xid for historic snapshots yet.
> +     * Force one to be recorded so that when we go to replay from this slot we
> +     * know it's safe.
> +     */
> +    force_standby_snapshot =
> +        !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);

s/first slot/first logical slot/. Also, the reference to master doesn't
seem right.


>      LWLockRelease(ProcArrayLock);
>  
> +    /* Update ShmemVariableCache->oldestCatalogXmin */
> +    if (force_standby_snapshot)
> +        LogStandbySnapshot();

The comment and code don't quite square to me - it's far from obvious
that LogStandbySnapshot does something like that. I'd even say it's a
bad idea to have it do that.


>      /*
>       * tell the snapshot builder to only assemble snapshot once reaching the
>       * running_xact's record with the respective xmin.
> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>          start_lsn = slot->data.confirmed_flush;
>      }
>  
> +    EnsureActiveLogicalSlotValid();

This seems like it should be in a separate patch, and separately
reviewed. It's code that's currently unreachable (and thus untestable).


> +/*
> + * Test to see if the active logical slot is usable.
> + */
> +static void
> +EnsureActiveLogicalSlotValid(void)
> +{
> +    TransactionId shmem_catalog_xmin;
> +
> +    Assert(MyReplicationSlot != NULL);
> +
> +    /*
> +     * A logical slot can become unusable if we're doing logical decoding on a
> +     * standby or using a slot created before we were promoted from standby
> +     * to master.

Neither of those is currently possible...


> If the master advanced its global catalog_xmin past the
> +     * threshold we need it could've removed catalog tuple versions that
> +     * we'll require to start decoding at our restart_lsn.
> +     *
> +     * We need a barrier so that if we decode in recovery on a standby we
> +     * don't allow new decoding sessions to start after redo has advanced
> +     * the threshold.
> +     */
> +    if (RecoveryInProgress())
> +        pg_memory_barrier();

I don't think this is a meaningful locking protocol.  It's a bad idea to
use lock-free programming without need, especially when the concurrency
protocol isn't well defined.  With what other barrier does this pair
with?  What prevents the data being out of date by the time we actually
do the check?


> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
> index cfc3fba..cdc5f95 100644
> --- a/src/backend/replication/walsender.c
> +++ b/src/backend/replication/walsender.c
> @@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
>       * be energy wasted - the worst lost information can do here is give us
>       * wrong information in a statistics view - we'll just potentially be more
>       * conservative in removing files.
> +     *
> +     * We don't have to do any effective_xmin / effective_catalog_xmin testing
> +     * here either, like for LogicalConfirmReceivedLocation. If we received
> +     * the xmin and catalog_xmin from downstream replication slots we know they
> +     * were already confirmed there,
>       */
>  }

This comment reads as if LogicalConfirmReceivedLocation had
justification for not touching / checking catalog_xmin. But it does.



>      /*
> +     * Update our knowledge of the oldest xid we can safely create historic
> +     * snapshots for.
> +     *
> +     * There can be no concurrent writers to oldestCatalogXmin during
> +     * recovery, so no need to take ProcArrayLock.

By now I think this is pretty flawed logic, because there can be concurrent
readers, that need to be safe against oldestCatalogXmin advancing
concurrently.


>      /*
> -     * It's important *not* to include the limits set by slots here because
> +     * It's important *not* to include the xmin set by slots here because
>       * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>       * were to be included here the initial value could never increase because
>       * of a circular dependency where slots only increase their limits when
>       * running xacts increases oldestRunningXid and running xacts only
>       * increases if slots do.
> +     *
> +     * We can include the catalog_xmin limit here; there's no similar
> +     * circularity, and we need it to log xl_running_xacts records for
> +     * standbys.
>       */

Those comments seem to need some more heavyhanded reconciliation.

>   *
>   * Return the current slot xmin limits. That's useful to be able to remove
>   * data that's older than those limits.
> + *
> + * For logical replication slots' catalog_xmin, we return both the effective

This seems to need some light editing.


>  /*
>   * Record an enhanced snapshot of running transactions into WAL.
>   *
> + * We also record the value of procArray->replication_slot_catalog_xmin
> + * obtained from GetRunningTransactionData here, so standbys know we're about
> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
> + * removing dead catalog tuples below that threshold.

I think this needs some rephrasing. We're not necessarily about to remove
catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.

> +static void
> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
> +{
> +    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> +    if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
> +        || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) !=
> +            TransactionIdIsValid(pendingOldestCatalogXmin)))
> +        ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
> +    LWLockRelease(ProcArrayLock);
> +}

Doing TransactionIdPrecedes before ensuring
ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
me as a good idea.  Generally, the expression as it stands is hard to
understand.

> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
> index d25a2dd..a4ecfb7 100644
> --- a/src/include/access/transam.h
> +++ b/src/include/access/transam.h
> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>                                           * aborted */
>  
>      /*
> +     * This field is protected by ProcArrayLock except
> +     * during recovery, when it's set unlocked.
> +     *
> +     * oldestCatalogXmin is the oldest xid it is
> +     * guaranteed to be safe to create a historic
> +     * snapshot for. See also
> +     * procArray->replication_slot_catalog_xmin
> +     */
> +    TransactionId oldestCatalogXmin;

Maybe it'd be better to rephrase that to something like
"oldestCatalogXmin guarantees that no valid catalog tuples >= than it
are removed. That property is used for logical decoding.". or similar?

>  /*
>   * Each page of XLOG file has a header like this:
>   */
> -#define XLOG_PAGE_MAGIC 0xD097    /* can be used as WAL version indicator */
> +#define XLOG_PAGE_MAGIC 0xD100    /* can be used as WAL version indicator */

We normally only advance this by one, it's not tied to the poistgres version.


I'm sorry, but to me this patch isn't ready.  I'm also doubtful that it
makes a whole lot of sense on its own, without having finished the
design for decoding on a standby - it seems quite likely that we'll have
to redesign the mechanisms in here a bit for that.  For 10 this seems to
do nothing but add overhead?

Greetings,

Andres Freund



Re: Logical decoding on standby

From
Andres Freund
Date:
On 2017-03-29 08:01:34 +0800, Craig Ringer wrote:
> On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:
> 
> >> --- a/doc/src/sgml/protocol.sgml
> >> +++ b/doc/src/sgml/protocol.sgml
> >> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
> >>       <para>
> >>        Drops a replication slot, freeing any reserved server-side resources. If
> >>        the slot is currently in use by an active connection, this command fails.
> >> +      If the slot is a logical slot that was created in a database other than
> >> +      the database the walsender is connected to, this command fails.
> >>       </para>
> >>       <variablelist>
> >>        <varlistentry>
> >
> > Shouldn't the docs in the drop database section about this?
> 
> DROP DATABASE doesn't really discuss all the resources it drops, but
> I'm happy to add mention of replication slots handling.

I don't think that's really comparable, because the other things aren't
global objects, which replication slots are.

- Andres



Re: Logical decoding on standby

From
Simon Riggs
Date:
On 30 March 2017 at 18:16, Andres Freund <andres@anarazel.de> wrote:

>>  /*
>>   * Each page of XLOG file has a header like this:
>>   */
>> -#define XLOG_PAGE_MAGIC 0xD097       /* can be used as WAL version indicator */
>> +#define XLOG_PAGE_MAGIC 0xD100       /* can be used as WAL version indicator */
>
> We normally only advance this by one, it's not tied to the poistgres version.

That was my addition. I rounded it up cos this is release 10. No biggie.

(Poistgres? Is that the Manhattan spelling?)

> I'm sorry, but to me this patch isn't ready.  I'm also doubtful that it
> makes a whole lot of sense on its own, without having finished the
> design for decoding on a standby - it seems quite likely that we'll have
> to redesign the mechanisms in here a bit for that.  For 10 this seems to
> do nothing but add overhead?

I'm sure we can reword the comments.

We've been redesigning the mechanisms for 2 years now, so it seems
unlikely that further redesign can be required. If it is required,
this patch is fairly low touch and change is possible later,
incremental development etc. As regards overhead, this adds a small
amount of time to a background process executed every 10 secs,
generates no new WAL records.

So I don't see any reason not to commit this feature, after the minor
corrections.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Logical decoding on standby

From
Andres Freund
Date:
On 2017-03-30 19:40:08 +0100, Simon Riggs wrote:
> On 30 March 2017 at 18:16, Andres Freund <andres@anarazel.de> wrote:
> 
> >>  /*
> >>   * Each page of XLOG file has a header like this:
> >>   */
> >> -#define XLOG_PAGE_MAGIC 0xD097       /* can be used as WAL version indicator */
> >> +#define XLOG_PAGE_MAGIC 0xD100       /* can be used as WAL version indicator */
> >
> > We normally only advance this by one, it's not tied to the poistgres version.
> 
> That was my addition. I rounded it up cos this is release 10. No biggie.

We'll probably upgrade that more than once again this release...


> (Poistgres? Is that the Manhattan spelling?)

Tiredness spelling ;)


> We've been redesigning the mechanisms for 2 years now, so it seems
> unlikely that further redesign can be required.

I don't think that's true *at all* - the mechanism was previously
fundamentally different.

The whole topic has largely seen activity shortly before the code
freeze, both last time round and now.  I don't think it's surprising
that it thus doesn't end up being ready.


> If it is required,
> this patch is fairly low touch and change is possible later,
> incremental development etc. As regards overhead, this adds a small
> amount of time to a background process executed every 10 secs,
> generates no new WAL records.
> 
> So I don't see any reason not to commit this feature, after the minor
> corrections.

It doesn't have any benefit on its own, and the locking model doesn't seem
fully there.  I don't see much reason to get this in before the release.


- Andres



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-29 08:01:34 +0800, Craig Ringer wrote:
>> On 28 March 2017 at 23:22, Andres Freund <andres@anarazel.de> wrote:
>>
>> >> --- a/doc/src/sgml/protocol.sgml
>> >> +++ b/doc/src/sgml/protocol.sgml
>> >> @@ -2034,6 +2034,8 @@ The commands accepted in walsender mode are:
>> >>       <para>
>> >>        Drops a replication slot, freeing any reserved server-side resources. If
>> >>        the slot is currently in use by an active connection, this command fails.
>> >> +      If the slot is a logical slot that was created in a database other than
>> >> +      the database the walsender is connected to, this command fails.
>> >>       </para>
>> >>       <variablelist>
>> >>        <varlistentry>
>> >
>> > Shouldn't the docs in the drop database section about this?
>>
>> DROP DATABASE doesn't really discuss all the resources it drops, but
>> I'm happy to add mention of replication slots handling.
>
> I don't think that's really comparable, because the other things aren't
> global objects, which replication slots are.

Fine by me.

Patch fix-slot-drop-docs.patch, upthread, adds the passage

+
+  <para>
+   Active <link linkend="logicaldecoding-replication-slots">logical
+   replication slots</> count as connections and will prevent a
+   database from being dropped. Inactive slots will be automatically
+   dropped when the database is dropped.
+  </para>

to the notes section of the DROP DATABASE docs; that should do the
trick. I'm not convinced it's worth going into the exceedingly
unlikely race with concurrent slot drop, and we don't seem to in other
places in the docs, like the race you mentioned with connecting to a
db as it's being dropped.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:
>> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>>               SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>>
>>               /*
>> +              * There can be no concurrent writers to oldestCatalogXmin during
>> +              * recovery, so no need to take ProcArrayLock.
>> +              */
>> +             ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
>
> s/writers/writes/?

I meant writers, i.e. nothing else can be writing to it. But writes works too.


>> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>>
>>       ReplicationSlotsComputeRequiredXmin(true);
>>
>> +     /*
>> +      * If this is the first slot created on the master we won't have a
>> +      * persistent record of the oldest safe xid for historic snapshots yet.
>> +      * Force one to be recorded so that when we go to replay from this slot we
>> +      * know it's safe.
>> +      */
>> +     force_standby_snapshot =
>> +             !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
>
> s/first slot/first logical slot/. Also, the reference to master doesn't
> seem right.

Unsure what you mean re reference to master not seeming right.

If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
from the new slot so we need to make sure it gets advanced once we've
decided on our starting catalog_xmin.

>>       LWLockRelease(ProcArrayLock);
>>
>> +     /* Update ShmemVariableCache->oldestCatalogXmin */
>> +     if (force_standby_snapshot)
>> +             LogStandbySnapshot();
>
> The comment and code don't quite square to me - it's far from obvious
> that LogStandbySnapshot does something like that. I'd even say it's a
> bad idea to have it do that.

So you prefer the prior approach with separate xl_catalog_xmin advance records?

I don't have much preference; I liked the small code reduction of
Simon's proposed approach, but it landed up being a bit awkward in
terms of ordering and locking. I don't think catalog_xmin tracking is
really closely related to the standby snapshot stuff and it feels a
bit like it's a tacked-on afterthought where it is now.

>>       /*
>>        * tell the snapshot builder to only assemble snapshot once reaching the
>>        * running_xact's record with the respective xmin.
>> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>>               start_lsn = slot->data.confirmed_flush;
>>       }
>>
>> +     EnsureActiveLogicalSlotValid();
>
> This seems like it should be in a separate patch, and separately
> reviewed. It's code that's currently unreachable (and thus untestable).

It is reached and is run, those checks run whenever creating a
non-initial decoding context on master or replica.

The failure case is reachable if a replica has hot_standby_feedback
off or it's not using a physical slot and loses its connection. If
promoted, any slot existing on that replica (from a file system level
copy when the replica was created) will fail. I agree it's contrived
since we can't create and maintain slots on replicas, which is the
main point of it.


>> +/*
>> + * Test to see if the active logical slot is usable.
>> + */
>> +static void
>> +EnsureActiveLogicalSlotValid(void)
>> +{
>> +     TransactionId shmem_catalog_xmin;
>> +
>> +     Assert(MyReplicationSlot != NULL);
>> +
>> +     /*
>> +      * A logical slot can become unusable if we're doing logical decoding on a
>> +      * standby or using a slot created before we were promoted from standby
>> +      * to master.
>
> Neither of those is currently possible...

Right. Because it's foundations for decoding on standby.

>> If the master advanced its global catalog_xmin past the
>> +      * threshold we need it could've removed catalog tuple versions that
>> +      * we'll require to start decoding at our restart_lsn.
>> +      *
>> +      * We need a barrier so that if we decode in recovery on a standby we
>> +      * don't allow new decoding sessions to start after redo has advanced
>> +      * the threshold.
>> +      */
>> +     if (RecoveryInProgress())
>> +             pg_memory_barrier();
>
> I don't think this is a meaningful locking protocol.  It's a bad idea to
> use lock-free programming without need, especially when the concurrency
> protocol isn't well defined.

Yeah. The intended interaction is with recovery conflict on standby
which doesn't look likely to land in this release due to patch
split/cleanup etc. (Not a complaint).

Better to just take a brief shared ProcArrayLock.
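
Something like this minimal sketch is what I have in mind (error wording
and exact shape are illustrative, not the patch as posted):

static void
EnsureActiveLogicalSlotValid(void)
{
    TransactionId shmem_catalog_xmin;

    Assert(MyReplicationSlot != NULL);

    /* Read under a brief shared lock so we can't race redo advancing it */
    LWLockAcquire(ProcArrayLock, LW_SHARED);
    shmem_catalog_xmin = ShmemVariableCache->oldestCatalogXmin;
    LWLockRelease(ProcArrayLock);

    if (!TransactionIdIsValid(shmem_catalog_xmin) ||
        TransactionIdPrecedes(MyReplicationSlot->data.catalog_xmin,
                              shmem_catalog_xmin))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("replication slot \"%s\" requires catalog tuples that may already have been removed",
                        NameStr(MyReplicationSlot->data.name))));
}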

>> diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
>> index cfc3fba..cdc5f95 100644
>> --- a/src/backend/replication/walsender.c
>> +++ b/src/backend/replication/walsender.c
>> @@ -1658,6 +1658,11 @@ PhysicalConfirmReceivedLocation(XLogRecPtr lsn)
>>        * be energy wasted - the worst lost information can do here is give us
>>        * wrong information in a statistics view - we'll just potentially be more
>>        * conservative in removing files.
>> +      *
>> +      * We don't have to do any effective_xmin / effective_catalog_xmin testing
>> +      * here either, like for LogicalConfirmReceivedLocation. If we received
>> +      * the xmin and catalog_xmin from downstream replication slots we know they
>> +      * were already confirmed there,
>>        */
>>  }
>
> This comment reads as if LogicalConfirmReceivedLocation had
> justification for not touching / checking catalog_xmin. But it does.

It touches it; what it doesn't do is test and only advance if the new
value is greater, like for xmin as referenced in the prior paragraph. Will
clarify.

>>       /*
>> +      * Update our knowledge of the oldest xid we can safely create historic
>> +      * snapshots for.
>> +      *
>> +      * There can be no concurrent writers to oldestCatalogXmin during
>> +      * recovery, so no need to take ProcArrayLock.
>
> By now I think this is pretty flawed logic, because there can be concurrent
> readers, that need to be safe against oldestCatalogXmin advancing
> concurrently.

You're right, we'll need a lock or suitable barriers here to ensure
that slot conflict with recovery and startup of new decoding sessions
doesn't see outdated values. This would be the peer of the
pg_memory_barrier() above. Or could just take a lock; there's enough
other locking activity in redo that it should be fine.

>>       /*
>> -      * It's important *not* to include the limits set by slots here because
>> +      * It's important *not* to include the xmin set by slots here because
>>        * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>>        * were to be included here the initial value could never increase because
>>        * of a circular dependency where slots only increase their limits when
>>        * running xacts increases oldestRunningXid and running xacts only
>>        * increases if slots do.
>> +      *
>> +      * We can include the catalog_xmin limit here; there's no similar
>> +      * circularity, and we need it to log xl_running_xacts records for
>> +      * standbys.
>>        */
>
> Those comments seem to need some more heavyhanded reconciliation.

OK. To me it seems clear that the first refers to xmin, the second to
catalog_xmin. But after all I wrote it, and the important thing is
what it says to people who are not me. Will adjust.

>>   *
>>   * Return the current slot xmin limits. That's useful to be able to remove
>>   * data that's older than those limits.
>> + *
>> + * For logical replication slots' catalog_xmin, we return both the effective
>
> This seems to need some light editing.

catalog_xmin => catalog_xmins I guess.

>>  /*
>>   * Record an enhanced snapshot of running transactions into WAL.
>>   *
>> + * We also record the value of procArray->replication_slot_catalog_xmin
>> + * obtained from GetRunningTransactionData here, so standbys know we're about
>> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
>> + * removing dead catalog tuples below that threshold.
>
>> I think this needs some rephrasing. We're not necessarily about to remove
> catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.

Agreed

>> +static void
>> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
>> +{
>> +     LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>> +     if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
>> +             || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) !=
>> +             TransactionIdIsValid(pendingOldestCatalogXmin)))
>> +             ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
>> +     LWLockRelease(ProcArrayLock);
>> +}
>
> Doing TransactionIdPrecedes before ensuring
> ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
> me as a good idea.  Generally, the expression as it stands is hard to
> understand.

OK.

I found other formulations to be long and hard to read. Expressing it
as "if validity has changed or value has increased" made more sense.
Agree order should change.

>> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
>> index d25a2dd..a4ecfb7 100644
>> --- a/src/include/access/transam.h
>> +++ b/src/include/access/transam.h
>> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>>                                                                                * aborted */
>>
>>       /*
>> +      * This field is protected by ProcArrayLock except
>> +      * during recovery, when it's set unlocked.
>> +      *
>> +      * oldestCatalogXmin is the oldest xid it is
>> +      * guaranteed to be safe to create a historic
>> +      * snapshot for. See also
>> +      * procArray->replication_slot_catalog_xmin
>> +      */
>> +     TransactionId oldestCatalogXmin;
>
> Maybe it'd be better to rephrase that to something like
> "oldestCatalogXmin guarantees that no valid catalog tuples >= than it
> are removed. That property is used for logical decoding.". or similar?

Fine by me.

I'll adjust this per discussion and per a comment Simon made
separately. Whether we use it right away or not it's worth having it
updated while it's still freshly in mind.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 31 March 2017 at 12:49, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:

>> The comment and code don't quite square to me - it's far from obvious
>> that LogStandbySnapshot does something like that. I'd even say it's a
>> bad idea to have it do that.
>
> So you prefer the prior approach with separate xl_catalog_xmin advance records?

Alternately, we can record the creation timeline on slots, so we know
if there's been a promotion. It wouldn't make sense to do this if that
were the only use of timelines on slots. But I'm aware you'd rather
keep slots timeline-agnostic and I tend to agree.

Anyway, per your advice will split out the validation step.

(I'd like replication origins to be able to track time alongside lsn,
and to pass the timeline of each LSN to output plugin callbacks so we
can detect if a physical promotion results in us backtracking down a
fork in history, but that doesn't affect slots.)

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 31 March 2017 at 12:49, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 31 March 2017 at 01:16, Andres Freund <andres@anarazel.de> wrote:
>>> @@ -9633,6 +9643,12 @@ xlog_redo(XLogReaderState *record)
>>>               SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
>>>
>>>               /*
>>> +              * There can be no concurrent writers to oldestCatalogXmin during
>>> +              * recovery, so no need to take ProcArrayLock.
>>> +              */
>>> +             ShmemVariableCache->oldestCatalogXmin = checkPoint.oldestCatalogXmin;
>>
>> s/writers/writes/?
>
> I meant writers, i.e. nothing else can be writing to it. But writes works too.
>

Fixed.

>>> @@ -276,8 +279,21 @@ CreateInitDecodingContext(char *plugin,
>>>
>>>       ReplicationSlotsComputeRequiredXmin(true);
>>>
>>> +     /*
>>> +      * If this is the first slot created on the master we won't have a
>>> +      * persistent record of the oldest safe xid for historic snapshots yet.
>>> +      * Force one to be recorded so that when we go to replay from this slot we
>>> +      * know it's safe.
>>> +      */
>>> +     force_standby_snapshot =
>>> +             !TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin);
>>
>> s/first slot/first logical slot/. Also, the reference to master doesn't
>> seem right.
>
> Unsure what you mean re reference to master not seeming right.
>
> If oldestCatalogXmin is 0 we'll ERROR when trying to start decoding
> from the new slot so we need to make sure it gets advanced once we've
> decided on our starting catalog_xmin.

Moved to next patch, will address there.

>>>       LWLockRelease(ProcArrayLock);
>>>
>>> +     /* Update ShmemVariableCache->oldestCatalogXmin */
>>> +     if (force_standby_snapshot)
>>> +             LogStandbySnapshot();
>>
>> The comment and code don't quite square to me - it's far from obvious
>> that LogStandbySnapshot does something like that. I'd even say it's a
>> bad idea to have it do that.
>
> So you prefer the prior approach with separate xl_catalog_xmin advance records?
>
> I don't have much preference; I liked the small code reduction of
> Simon's proposed approach, but it landed up being a bit awkward in
> terms of ordering and locking. I don't think catalog_xmin tracking is
> really closely related to the standby snapshot stuff and it feels a
> bit like it's a tacked-on afterthought where it is now.

This code moved to next patch. But we do need to agree on the best approach.

If we're not going to force a standby snapshot here, then it's
probably better to use the separate xl_catalog_xmin design instead.

>>>       /*
>>>        * tell the snapshot builder to only assemble snapshot once reaching the
>>>        * running_xact's record with the respective xmin.
>>> @@ -376,6 +392,8 @@ CreateDecodingContext(XLogRecPtr start_lsn,
>>>               start_lsn = slot->data.confirmed_flush;
>>>       }
>>>
>>> +     EnsureActiveLogicalSlotValid();
>>
>> This seems like it should be in a separate patch, and separately
>> reviewed. It's code that's currently unreachable (and thus untestable).
>
> It is reached and is run, those checks run whenever creating a
> non-initial decoding context on master or replica.

Again, moved to next patch.

>>>       /*
>>> +      * Update our knowledge of the oldest xid we can safely create historic
>>> +      * snapshots for.
>>> +      *
>>> +      * There can be no concurrent writers to oldestCatalogXmin during
>>> +      * recovery, so no need to take ProcArrayLock.
>>
>> By now I think this is pretty flawed logic, because there can be concurrent
>> readers, that need to be safe against oldestCatalogXmin advancing
>> concurrently.
>
> You're right, we'll need a lock or suitable barriers here to ensure
> that slot conflict with recovery and startup of new decoding sessions
> doesn't see outdated values. This would be the peer of the
> pg_memory_barrier() above. Or could just take a lock; there's enough
> other locking activity in redo that it should be fine.

Now takes ProcArrayLock briefly.

oldestCatalogXmin is also used in GetOldestSafeDecodingTransactionId,
and there we want to prevent it from being advanced. But on further
thought, relying on oldestCatalogXmin there is actually unsafe; on the
master, we might've already logged our intent to advance it to some
greater value of procArray->replication_slot_catalog_xmin and will do
so as ProcArrayLock is released. On standby we're also better off
relying on procArray->replication_slot_catalog_xmin since that's what
we'll be sending in feedback.

Went back to using replication_slot_catalog_xmin here, with comment

     *
     * We don't use ShmemVariableCache->oldestCatalogXmin here because another
     * backend may have already logged its intention to advance it to a higher
     * value (still <= replication_slot_catalog_xmin) and just be waiting on
     * ProcArrayLock to actually apply the change. On a standby
     * replication_slot_catalog_xmin is what the walreceiver will be sending in
     * hot_standby_feedback, not oldestCatalogXmin.
     */


>>>       /*
>>> -      * It's important *not* to include the limits set by slots here because
>>> +      * It's important *not* to include the xmin set by slots here because
>>>        * snapbuild.c uses oldestRunningXid to manage its xmin horizon. If those
>>>        * were to be included here the initial value could never increase because
>>>        * of a circular dependency where slots only increase their limits when
>>>        * running xacts increases oldestRunningXid and running xacts only
>>>        * increases if slots do.
>>> +      *
>>> +      * We can include the catalog_xmin limit here; there's no similar
>>> +      * circularity, and we need it to log xl_running_xacts records for
>>> +      * standbys.
>>>        */
>>
>> Those comments seem to need some more heavyhanded reconciliation.
>
> OK. To me it seems clear that the first refers to xmin, the second to
> catalog_xmin. But after all I wrote it, and the important thing is
> what it says to people who are not me. Will adjust.

Changed to

     * We can safely report the catalog_xmin limit for replication slots here
     * because it's only used to advance oldestCatalogXmin. Slots'
     * catalog_xmin advance does not depend on it so there's no circularity.



>
>>>   *
>>>   * Return the current slot xmin limits. That's useful to be able to remove
>>>   * data that's older than those limits.
>>> + *
>>> + * For logical replication slots' catalog_xmin, we return both the effective
>>
>> This seems to need some light editing.
>
> catalog_xmin => catalog_xmins I guess.

Amended.

>>>  /*
>>>   * Record an enhanced snapshot of running transactions into WAL.
>>>   *
>>> + * We also record the value of procArray->replication_slot_catalog_xmin
>>> + * obtained from GetRunningTransactionData here, so standbys know we're about
>>> + * to advance ShmemVariableCache->oldestCatalogXmin to its value and start
>>> + * removing dead catalog tuples below that threshold.
>>
>> I think this needs some rephrasing. We're not necessarily about to remove
>> catalog tuples here, nor are we necessarily advancing oldestCatalogXmin.
>
> Agreed

 * We also record the value of procArray->replication_slot_catalog_xmin
 * obtained from GetRunningTransactionData here. We intend to advance
 * ShmemVariableCache->oldestCatalogXmin to it once standbys have been informed
 * of the new value, which will permit removal of previously-protected dead
 * catalog tuples. The standby needs to know about that before any WAL
 * records containing such tuple removals could possibly arrive.


>>> +static void
>>> +UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
>>> +{
>>> +     LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>>> +     if (TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin, pendingOldestCatalogXmin)
>>> +             || (TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) !=
>>> +             TransactionIdIsValid(pendingOldestCatalogXmin)))
>>> +             ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
>>> +     LWLockRelease(ProcArrayLock);
>>> +}
>>
>> Doing TransactionIdPrecedes before ensuring
>> ShmemVariableCache->oldestCatalogXmin is actually valid doesn't strike
>> me as a good idea.  Generally, the expression as it stands is hard to
>> understand.
>
> OK.
>
> I found other formulations to be long and hard to read. Expressing it
> as "if validity has changed or value has increased" made more sense.
> Agree order should change.

Re-ordered, otherwise left the same.

Could add a comment like

"we must set oldestCatalogXmin if its validity has changed or it is advancing"

but seems rather redundant to the code.
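
For reference, the re-ordered test now reads roughly like this (sketch
only, not the exact patch text):

static void
UpdateOldestCatalogXmin(TransactionId pendingOldestCatalogXmin)
{
    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    /* set if validity changed, or if it advanced while staying valid */
    if ((TransactionIdIsValid(ShmemVariableCache->oldestCatalogXmin) !=
         TransactionIdIsValid(pendingOldestCatalogXmin))
        || TransactionIdPrecedes(ShmemVariableCache->oldestCatalogXmin,
                                 pendingOldestCatalogXmin))
        ShmemVariableCache->oldestCatalogXmin = pendingOldestCatalogXmin;
    LWLockRelease(ProcArrayLock);
}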


>>> diff --git a/src/include/access/transam.h b/src/include/access/transam.h
>>> index d25a2dd..a4ecfb7 100644
>>> --- a/src/include/access/transam.h
>>> +++ b/src/include/access/transam.h
>>> @@ -136,6 +136,17 @@ typedef struct VariableCacheData
>>>                                                                                * aborted */
>>>
>>>       /*
>>> +      * This field is protected by ProcArrayLock except
>>> +      * during recovery, when it's set unlocked.
>>> +      *
>>> +      * oldestCatalogXmin is the oldest xid it is
>>> +      * guaranteed to be safe to create a historic
>>> +      * snapshot for. See also
>>> +      * procArray->replication_slot_catalog_xmin
>>> +      */
>>> +     TransactionId oldestCatalogXmin;
>>
>> Maybe it'd be better to rephrase that to something like
>> "oldestCatalogXmin guarantees that no valid catalog tuples >= than it
>> are removed. That property is used for logical decoding.". or similar?
>
> Fine by me.
>
> I'll adjust this per discussion and per a comment Simon made
> separately. Whether we use it right away or not it's worth having it
> updated while it's still freshly in mind.

OK, updated catalog_xmin logging patch attached.

Important fix included: when faking up a RunningTransactions snapshot
in StartupXLOG for replay of shutdown checkpoints, copy the
checkpoint's oldestCatalogXmin so we apply it instead of clobbering
the replica's value. It's kind of roundabout to set this once when we
apply the checkpoint and again via ProcArrayApplyRecoveryInfo, but
it's necessary if we're using xl_running_xacts to carry
oldestCatalogXmin info.
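
In outline the fake-up then looks like the existing StartupXLOG code with
one extra assignment (sketch; oldestCatalogXmin is the field this series
adds, the other names are from the surrounding xlog.c code):

RunningTransactionsData running;
TransactionId latestCompletedXid;

running.xcnt = nxids;
running.subxcnt = 0;
running.subxid_overflow = false;
running.nextXid = checkPoint.nextXid;
running.oldestRunningXid = oldestActiveXID;
latestCompletedXid = checkPoint.nextXid;
TransactionIdRetreat(latestCompletedXid);
running.latestCompletedXid = latestCompletedXid;
running.oldestCatalogXmin = checkPoint.oldestCatalogXmin;   /* the fix */
running.xids = xids;

ProcArrayApplyRecoveryInfo(&running);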

Found another issue too. We log our intention to increase
oldestCatalogXmin in LogStandbySnapshot when we write
xl_running_xacts. We then release ProcArrayLock to re-acquire it
LW_EXCLUSIVE so we can increment oldestCatalogXmin in shmem. But a
checkpoint runs and copies the old oldestCatalogXmin value after we
wrote xlog but before we updated in shmem. On the standby, redo will
apply the new value then clobber it with the old one.

To fix this, take CheckpointLock in LogStandbySnapshot (if not called
during a checkpoint) so we can't have the xl_running_xacts with the
new oldestCatalogXmin land up in WAL before a checkpoint with an older
value. Also take oldestCatalogXmin's value after we've forced
LogStandbySnapshot in a checkpoint.

Extended tests a bit to cover redo on standbys.

Personally I'm not a huge fan of how integrating this with logging
standby snapshots has turned out. It seemed to make sense initially,
but I think the way it works out is more convoluted than necessary for
little benefit. I'll prep an updated version of the
xl_advance_catalog_xmin patch with the same fixes for side by side
comparison.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Craig Ringer
Date:
On 3 April 2017 at 13:46, Craig Ringer <craig@2ndquadrant.com> wrote:

> OK, updated catalog_xmin logging patch attached.

Ahem, that should be v5.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Craig Ringer
Date:
On 3 April 2017 at 15:27, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 3 April 2017 at 13:46, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> OK, updated catalog_xmin logging patch attached.
>
> Ahem, that should be v5.

... and here's v6, which returns to the separate
xl_xact_catalog_xmin_advance approach.

pgindented.

This is what I favour proceeding with.
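
For anyone comparing the two designs: the dedicated record is about as
small as WAL records get, roughly (layout illustrative only):

typedef struct xl_xact_catalog_xmin_advance
{
    TransactionId catalog_xmin;
} xl_xact_catalog_xmin_advance;

so the xlog-volume argument from earlier in the thread is unchanged.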

Now updating/amending recovery conflict patch.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Craig Ringer
Date:
Hi all

Here's the final set of three patches on top of what's already committed.

The first is catalog_xmin logging, which is unchanged from the prior post.

The 2nd is support for conflict with recovery, with changes that
should address Andres's concerns there.

The 3rd actually enables decoding on standby. Unlike the prior
version, no attempt is made to check the walsender configuration for
slot use, etc. The ugly code to try to mitigate races is also removed.
Instead, if wal_level is logical the catalog_xmin sent by
hot_standby_feedback is now the same as the xmin if there's no local
slot holding it down. So we're always sending a catalog_xmin in
feedback and we should always expect to have a valid local
oldestCatalogXmin once hot_standby_feedback kicks in. This makes the
race in slot creation no worse than the existing race between
hot_standby_feedback establishment and the first queries run on a
downstream, albeit with more annoying consequences. Apps can still
ensure a slot created on standby is guaranteed safe and conflict-free
by having a slot on the master first.
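
The feedback side boils down to something like this (simplified sketch;
the helper name and parameters are made up for illustration, they are not
the patch's function names):

static TransactionId
hs_feedback_catalog_xmin(TransactionId feedback_xmin,
                         TransactionId slot_catalog_xmin)
{
    /*
     * With wal_level = logical, always report a catalog_xmin upstream; if
     * no local slot is holding one down, fall back to the plain xmin so
     * the upstream always has a valid oldestCatalogXmin for this standby.
     */
    if (wal_level >= WAL_LEVEL_LOGICAL &&
        !TransactionIdIsValid(slot_catalog_xmin))
        return feedback_xmin;

    return slot_catalog_xmin;
}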

I'm much happier with this. I'm still fixing some issues in the tests
for 03 and tidying them up, but 03 should allow 01 and 02 to be
reviewed in their proper context now.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 4 April 2017 at 22:32, Craig Ringer <craig@2ndquadrant.com> wrote:
> Hi all
>
> Here's the final set of three patches on top of what's already committed.
>
> The first is catalog_xmin logging, which is unchanged from the prior post.
>
> The 2nd is support for conflict with recovery, with changes that
> should address Andres's concerns there.
>
> The 3rd actually enables decoding on standby. Unlike the prior
> version, no attempt is made to check the walsender configuration for
> slot use, etc. The ugly code to try to mitigate races is also removed.
> Instead, if wal_level is logical the catalog_xmin sent by
> hot_standby_feedback is now the same as the xmin if there's no local
> slot holding it down. So we're always sending a catalog_xmin in
> feedback and we should always expect to have a valid local
> oldestCatalogXmin once hot_standby_feedback kicks in. This makes the
> race in slot creation no worse than the existing race between
> hot_standby_feedback establishment and the first queries run on a
> downstream, albeit with more annoying consequences. Apps can still
> ensure a slot created on standby is guaranteed safe and conflict-free
> by having a slot on the master first.
>
> I'm much happier with this. I'm still fixing some issues in the tests
> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> reviewed in their proper context now.

Dammit. Attached.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Andres Freund
Date:
On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
> I'm much happier with this. I'm still fixing some issues in the tests
> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> reviewed in their proper context now.

To me this very clearly is too late for v10, and now should be moved to
the next CF.

- Andres



Re: Logical decoding on standby

From
Craig Ringer
Date:
On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
>> I'm much happier with this. I'm still fixing some issues in the tests
>> for 03 and tidying them up, but 03 should allow 01 and 02 to be
>> reviewed in their proper context now.
>
> To me this very clearly is too late for v10, and now should be moved to
> the next CF.

I tend to agree that it's late in the piece. It's still worth cleaning
it up into a state ready for early pg11 though.

I've just fixed an issue where hot_standby_feedback on a physical slot
could cause oldestCatalogXmin to go backwards. When the slot's
catalog_xmin was 0 and is being set for the first time the standby's
supplied catalog_xmin is trusted. To fix it, in
PhysicalReplicationSlotNewXmin when setting catalog_xmin from 0, clamp
the value to the master's GetOldestSafeDecodingTransactionId().
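
In sketch form (variable names are illustrative; the clamp is the point):

if (!TransactionIdIsValid(slot->data.catalog_xmin) &&
    TransactionIdIsValid(feedbackCatalogXmin))
{
    TransactionId safe_decoding_xid = GetOldestSafeDecodingTransactionId();

    /* never accept a first-time catalog_xmin older than what's still safe */
    if (TransactionIdPrecedes(feedbackCatalogXmin, safe_decoding_xid))
        feedbackCatalogXmin = safe_decoding_xid;
}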

Tests are cleaned up and fixed.

This series adds full support for logical decoding on a standby.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Logical decoding on standby

From
Andres Freund
Date:
On 2017-04-05 17:18:24 +0800, Craig Ringer wrote:
> On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
> >> I'm much happier with this. I'm still fixing some issues in the tests
> >> for 03 and tidying them up, but 03 should allow 01 and 02 to be
> >> reviewed in their proper context now.
> >
> > To me this very clearly is too late for v10, and now should be moved to
> > the next CF.
> 
> I tend to agree that it's late in the piece. It's still worth cleaning
> it up into a state ready for early pg11 though.

Totally agreed.

- Andres



Re: Logical decoding on standby

From
Robert Haas
Date:
On Wed, Apr 5, 2017 at 10:32 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-05 17:18:24 +0800, Craig Ringer wrote:
>> On 5 April 2017 at 04:19, Andres Freund <andres@anarazel.de> wrote:
>> > On 2017-04-04 22:32:40 +0800, Craig Ringer wrote:
>> >> I'm much happier with this. I'm still fixing some issues in the tests
>> >> for 03 and tidying them up, but 03 should allow 01 and 02 to be
>> >> reviewed in their proper context now.
>> >
>> > To me this very clearly is too late for v10, and now should be moved to
>> > the next CF.
>>
>> I tend to agree that it's late in the piece. It's still worth cleaning
>> it up into a state ready for early pg11 though.
>
> Totally agreed.

Based on this exchange, marked as "Moved to next CF".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company